XLA (jit_compile flag) and GPU memory usage

We are observing unexplained out-of-GPU-memory events when trying to train a large, complex model (involving conditional execution) with XLA enabled (the jit_compile=True flag for tf.function).
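For reference, this is roughly how we enable compilation; a minimal sketch with a stand-in model, since the real training step is much more involved:

```python
import tensorflow as tf

# Stand-in model; the real one is much larger and involves conditional execution.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# jit_compile=True asks TensorFlow to compile the whole training step with XLA.
@tf.function(jit_compile=True)
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_fn(y, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```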

Unfortunately, we haven’t been able to reproduce the issue in a reduced shareable form just yet, so I am writing here mostly for feedback.

What we see:

  • In general, GPU memory usage for an XLA-compiled model goes down considerably compared to non-compiled graph mode or eager execution. This is what we measure in most of our models and in all our small test cases (see the measurement sketch below).
  • However, in some instances, large models exceed GPU memory capacity when compiled, while they can still run in eager mode with the exact same batch size.

Both behaviors seem contradictory, so we are wondering if there are known corner cases involving XLA that may produce this and that we can avoid (we really need the extra training efficiency that XLA provides).
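For completeness, this is roughly how we compare memory usage between the compiled and eager runs. It is only a sketch: it assumes a recent TF version (2.6+ for the memory-stats APIs), a single GPU visible as "GPU:0", and the train_step from the snippet above.

```python
import tensorflow as tf

# Dummy batch matching the stand-in model above.
x_batch = tf.random.normal([64, 32])
y_batch = tf.random.uniform([64], maxval=10, dtype=tf.int32)

# Reset the peak statistic, run one training step, then read back the peak.
tf.config.experimental.reset_memory_stats("GPU:0")
train_step(x_batch, y_batch)

info = tf.config.experimental.get_memory_info("GPU:0")
print("current bytes:", info["current"], "peak bytes:", info["peak"])
```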

Thanks!


Hi Ramon,

1 - I don’t know the answer, but have you watched this: How to make TensorFlow models run faster on GPUs - YouTube? You might get some insight there.

I’ll ping some people to get you more insights (no promises)

Hi Ramon,

Yes, in general XLA should make things faster, but there are no guarantees. Bug reports for cases where it makes things a lot slower are welcome.

George


Thanks, Gus! The video was very informative; I had not seen this one. However, it did not help explain the memory usage increase we are observing. Looking forward to hearing about others’ experiences. Thanks!

Hi George,

In this case, performance does not seem to be the issue; the problem is increased GPU memory usage. Is there a list of known problems that we may be hitting? Or perhaps best practices in terms of operators? I would love to file a bug/issue on this, but as I mentioned in the original post, so far we have been unable to isolate the behavior in a shareable form.

Thanks again!
Ramon.

It’s hard to say at the TF level.

You can dump the buffer assignment, or maybe visualize the HLO graph, to try to see where the memory is going, and then go from there.
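For example, something along these lines should produce the dump files; the exact flag names can change between XLA versions, so treat them as a starting point and double-check against the XLA docs:

```python
import os

# XLA reads these flags at initialization time, so set them before importing
# TensorFlow.
os.environ["XLA_FLAGS"] = (
    "--xla_dump_to=/tmp/xla_dump "  # directory for the dump files
    "--xla_dump_hlo_as_text "       # optimized HLO modules as text
    "--xla_dump_hlo_as_dot"         # GraphViz .dot files for visualization
)

import tensorflow as tf

# ... build the model and run the jit_compile=True step once; the dump files
# for each compiled cluster will then appear under /tmp/xla_dump.
```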


Hi George,

Is there a resource (website, doc, etc.) showing the steps to visualize the HLO graph? I’m unable to find anything comprehensive except for a couple of GraphViz issue posts.

There is an example of using the tool at:
