XLA JIT Compile causing Memory Leak

Hi,

I’m currently working with TensorFlow 2.13.0 and CUDA 11.8.

I’ve noticed that when I enable XLA, i.e. by passing jit_compile=True to my Keras model’s compile() method, I get a decent speed-up and GPU utilisation is much closer to 100%. However, after a variable number of epochs my training is killed because system memory usage has been exceeded.
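For reference, this is roughly how I’m enabling it (the real model is proprietary, so the layers here are just placeholders):

```python
import tensorflow as tf

# Placeholder model; the actual architecture is proprietary.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10),
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,  # ask Keras to compile the step functions with XLA
)
```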

This seems to be caused by a memory leak, which may be related to the following warning printed during compilation:

2024-02-14 09:15:25.535561: I tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:328] ptxas warning : Registers are spilled to local memory in function 'fusion_92', 16 bytes spill stores, 16 bytes spill loads

The model code that replicates this error is unfortunately proprietary, so before I invest time and effort in producing a repro in, e.g., Colab (a sketch of what I have in mind is below), I was wondering whether:
A) this is a known issue with an existing repro, and someone can link me to a GitHub issue I can track; or
B) there is already a fix, and e.g. upgrading my TF and/or CUDA version would resolve it?
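For context, a standalone repro would look something like this (synthetic data and a placeholder model; the RSS-logging callback is just one way to watch system memory grow across epochs):

```python
import resource

import numpy as np
import tensorflow as tf

# Synthetic stand-in data; the real training data is proprietary.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 10, size=(1024,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,
)

class RSSLogger(tf.keras.callbacks.Callback):
    """Logs peak resident set size after each epoch.

    On Linux, ru_maxrss is reported in KiB; if the leak reproduces,
    this number climbs steadily until the process is killed.
    """
    def on_epoch_end(self, epoch, logs=None):
        peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"epoch {epoch}: peak RSS ~{peak_kib / 1024:.0f} MiB")

model.fit(x, y, batch_size=64, epochs=100, callbacks=[RSSLogger()])
```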

Thanks in advance,
Liam


I had the same issue. This is a known bug: Use of Keras `jit_compile` in a distribution strategy causes a `std::system_error` · Issue #56423 · tensorflow/tensorflow · GitHub
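For what it’s worth, what triggers it in the linked issue is combining jit_compile=True with a tf.distribute strategy scope, roughly like this (placeholder model, just to show the configuration):

```python
import tensorflow as tf

# The problematic combination per the issue: jit_compile under a
# distribution strategy scope.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse", jit_compile=True)
```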