Weird automatic stopping behavior in TF 2.11 GPU

While training a model with TF 2.11, training stops periodically without any stack trace; only the message ‘Killed’ is written to the console. On a Tesla T4 GPU with 16 GB of GPU memory, training stops at epoch 8, while on an A10G GPU with 24 GB of GPU memory, it stops at epoch 11. If training is then resumed from epoch 11, it stops again at epoch 21, and again at epoch 31. I would like to know whether anyone else has observed similar behavior, or any other memory-leak-related issues, with this version of TF or Keras.
The package versions are as follows:
tensorflow 2.11.0 cuda112py39h01bd6f0_0 conda-forge
tensorflow-base 2.11.0 cuda112py39haa5674d_0 conda-forge
tensorflow-estimator 2.11.0 cuda112py39h11d7a3b_0 conda-forge
tensorflow-gpu 2.11.0 cuda112py39h0bbbad9_0 conda-forge
tensorflow-io 0.31.0 pypi_0 pypi
keras 2.11.0 pyhd8ed1ab_0 conda-forge
keras-preprocessing 1.1.2 pyhd8ed1ab_0 conda-forge
Any other ideas on how to root-cause this issue are also welcome.
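
A bare ‘Killed’ message with no traceback typically means the OS terminated the process (e.g. the Linux OOM killer) because host RAM, not GPU memory, ran out, so one way to narrow this down is to log host and GPU memory after every epoch. A minimal sketch, assuming psutil is installed (the callback name is illustrative, not from the original code):

# Minimal sketch: a Keras callback that logs host RSS and GPU memory after
# each epoch, to check whether memory grows steadily before the process is killed.
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        rss_gb = psutil.Process().memory_info().rss / 1e9
        print(f"epoch {epoch}: host RSS = {rss_gb:.2f} GB")
        try:
            info = tf.config.experimental.get_memory_info("GPU:0")
            print(f"epoch {epoch}: GPU current = {info['current'] / 1e9:.2f} GB, "
                  f"peak = {info['peak'] / 1e9:.2f} GB")
        except Exception:
            pass  # GPU memory info is not available on every setup

# Usage (illustrative): model.fit(x, y, epochs=50, callbacks=[MemoryLogger()])

If the host RSS climbs epoch over epoch, the problem is a host-side memory leak rather than GPU memory exhaustion, which would also explain why the larger-memory GPU only delays the kill by a few epochs.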

Hi @shankar_B

Welcome to the TensorFlow Forum!

How did you install TensorFlow on your system? Please refer to the official TF install link to install TensorFlow for your OS. Also, please verify that the CUDA, cuDNN, and Python versions installed on your system are compatible with the installed TensorFlow version by checking the TF tested build configurations.
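
A quick way to cross-check this (a sketch, not from the original thread) is to print the CUDA and cuDNN versions your TensorFlow build expects and confirm that the GPU is visible:

import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# Build info reports the CUDA/cuDNN versions this TF build was compiled against;
# compare them with the tested build configurations page.
build = tf.sysconfig.get_build_info()
print("CUDA version:", build.get("cuda_version"))
print("cuDNN version:", build.get("cudnn_version"))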

Let us know if the issue still persists. Thank you.