CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)

Dear TensorFlow Community,

My model trains fine on the GPU with a dataset containing 25 hours of audio. However, when I use a 200-hour audio dataset, I encounter the following error:

E tensorflow/compiler/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:306] gpu_async_0 cuMemAllocAsync failed to allocate 957465616 bytes: CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)

I am currently using TensorFlow 2.14.0, Python 3.10, CUDA 11.8, cuDNN 8.7.0.84, NVIDIA driver 535.129.03, and Ubuntu 22.04.3.

GPU: NVIDIA GeForce RTX 4090 | Persistence-M: Off | Bus-Id: 00000000:01:00.0 | Disp.A: On | Volatile Uncorr. ECC: Off
Fan: 42% | Temp: 68°C | Perf: P2 | Power: 328W / 450W | Memory-Usage: 16491MiB / 24564MiB | GPU-Util: 100% | Compute M.: Default

Processes:
GPU  GI   CI   PID    Type  Process name                    GPU Memory
0    N/A  N/A  2158   G     /usr/lib/xorg/Xorg              392MiB
0    N/A  N/A  2333   G     /usr/bin/gnome-shell            62MiB
0    N/A  N/A  3625   G     /usr/lib/firefox/firefox        165MiB
0    N/A  N/A  5937   G     SpareRendererForSitePerProcess  112MiB
0    N/A  N/A  18689  G     /usr/bin/nvidia-settings        0MiB
0    N/A  N/A  20854  G     gnome-control-center            6MiB
0    N/A  N/A  29875  C     /usr/bin/python                 15718MiB

I have tried decreasing the complexity of the model (from 60M to 20M parameters) and reducing the batch size. When I reduced the batch size from 48 to 16, the error occurred a few minutes later. I suspect that something might be wrong with my tf.data.Dataset input pipeline. Here is a snippet of the dataset input pipeline:

# Set up parameters
batch_size = 32
num_epochs = 20

buffer_size = 1000

# Training dataset
train_dataset = tf.data.Dataset.from_tensor_slices(
  ( list(dataframe_training["wav_file_name"]), list(dataframe_training["transcription"]) )
)
train_dataset = (
  train_dataset.map(feature_extractor, num_parallel_calls=tf.data.AUTOTUNE)
  .shuffle(buffer_size)
  .padded_batch(batch_size)
  .prefetch(buffer_size=tf.data.AUTOTUNE)
)

# Validation dataset
validation_dataset = tf.data.Dataset.from_tensor_slices(
  ( list(dataframe_validation["wav_file_name"]), list(dataframe_validation["transcription"]) )
)
validation_dataset = (
  validation_dataset.map(feature_extractor, num_parallel_calls=tf.data.AUTOTUNE)
  .shuffle(buffer_size)
  .padded_batch(batch_size)
  .prefetch(buffer_size=tf.data.AUTOTUNE)
)
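
To see how large a single padded batch actually gets, one batch of the pipeline can be inspected before training, for example (a rough sketch, assuming feature_extractor returns a (features, label) pair):

# Rough sketch: print the shape/dtype of one padded batch to estimate how much
# memory a single batch occupies (assumes feature_extractor yields (features, label))
for features, labels in train_dataset.take(1):
    print("features:", features.shape, features.dtype)
    print("labels:", labels.shape, labels.dtype)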

Shouldn’t ‘train_dataset’ and ‘validation_dataset’ be copied to the GPU one batch at a time?

Hi @mar_ml, could you please try further decreasing the batch size to 8 or 4 and let us know whether you still face the same issue? Thank you.

Yes, I’m still encountering the same issue even when I reduce the batch size to 4. With a batch size of 4, GPU memory usage stays stable at 5618 / 24564 MiB for a few minutes, then rapidly increases to 12468 / 24564 MiB, and finally, approximately 10 minutes into model.fit(), I get the ‘Out-Of-Memory’ error. I have also experimented with setting the environment variables ‘TF_FORCE_GPU_ALLOW_GROWTH=true’ and ‘TF_CUDNN_RESET_RND_GEN_STATE=true’.

Epoch 1/10
1198/9855 [==>...........................] - ETA: 1:08:34 - loss: 186.3829
2023-11-08 14:06:39.784376: E tensorflow/compiler/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:306] gpu_async_0 cuMemAllocAsync failed to allocate 934502416 bytes: CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)
 Reported by CUDA: Free memory/Total memory: 121307136/25385107456
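
As far as I understand, setting ‘TF_FORCE_GPU_ALLOW_GROWTH=true’ should be roughly equivalent to enabling memory growth programmatically, i.e. something like this before any GPU work starts:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving (almost) all of it up front;
# this must run before the GPUs are initialized.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)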

Good luck finding a solution to TensorFlow Out-Of-Memory (OOM) issues. I’ve been trying to find one for over a year and have not found any viable solution path. I changed my data structure from NumPy arrays to TensorFlow dataset objects and reduced my batch size to one, but the OOM error still persists. No matter which batch size I choose, TensorFlow seems to gobble up all the GPU memory, as can be seen with nvidia-smi. Apparently, TensorFlow is very greedy by default, presumably for optimization purposes. At this point, I’m thinking about switching over to PyTorch.
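
For what it’s worth, my understanding is that TensorFlow by default maps most of the free GPU memory up front. It can at least be capped with a hard limit, something like the sketch below (the 8192 MiB value is an arbitrary example, and this must also run before the GPU is initialized):

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Cap TensorFlow's allocation on the first GPU at 8 GiB (arbitrary example value)
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=8192)],
    )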

The feature_extractor setup seems like the most likely culprit from what you have provided. Have you tried profiling to look for large tensor allocations?

Example:

from tensorflow.keras.callbacks import TensorBoard

# Write traces to logs/profile and profile batches 500-520 of the first epoch
log_dir = "logs/profile"
tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1, profile_batch='500,520')

model.fit(train_dataset, epochs=num_epochs, callbacks=[tensorboard_callback])
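
Once the profiled batches have run, the trace can be viewed with ‘tensorboard --logdir logs/profile’; the memory profile view under the Profile tab should show where the large allocations happen.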
