CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)

mar_ml · November 8, 2023, 4:09am

Dear TensorFlow Community,

My model trains fine on the GPU with a dataset containing 25 hours of audio. However, when I use a 200-hour audio dataset, I am encountering the following error:

E tensorflow/compiler/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:306] gpu_async_0 cuMemAllocAsync failed to allocate 957465616 bytes: CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)

I am currently using TensorFlow 2.14.0, Python 3.10, CUDA 11.8, CuDNN 8.7.0.84, Nvidia driver 535.129.03, Ubuntu 22.04.03.

GPU Name	Persistence-M	Bus-Id	Disp.A	Volatile Uncorr. ECC
NVIDIA GeForce RTX 4090	Off	00000000:01:00.0	On	Off
GPU FAN 42%	68°C	P2	328W / 450W	GPU Memory-Usage 16491MiB / 24564MiB	GPU-Util 100%	Default

GI	CI	PID	Type	Process name	GPU Memory
N/A	N/A	2158	G	/usr/lib/xorg/Xorg	392MiB
N/A	N/A	2333	G	/usr/bin/gnome-shell	62MiB
N/A	N/A	3625	G	/usr/lib/firefox/firefox	165MiB
N/A	N/A	5937	G	SpareRendererForSitePerProcess	112MiB
N/A	N/A	18689	G	/usr/bin/nvidia-settings	0MiB
N/A	N/A	20854	G	gnome-control-center	6MiB
N/A	N/A	29875	C	/usr/bin/python	15718MiB

I have tried decreasing the complexity of the model (from 60M to 20M parameters) and reducing the batch size. When I reduced the batch size from 48 to 16, the error still occurred, only a few minutes later. I suspect that something might be wrong with my tf.data.Dataset input pipeline. Here is a snippet of the dataset input pipeline:

# Set up parameters
batch_size = 32
num_epochs = 20

buffer_size = 1000

# Training dataset
train_dataset = tf.data.Dataset.from_tensor_slices(
    (list(dataframe_training["wav_file_name"]), list(dataframe_training["transcription"]))
)
train_dataset = (
    train_dataset.map(feature_extractor, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size)
    .padded_batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)

# Validation dataset
validation_dataset = tf.data.Dataset.from_tensor_slices(
    (list(dataframe_validation["wav_file_name"]), list(dataframe_validation["transcription"]))
)
validation_dataset = (
    validation_dataset.map(feature_extractor, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size)
    .padded_batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)

Are the ‘train_dataset’ and ‘validation_dataset’ not copied in batches to the GPU?
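
One thing worth keeping in mind about the pipeline above: `padded_batch` pads every element to the longest one in that batch, so a single unusually long recording inflates the whole batch. Here is a rough sketch of that arithmetic, assuming float32 features of shape (frames, feature_dim). The function name, the frame counts, and the feature dimension are made up for illustration and are not taken from the original `feature_extractor`.

```python
def padded_batch_bytes(frame_counts, feature_dim, bytes_per_value=4):
    """Estimate the size in bytes of one padded feature batch.

    frame_counts: number of frames of each utterance in the batch (hypothetical values).
    feature_dim: features per frame (hypothetical value).
    bytes_per_value: 4 for float32.
    """
    # Every utterance is padded up to the longest one in the batch.
    longest = max(frame_counts)
    return len(frame_counts) * longest * feature_dim * bytes_per_value


# A batch of similar-length utterances versus one containing a long outlier.
typical = padded_batch_bytes([800, 900, 1000, 950], feature_dim=80)
with_outlier = padded_batch_bytes([800, 900, 1000, 30000], feature_dim=80)
print(f"typical batch:      {typical} bytes")
print(f"batch with outlier: {with_outlier} bytes")
```

This only counts the input tensor. Activations inside the model generally grow with the padded length as well, so the real footprint of such a batch is larger.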

Kiran_Sai_Ramineni · November 8, 2023, 10:53am

Hi @mar_ml, could you please try decreasing the batch size further, to 8 or 4, and let us know whether you still face the same issue? Thank you.

mar_ml · November 8, 2023, 1:12pm

Yes, I’m still encountering the same issue even when I reduce the batch size to 4. When I test with a batch size of 4, the GPU memory starts stable at 5618 / 24564 MiB for a few minutes. Afterward, it rapidly increases to 12468 / 24564 MiB. Finally, approximately 10 minutes into model.fit(), I encounter an ‘Out-Of-Memory’ error. Additionally, I have experimented with setting environment variables ‘TF_FORCE_GPU_ALLOW_GROWTH=true’ and ‘TF_CUDNN_RESET_RND_GEN_STATE=true’.

Epoch 1/10
1198/9855 [==>...........................] - ETA: 1:08:34 - loss: 186.3829
2023-11-08 14:06:39.784376: E tensorflow/compiler/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:306] gpu_async_0 cuMemAllocAsync failed to allocate 934502416 bytes: CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)
 Reported by CUDA: Free memory/Total memory: 121307136/25385107456
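
For scale, the byte counts in the log above can be converted to MiB. A minimal sketch using only the numbers printed by CUDA:

```python
MIB = 2**20  # bytes per mebibyte


def to_mib(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / MIB


requested = 934502416   # size of the failed cuMemAllocAsync request
free = 121307136        # free memory reported by CUDA
total = 25385107456     # total memory reported by CUDA

print(f"requested: {to_mib(requested):.1f} MiB")
print(f"free:      {to_mib(free):.1f} MiB")
print(f"total:     {to_mib(total):.1f} MiB")
```

The failed request is roughly 891 MiB while only about 116 MiB was free, so the device was already almost full when this single allocation was attempted.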

Bill · November 10, 2023, 5:34pm

Good luck finding a solution to TensorFlow Out Of Memory (OOM) issues. I’ve been trying to find one for over a year and I don’t know of any viable solution paths. I changed my data structure from NumPy arrays to TensorFlow dataset objects and have also reduced my batch size to one, but the OOM error still persists. No matter which batch size I choose, TensorFlow seems to gobble up all the memory, as can be seen with nvidia-smi. Apparently, TensorFlow is very greedy, probably for optimization purposes. At this point, I’m thinking about switching over to PyTorch.

Daniel_Curtis · November 11, 2023, 12:30am

The feature_extractor setup seems like the most likely culprit from what you have provided. Have you tried profiling to look for large tensor allocations?

Example:

from tensorflow.keras.callbacks import TensorBoard

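log_dir = "logs/profile"
# profile_batch='500,520' profiles batches 500 through 520;
# histogram_freq=1 writes weight histograms every epoch.
tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1, profile_batch='500,520')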

model.fit(train_dataset, epochs=num_epochs, callbacks=[tensorboard_callback])
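
Alongside profiling, a cheap check is whether the larger dataset contains a few very long recordings, since those would produce very large padded batches. Below is a minimal sketch that reads durations from the WAV headers with the standard-library `wave` module. It assumes the files are uncompressed PCM WAV, and the helper names `wav_duration_seconds` and `longest_files` are hypothetical, not part of the original code.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def longest_files(paths, top_k=10):
    """Return up to top_k (duration_seconds, path) pairs, longest first."""
    durations = [(wav_duration_seconds(p), p) for p in paths]
    return sorted(durations, reverse=True)[:top_k]
```

It could be called as `longest_files(list(dataframe_training["wav_file_name"]))`, assuming that column holds readable file paths. If a handful of files turn out to be far longer than the rest, that would be worth ruling out before digging further.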