Running out of memory while performing model training

I am using .npy files. The masks are each 140 MB with dimensions (240, 240, 160, 4), and the images are 70 MB with dimensions (240, 240, 160, 1). I am using a data generator because the images are 3D and quite large, but I can’t figure out why the RAM is consumed so quickly.

Given my calculations (probably wrong) this is what every image consumes:
240 * 240 * 160 * 1 * 4 bytes = 36,864,000 bytes = ~35.16 MB
240 * 240 * 160 * 4 * 4 bytes = 147,456,000 bytes = ~140.63 MB

Here is my model (3D U-Net):

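import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import (Activation, BatchNormalization, Concatenate,
                                     Conv3D, Conv3DTranspose, Input, MaxPool3D)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Convolutional Block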
def conv_block(inputs, num_filters):
    x = Conv3D(num_filters, (3, 3, 3), padding = "same")(inputs)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)

    x = Conv3D(num_filters, (3, 3, 3), padding = "same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)

    return x

# Encoder Block
def encoder_block(inputs, num_filters):
    x = conv_block(inputs, num_filters)
    p = MaxPool3D((2, 2, 2), padding="same")(x)
    return x, p

# Decoder Block
def decoder_block(inputs, skip, num_filters):
    x = Conv3DTranspose(num_filters, (2, 2, 2), strides=2, padding="same")(inputs)
    x = Concatenate()([x, skip])
    x = conv_block(x, num_filters)
    return x


def unet(input_shape):
    inputs = Input(input_shape)

    s1, p1 = encoder_block(inputs, 16)
    s2, p2 = encoder_block(p1, 32)
    s3, p3 = encoder_block(p2, 64)
    s4, p4 = encoder_block(p3, 128)

    b1 = conv_block(p4, 256)

    d1 = decoder_block(b1, s4, 128)
    d2 = decoder_block(d1, s3, 64)
    d3 = decoder_block(d2, s2, 32)
    d4 = decoder_block(d3, s1, 16)

    outputs = Conv3D(4, 1, padding="same", activation="softmax")(d4)

    model = Model(inputs, outputs, name="UNET")
    return model

# Input shape has to be divisible by 2**4 = 16
# Original input shape is (240, 240, 155) but an interpolation was done
# After the interpolation, we divided the images in patches
test_input_shape = (240, 240, 160, 1)

test_model = unet(test_input_shape)
test_optimizer = Adam(learning_rate=0.0001)
test_model.compile(optimizer=test_optimizer, loss=dice_loss, metrics=metrics)

And here is the data generator:

def data_generator(nifti_files, mask_files):
    for nifti_file, mask_file in zip(nifti_files, mask_files):
        nifti_image = np.load(nifti_file)
        nifti_mask = np.load(mask_file)

        yield (nifti_image, nifti_mask)

# Create datasets
dataset = tf.data.Dataset.from_generator(
    lambda: data_generator(train_volumes, train_masks),
    output_signature=(
        tf.TensorSpec(shape=(240, 240, 160, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(240, 240, 160, 4), dtype=tf.float32),
    ),
)

dataset_val = tf.data.Dataset.from_generator(
    lambda: data_generator(val_volumes, val_masks),
    output_signature=(
        tf.TensorSpec(shape=(240, 240, 160, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(240, 240, 160, 4), dtype=tf.float32),
    ),
)

# Batch and prefetch
# (the prefetch argument is cut off in the post; tf.data.AUTOTUNE is assumed here)
dataset = dataset.batch(1).prefetch(tf.data.AUTOTUNE)
dataset_val = dataset_val.batch(1).prefetch(tf.data.AUTOTUNE)

Hi @matca,

Could you please try the suggestions below? They may help reduce memory consumption during training.

Your RAM consumption while training can be lowered by optimizing data loading, reducing model memory footprint, and adjusting processing strategies.

Data Loading:

  • Load data directly from disk using memory-mapped files.
  • Perform data augmentation on-the-fly within the generator.
  • Use smaller batches to reduce memory footprint per training step.
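
As a rough sketch of the memory-mapped loading idea above, assuming the .npy files are plain arrays written with np.save: np.load accepts mmap_mode="r", which maps the file instead of reading it all up front. The generator name mmap_data_generator is hypothetical. The cast to float32 is there because a file stored as float64 would otherwise be handed on at twice the size.

```python
import numpy as np

def mmap_data_generator(image_files, mask_files):
    """Yield one (image, mask) pair at a time as float32 arrays."""
    for image_file, mask_file in zip(image_files, mask_files):
        # mmap_mode="r" maps the file read-only; pages are read on access.
        image = np.load(image_file, mmap_mode="r")
        mask = np.load(mask_file, mmap_mode="r")
        # Convert to plain float32 arrays; this copies only if the dtype differs.
        yield (np.asarray(image, dtype=np.float32),
               np.asarray(mask, dtype=np.float32))
```

It can be passed to tf.data.Dataset.from_generator in place of the original data_generator, with the same output_signature.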

Model Optimization:

  • Reduce model memory usage with mixed precision training.
  • Manually trigger garbage collection to release unused memory.

Processing Strategies:

  • Lazy load data only when needed in the generator.
  • Use TensorBoard Profiler to identify memory-intensive operations.

I hope this will help you!


Running out of memory during model training is a common issue with large 3D images, because storing and processing them is memory-hungry. The per-sample footprint is only the starting point: a float32 volume of shape (240, 240, 160, 1) holds 9,216,000 values, about 35 MiB, and its four-channel mask is about 141 MiB. During training, TensorFlow also keeps intermediate activations, gradients, and optimizer state for each layer of the model, and for a 3D U-Net at full resolution these are far larger than the inputs themselves.
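
To put rough numbers on the activation cost, here is a back-of-the-envelope sketch in plain Python. The helper name tensor_mib is made up for illustration, and the estimate assumes float32 tensors and that every layer output in the first conv_block is kept for backpropagation, which may not match what TensorFlow actually retains.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size of a dense tensor in MiB (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / 2**20

image_mib = tensor_mib((240, 240, 160, 1))          # one float32 input volume
mask_mib = tensor_mib((240, 240, 160, 4))           # one float32 four-channel mask
feature_map_mib = tensor_mib((240, 240, 160, 16))   # one 16-filter output at full resolution

# The first encoder conv_block produces 6 outputs of this size
# (Conv3D, BatchNormalization, Activation, twice).
# Assumption: all six are kept in memory for the backward pass.
level1_encoder_mib = 6 * feature_map_mib

print(f"image: {image_mib:.1f} MiB, mask: {mask_mib:.1f} MiB")
print(f"one level-1 feature map: {feature_map_mib:.1f} MiB")
print(f"level-1 encoder block: {level1_encoder_mib / 1024:.2f} GiB")
```

Under these assumptions the first encoder level alone needs a few GiB per sample, and the matching decoder level adds a similar amount, so the model's activations dominate rather than the input files.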

Here are some strategies to mitigate the memory issue:

  1. Batch Size: Ensure you are using a small batch size in your data generator. Even a single image and its corresponding mask consume a significant amount of memory, so you’ll want to keep the batch size as small as feasible.
  2. Patch Training: Instead of feeding the entire 3D image into the network, consider dividing the image into smaller 3D patches. This can drastically reduce memory consumption. You’ll need to adjust your data generator to yield these patches instead of the whole image.
  3. Model Complexity: 3D U-Net is inherently memory-intensive due to its depth and the 3D convolutions. Consider reducing the number of filters in each layer or simplifying the architecture if possible.
  4. Gradient Accumulation: If a small batch size hurts convergence, consider gradient accumulation. It emulates a larger batch without the extra memory by running a forward and backward pass on each small batch, summing the gradients, and applying a single optimizer update after several such passes.
  5. Mixed Precision Training: Using mixed precision can reduce memory usage significantly by utilizing float16 for certain computations and storage, while keeping critical parts of the model in float32 to maintain accuracy.
  6. Memory Profiling: Tools like TensorFlow’s Profiler can help identify where memory bottlenecks are occurring, guiding you on what parts of your model or data pipeline might need optimization.
  7. Data Loading and Augmentation: Ensure your data loading and augmentation steps are efficient and don’t unnecessarily duplicate data in memory. Building the pipeline with TensorFlow’s tf.data API and a small prefetch buffer helps keep only a few samples resident at a time.
  8. Hardware Considerations: If possible, training on a machine with more RAM or using a distributed training approach across multiple machines could alleviate memory constraints.
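
The patch training idea from item 2 could look roughly like this. It is a sketch, assuming channels-last arrays like the ones in the question. The function name random_patch and the default patch size of (96, 96, 96) are illustrative choices; each patch dimension only needs to be divisible by 16 to fit the four pooling stages.

```python
import numpy as np

def random_patch(image, mask, patch_shape=(96, 96, 96), rng=None):
    """Crop the same random 3D region from an image and its mask.

    Both arrays are expected as (depth, height, width, channels);
    the channel axis is left untouched.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Pick a valid start index along each of the three spatial axes.
    starts = [int(rng.integers(0, dim - size + 1))
              for dim, size in zip(image.shape[:3], patch_shape)]
    region = tuple(slice(start, start + size)
                   for start, size in zip(starts, patch_shape))
    return image[region], mask[region]
```

A 96 x 96 x 96 patch has roughly a tenth of the voxels of a 240 x 240 x 160 volume, and every activation in the network shrinks by the same factor. The model would need to be built with the patch shape, for example unet((96, 96, 96, 1)), and the TensorSpec shapes changed to match.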

By implementing these strategies, you should be able to reduce memory consumption and train your model more effectively. It’s often a process of trial and error to find the right balance of model complexity, batch size, and training efficiency for your specific dataset and hardware setup.
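
To make the gradient accumulation suggestion concrete, here is a minimal sketch of a custom training step. It is not a drop-in replacement for model.fit. The function name accumulated_train_step and the micro_batches argument (a list of (x, y) pairs) are hypothetical.

```python
import tensorflow as tf

def accumulated_train_step(model, optimizer, loss_fn, micro_batches):
    """Run one optimizer update from gradients averaged over several micro-batches."""
    accumulated = None
    n = len(micro_batches)
    for x, y in micro_batches:
        # Forward and backward pass on one small batch at a time,
        # so only its activations are in memory.
        with tf.GradientTape() as tape:
            # Divide by n so the summed gradients equal the average.
            loss = loss_fn(y, model(x, training=True)) / n
        grads = tape.gradient(loss, model.trainable_variables)
        if accumulated is None:
            accumulated = grads
        else:
            accumulated = [a + g for a, g in zip(accumulated, grads)]
    # Single weight update after all micro-batches have been processed.
    optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
    return accumulated
```

Note that this only changes how often the weights are updated. Layers such as BatchNormalization still compute their statistics per micro-batch, so with a batch size of 1 they behave the same as before.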