"[4,240,240,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc" while training

I am training a model with a 3D U-Net architecture on Google Colab, with 51 GB of system RAM and an Nvidia V100 GPU with 16 GB of memory.

The dataset consists of MRI images in NIfTI format, with shape (1, 240, 240, 160, 1). They are grayscale, and I use a batch size of 1 to see if that solves the problem (it doesn't).
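As a rough back-of-envelope estimate (my own, so take it with a grain of salt): the first convolution block of the U-Net below already produces a 240 × 240 × 160 × 64 activation tensor, which is about 590 million floats, roughly 2.4 GB in float32 per sample, and two such tensors are kept per block plus everything needed for backpropagation. So even with a batch size of 1 the 16 GB of the V100 fill up quickly.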

Here is my code:

import os
import numpy as np
import nibabel as nib
import tensorflow as tf

def load_nifti_image(filepath):
    nifti = nib.load(filepath)
    volume = nifti.get_fdata()
    return volume

# -----------------------TRAIN-----------------------
nifti_files = [os.path.join("/content/drive/MyDrive/Interpolated/train/images", f) for f in os.listdir("/content/drive/MyDrive/Interpolated/train/images") if f.endswith('.nii.gz')]
mask_files = [os.path.join("/content/drive/MyDrive/Interpolated/train/masks", f) for f in os.listdir("/content/drive/MyDrive/Interpolated/train/masks") if f.endswith('.nii.gz')]

nifti_images = [load_nifti_image(f) for f in nifti_files]
nifti_masks = [load_nifti_image(f) for f in mask_files]

final_nifti_images = [np.expand_dims(image, axis=-1) for image in nifti_images]
final_nifti_masks = [np.expand_dims(image, axis=-1) for image in nifti_masks]

dataset = tf.data.Dataset.from_tensor_slices((final_nifti_images, final_nifti_masks))

# -----------------------VALIDATION-----------------------
nifti_files_val = [os.path.join("/content/drive/MyDrive/Interpolated/validation/images", f) for f in os.listdir("/content/drive/MyDrive/Interpolated/validation/images") if f.endswith('.nii.gz')]
mask_files_val = [os.path.join("/content/drive/MyDrive/Interpolated/validation/masks", f) for f in os.listdir("/content/drive/MyDrive/Interpolated/validation/masks") if f.endswith('.nii.gz')]

nifti_images_val = [load_nifti_image(f) for f in nifti_files_val]
nifti_masks_val = [load_nifti_image(f) for f in mask_files_val]

final_nifti_images_val = [np.expand_dims(image, axis=-1) for image in nifti_images_val]
final_nifti_masks_val = [np.expand_dims(image, axis=-1) for image in nifti_masks_val]

dataset_val = tf.data.Dataset.from_tensor_slices((final_nifti_images_val, final_nifti_masks_val))

dataset = dataset.batch(1)
dataset_val = dataset_val.batch(1)

test_model.fit(dataset, validation_data=dataset_val, epochs=100)

I tried limiting the GPU memory to 14 GB, which did not work either, and I also tried reducing the batch size. Note that I cannot change the model, since I am using a predefined 3D U-Net; I will paste it here too in case it helps:


from tensorflow.keras.layers import (Input, Conv3D, Conv3DTranspose, MaxPool3D,
                                     BatchNormalization, Activation, Concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Convolutional Block
def conv_block(inputs, num_filters):
    x = Conv3D(num_filters, (3, 3, 3), padding = "same")(inputs)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)

    x = Conv3D(num_filters, (3, 3, 3), padding = "same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)

    return x

# Encoder Block
def encoder_block(inputs, num_filters):
    x = conv_block(inputs, num_filters)
    p = MaxPool3D((2, 2, 2), padding="same")(x)
    return x, p

# Decoder Block
def decoder_block(inputs, skip, num_filters):
    x = Conv3DTranspose(num_filters, (2, 2, 2), strides=2, padding="same")(inputs)
    x = Concatenate()([x, skip])
    x = conv_block(x, num_filters)
    return x

# UNET

def unet(input_shape):
    inputs = Input(input_shape)

    # ---- ENCODER ----
    s1, p1 = encoder_block(inputs, 64)
    s2, p2 = encoder_block(p1, 128)
    s3, p3 = encoder_block(p2, 256)
    s4, p4 = encoder_block(p3, 512)

    # ---- BRIDGE ----
    b1 = conv_block(p4, 1024)

    # ---- DECODER ----
    d1 = decoder_block(b1, s4, 512)
    d2 = decoder_block(d1, s3, 256)
    d3 = decoder_block(d2, s2, 128)
    d4 = decoder_block(d3, s1, 64)

    outputs = Conv3D(1, 1, padding="same", activation="sigmoid")(d4)

    model = Model(inputs, outputs, name="UNET")
    return model

input_shape = (240, 240, 160, 1)

test_model = unet(input_shape)
optimizer = Adam(learning_rate=0.0001)
test_model.compile(optimizer=optimizer, loss=dice_coefficient_loss, metrics=[dice_coefficient])
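(dice_coefficient_loss and dice_coefficient are not pasted above; they are the usual smooth/soft Dice metric and its complement, roughly along these lines:)

# Usual smooth/soft Dice metric and loss for binary segmentation
# (a sketch; my exact version may differ slightly in the smoothing constant)
def dice_coefficient(y_true, y_pred, smooth=1e-6):
    y_true_f = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred_f = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)

def dice_coefficient_loss(y_true, y_pred):
    return 1.0 - dice_coefficient(y_true, y_pred)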

@matca

It seems like you are loading the entire dataset into memory before creating the tf.data.Dataset. For large datasets, consider using the tf.data API with a generator function that loads and preprocesses data on the fly; this helps reduce memory usage.

Create a generator function that yields samples one at a time during training. This ensures that only the data for the current batch is loaded into memory, reducing memory consumption. Here’s how you can do that:

def data_generator(image_files, mask_files):
    for img_file, mask_file in zip(image_files, mask_files):
        # get_fdata() returns float64; cast to float32 to match the model and halve memory
        image = np.expand_dims(load_nifti_image(img_file), axis=-1).astype(np.float32)
        mask = np.expand_dims(load_nifti_image(mask_file), axis=-1).astype(np.float32)
        yield image, mask

train_generator = data_generator(nifti_files, mask_files)
val_generator = data_generator(nifti_files_val, mask_files_val)

sample_spec = (tf.TensorSpec(shape=(240, 240, 160, 1), dtype=tf.float32),
               tf.TensorSpec(shape=(240, 240, 160, 1), dtype=tf.float32))

dataset = tf.data.Dataset.from_generator(lambda: train_generator, output_signature=sample_spec)
dataset_val = tf.data.Dataset.from_generator(lambda: val_generator, output_signature=sample_spec)
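Since the generator yields one volume at a time, batch and prefetch the resulting datasets before calling fit (batch size 1 here, as in your code):

dataset = dataset.batch(1).prefetch(tf.data.AUTOTUNE)
dataset_val = dataset_val.batch(1).prefetch(tf.data.AUTOTUNE)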

If memory issues persist, try further reducing the batch size. A smaller batch size will consume less memory but might increase training time.

I tried creating patches like this:

from patchify import patchify

def load_nifti_image(filepath, patch_size=(48, 48, 32), step_size=(48, 48, 32)):
    nifti = nib.load(filepath)
    volume = nifti.get_fdata()

    # Split the volume into non-overlapping patches
    patches = patchify(volume, patch_size, step=step_size)

    # Flatten the (5, 5, 5) patch grid into a single axis and add a channel dimension (1 for grayscale)
    patches = patches.reshape(-1, *patches.shape[-3:])
    patches = np.expand_dims(patches, axis=-1)

    return patches

# Load and patchify images
nifti_images = [load_nifti_image(f) for f in nifti_files]
nifti_masks = [load_nifti_image(f) for f in mask_files]

# Flatten the list of patches for each image into a single list of patches
final_nifti_images = [patch for image in nifti_images for patch in image]
final_nifti_masks = [patch for image in nifti_masks for patch in image]

dataset = tf.data.Dataset.from_tensor_slices((final_nifti_images, final_nifti_masks))

dataset = dataset.batch(1)
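With a 240 × 240 × 160 volume and a (48, 48, 32) patch with the same step size, patchify produces a 5 × 5 × 5 grid, i.e. 125 non-overlapping patches per volume. The patch sides also stay divisible by 16, so the shapes still line up across the four pooling/upsampling levels of the U-Net.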

Now it seems like it's training, at approximately 8 minutes per epoch. What do you think about this solution compared to yours? Is your solution more efficient?

@matca
I’m glad to hear that the patch-based solution is working well for you, achieving a reasonable training time of around 8 minutes per epoch. I like your approach of creating patches to handle large 3D images. Feel free to reach out if you have any more questions or need further assistance.

@BadarJaffer Bad news: I did not realize I was training on a subset of the dataset that I had created for testing purposes. Still, I reached about 89% validation accuracy with only 40 images and without data augmentation.
But now I have another problem: the full dataset seems to be too big for the RAM. I tried your approach, but the epochs are very long (I ran it for about 2 hours and it did not finish the first epoch). Do you know a way to accelerate the process? I was thinking about distributed processing in the cloud, but maybe there is an easier approach. (I also tried a batch size of 32, but that did not work either.)

@BadarJaffer I just “finished” training, but this is what happened:

Epoch 1/100
1907/1907 [==============================] - 2108s 1s/step - loss: 0.7836 - dice_coefficient: 0.2164 - val_loss: 0.8514 - val_dice_coefficient: 0.1484
Epoch 2/100

WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 190700 batches). You may need to use the repeat() function when building your dataset.
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 356 batches). You may need to use the repeat() function when building your dataset.

1907/1907 [==============================] - 0s 42us/step - loss: 0.0000e+00 - dice_coefficient: 0.0000e+00

<keras.src.callbacks.History at 0x7dbca0fcb0d0>

This was with the data generator approach

  • Ensure GPU acceleration is activated in Colab: Runtime -> Change runtime type -> GPU.
  • Experiment with distributed training or multiple GPUs to boost training speed. Here’s a simple example using TensorFlow MirroredStrategy:
# Wrap your model creation and compilation inside this strategy
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    test_model = unet(input_shape)
    optimizer = Adam(learning_rate=0.0001)
    test_model.compile(optimizer=optimizer, loss=dice_coefficient_loss, metrics=[dice_coefficient])
  • Confirm that your dataset generator can produce enough batches by using the repeat() function (a matching fit() call with steps_per_epoch is sketched after this list):
dataset = dataset.repeat().batch(batch_size).prefetch(tf.data.AUTOTUNE)
  • Verify that the number of samples in your dataset is correctly specified.

  • Optimize data loading and preprocessing with generators. Ensure your augmentation techniques are efficient.

  • Explore mixed-precision training:

from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')
  • Double-check your validation dataset setup. Make sure it contains sufficient data for validation during each epoch.
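
If you add repeat(), also pass steps_per_epoch and validation_steps to fit(), since Keras cannot infer an epoch length from an infinitely repeating dataset. A minimal sketch (num_train_samples and num_val_samples are placeholders you would compute from your own file or patch counts):

# placeholders: compute these from your own dataset (e.g. patches per volume * number of volumes)
steps_per_epoch = num_train_samples // batch_size
validation_steps = num_val_samples // batch_size

test_model.fit(dataset,
               validation_data=dataset_val,
               epochs=100,
               steps_per_epoch=steps_per_epoch,
               validation_steps=validation_steps)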

@matca try it out and let me know if it works.