How to concatenate and shuffle two TensorFlow datasets with 10000 records each without running out of memory (11 GB)

Hey,
I’m relatively new to TensorFlow, so please excuse me if this is a beginner question :D

Here is what I’m trying to do:
I have 2 separate Datasets:

  1. melanoma_ds: contains 10000 true positive cases (TensorFlow dataset)
  2. no_melanoma_ds: contains 10000 true negative cases (TensorFlow dataset)

I would like to concatenate these two datasets and do a shuffle afterwards.
train_ds = no_melanoma_ds.concatenate(melanoma_ds)

My problem is the shuffle.

I want to have a well shuffled train dataset so I tried to use:
train_ds = train_ds.shuffle(20000)
I’m using Google Colab and it seems like I ran out of graphics card memory (11 GB limit).
→ the Colab session crashes

So I tried to pick smaller batches (5000 instead of 20000) from my two datasets, no_melanoma_ds and melanoma_ds,
shuffle them with a buffer_size of only 5000, and concatenate all of them afterwards:

def memory_efficient_shuffle(melanoma_ds=melanoma_ds, no_melanoma_ds=no_melanoma_ds):
  shuffle_rounds = 4
  batch_each_class = 2500
  final_shuffled_ds = None

  for i in range(shuffle_rounds):
    # Take the next 2500 examples of each class ...
    tmp_start = batch_each_class * i
    tmp_melanoma_ds = melanoma_ds.skip(tmp_start).take(batch_each_class)
    tmp_no_melanoma_ds = no_melanoma_ds.skip(tmp_start).take(batch_each_class)
    # ... concatenate them and shuffle only this 5000-element portion ...
    both_portions_ds = tmp_melanoma_ds.concatenate(tmp_no_melanoma_ds)
    shuffled_portion_ds = both_portions_ds.shuffle(5000)
    # ... and append the shuffled portion to the result.
    if final_shuffled_ds is None:
      final_shuffled_ds = shuffled_portion_ds
    else:
      final_shuffled_ds = final_shuffled_ds.concatenate(shuffled_portion_ds)
  return final_shuffled_ds

This actually works and the session does not crash…
But if I try to pick the first element of the shuffled dataset, it takes a very long time and I don’t know if the program will ever terminate.

final_shuffled_ds = memory_efficient_shuffle()
image, label = next(iter(final_shuffled_ds))

I bet I made a lot of mistakes during the whole process :D
I would really like to know how you would approach this?

Try using float32 instead of float64; this could solve your memory issue.


Hi Timo_v,

Is your dataset images? If so, you can use ImageDataGenerator to load and shuffle the data for you.
If it is not images, you can assign each piece of data a unique identifier and shuffle the array of identifiers. You can then lazily load each piece of data as you iterate over the array.
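
For example, here is a minimal sketch of the identifier idea; the file paths, labels, and image size below are placeholders, not your actual data:

import numpy as np
import tensorflow as tf

# Placeholder identifiers: one path and one label per image on disk.
file_paths = np.array(["images/img_%05d.jpg" % i for i in range(20000)])
labels = np.array([0] * 10000 + [1] * 10000)

# Shuffle the lightweight identifiers instead of the images themselves.
perm = np.random.permutation(len(file_paths))
file_paths, labels = file_paths[perm], labels[perm]

def load_image(path, label):
  # Images are read and decoded lazily, only while the dataset is iterated.
  image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
  return tf.image.resize(image, (384, 384)), label

ds = tf.data.Dataset.from_tensor_slices((file_paths, labels))
ds = ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(32)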
On top of that, Colab has pro versions that may allow you to have increased memory.


Good question, @Timo_v, and welcome to the TF Forum! Looping in @markdaoust


Okay.

The main thing to remember here is that shuffle runs in memory. So this loads all 20000 images into memory.
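
For a rough sense of scale (the image size is an assumption here): a decoded 384 × 384 × 3 float32 image is about 1.8 MB, so a 20000-element shuffle buffer needs roughly 35 GB, far more than the 11 GB available.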

Remember that .skip still loads all the data; it just throws the first N on the floor.
And that it still has to load at least 5k images before it returns the first batch.

Yes, if you have loose image files. But prefer tf.keras.utils.image_dataset_from_directory.
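
For example, with a hypothetical directory layout of one subfolder per class (images/melanoma and images/no_melanoma), a minimal sketch could look like this:

import tensorflow as tf

# Labels are inferred from the subdirectory names; files are decoded lazily.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "images",
    label_mode="binary",
    image_size=(384, 384),  # assumed size, adjust to your images
    batch_size=32,
    shuffle=True,
    seed=42,
)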

If you don’t have directories of image files … what do you have? Where are these melanoma_ds and no_melanoma_ds coming from?


It is already stored in the float32 format.

Yes, you are right, my dataset contains images.
Thank you, I will give it a try :)!

Thank you for this detailed explanation!

The data is in the form of TensorFlow records.
This is the link to the dataset: https://www.kaggle.com/cdeotte/melanoma-384x384
It contains a lot of information but I only use the images and the corresponding labels.

Oh, it’s TFRecord files.

So what you want to do then is not try to shuffle all the images together, but shuffle the list of files (Dataset.list_files shuffles the order each epoch), and then do a smaller shuffle of the individual examples. Start with something like this:

# list_files shuffles the order each iteration
ds = tf.data.Dataset.list_files("train*")
ds = ds.interleave(tf.data.TFRecordDataset, ...)
ds = ds.shuffle(...)
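
A fuller sketch of that pipeline (the parse function, the feature names, and the concrete argument values are assumptions; adapt them to the actual TFRecord schema):

import tensorflow as tf

# Shuffle at the file level first; list_files reshuffles the file order each iteration.
filenames = tf.data.Dataset.list_files("train*", shuffle=True)

# Read several TFRecord files concurrently and mix their examples.
ds = filenames.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE)

def parse_example(serialized):
  # Hypothetical feature spec; check the real one for this Kaggle dataset.
  features = {
      "image": tf.io.FixedLenFeature([], tf.string),
      "target": tf.io.FixedLenFeature([], tf.int64),
  }
  parsed = tf.io.parse_single_example(serialized, features)
  image = tf.io.decode_jpeg(parsed["image"], channels=3)
  return image, parsed["target"]

ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
# A much smaller in-memory shuffle is enough once the files are already mixed.
ds = ds.shuffle(2048)
ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)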

Ref: Dataset.interleave
