I’m relatively new to TensorFlow, so please excuse me if this is a basic question.
Here is what I’m trying to do:
I have 2 separate datasets:
- melanoma_ds: contains 10,000 true-positive cases (TensorFlow dataset)
- no_melanoma_ds: contains 10,000 true-negative cases (TensorFlow dataset)
I would like to concatenate these two datasets and do a shuffle afterwards.
train_ds = no_melanoma_ds.concatenate(melanoma_ds)
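To illustrate why the shuffle is needed afterwards (a toy sketch using small `Dataset.range` stand-ins for the two 10,000-element datasets): `concatenate` simply appends one dataset after the other, so every negative example would come before every positive one.

```python
import tensorflow as tf

# tiny stand-ins for no_melanoma_ds and melanoma_ds
negatives = tf.data.Dataset.range(0, 3)
positives = tf.data.Dataset.range(3, 6)

train = negatives.concatenate(positives)
order = [int(x) for x in train]
print(order)  # all negatives first, then all positives: [0, 1, 2, 3, 4, 5]
```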
My problem is the shuffle.
I want to have a well shuffled train dataset so I tried to use:
train_ds = train_ds.shuffle(20000)
I’m using Google Colab, and it seems like I ran out of graphics card memory (11 GB limit) → the Colab session crashes.
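As far as I understand, `Dataset.shuffle(buffer_size)` keeps a buffer of `buffer_size` elements in memory and samples from it, so `shuffle(20000)` has to hold all 20,000 (decoded) elements at once — which would explain running out of memory. A rough pure-Python model of that buffering behaviour (my own sketch, not TensorFlow’s actual implementation):

```python
import random

def buffered_shuffle(stream, buffer_size, rng=random.Random(0)):
    """Rough model of Dataset.shuffle: fill a buffer of `buffer_size`
    elements, then repeatedly emit a random one and refill from the stream."""
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:  # drain the buffer once the stream is exhausted
        yield buffer.pop(rng.randrange(len(buffer)))

out = list(buffered_shuffle(range(10), buffer_size=4))
print(out)  # a permutation of 0..9; at most `buffer_size` items held at once
```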
So I tried to take smaller portions (5,000 elements instead of 20,000) from my two datasets, shuffle each portion with a buffer_size of only 5000, and concatenate everything afterwards:
def memory_efficient_shuffle(melanoma_ds=melanoma_ds, no_melanoma_portion=no_melanoma_portion):
    shuffle_rounds = 4
    batch_each_class = 2500
    final_shuffled_ds = None
    for i in range(shuffle_rounds):
        tmp_start = batch_each_class * i
        tmp_melanoma_ds = melanoma_ds.skip(tmp_start).take(batch_each_class)
        tmp_no_melanoma_ds = no_melanoma_portion.skip(tmp_start).take(batch_each_class)
        both_portions_ds = tmp_melanoma_ds.concatenate(tmp_no_melanoma_ds)
        shuffled_portion_ds = both_portions_ds.shuffle(5000)
        # accumulate every shuffled portion onto one result dataset
        final_shuffled_ds = (shuffled_portion_ds if final_shuffled_ds is None
                             else final_shuffled_ds.concatenate(shuffled_portion_ds))
    return final_shuffled_ds
This actually works and the session does not crash…
But if I try to pick the first element of the shuffled dataset, it takes a very long time, and I don’t know whether the program will ever terminate.
final_shuffled_ds = memory_efficient_shuffle()
image, label = next(iter(final_shuffled_ds))
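One thing I suspect (a hypothetical explanation, modelled below in plain Python rather than tf.data): `skip()` still has to produce the elements it discards, and each `shuffle(5000)` has to fill its buffer before emitting anything, so even fetching the first element forces thousands of images to be read. Counting source reads in a simplified model of the skip/take slicing:

```python
reads = {"count": 0}

def source():
    """Stand-in for one 10,000-element dataset; counts every element produced."""
    for i in range(10_000):
        reads["count"] += 1
        yield i

def skip_take(make_source, skip, take):
    """Model of ds.skip(skip).take(take): a fresh pass that discards
    `skip` elements before keeping `take` of them."""
    it = make_source()
    for _ in range(skip):
        next(it)  # skipped elements are still produced by the source
    return [next(it) for _ in range(take)]

portions = []
for i in range(4):
    portions += skip_take(source, skip=2_500 * i, take=2_500)

print(sorted(portions) == list(range(10_000)))  # True: every element kept once
print(reads["count"])  # 25000: (0+2500+5000+7500) skips plus 4*2500 takes
```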
I bet I made a lot of mistakes along the way :D
I would really like to know how you would approach this.