Data Leakage - image_dataset_from_directory()

Hi, I’m trying to build a very basic CNN for multiclass image classification and I’m a little stuck on one of the first steps: splitting the data. Following a YouTube tutorial, I initially created a dataset with tf.keras.utils.image_dataset_from_directory() and then split it into train/validation/test using .skip() and .take(). The model trained fine, but I noticed that the test set changed each time I used it (even when it was only one batch). My understanding is that with this method the dataset reshuffles and redraws all of the samples every time it is iterated. So: 1. Is this a source of data leakage, in that the samples are redrawn at each training epoch and the model therefore ends up seeing the test set?
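A minimal repro of what I’m seeing (the directory path is a placeholder):

import tensorflow as tf

# shuffle defaults to True, and the dataset reshuffles on every iteration
data = tf.keras.utils.image_dataset_from_directory("data/train")

# Treat the last batch as the "test set"
test = data.skip(len(data) - 1).take(1)

# Two passes over this one-batch dataset return different images
labels_pass_1 = next(iter(test))[1].numpy()
labels_pass_2 = next(iter(test))[1].numpy()
print(labels_pass_1, labels_pass_2)  # usually different samples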

As a result, I decided to create a separate directory for test data that I never touch, and to only split train/validation programmatically. From reading online I realised I could do this with the validation_split keyword. However, that brings up my second question: if I split using validation_split (Method 1 below) I only get validation accuracy of about 0.5 during training, whereas with the skip()/take() method (Method 2) I can get up to 0.95. I’m clearly doing something different between the two methods but can’t see what. 2. Could anyone explain the difference, and which method is better?

## METHOD 1 ##

validation_split = 0.2

train1 = tf.keras.utils.image_dataset_from_directory(
    train_dir,
    validation_split=validation_split,
    subset="training",
    seed=RANDOM_STATE)

val1 = tf.keras.utils.image_dataset_from_directory(
    train_dir,
    validation_split=validation_split,
    subset="validation",
    seed=RANDOM_STATE)

# Scale the pixel values to between 0 and 1
train1 = train1.map(lambda x, y: (x / 255., y))
val1 = val1.map(lambda x, y: (x / 255., y))
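
As far as I can tell, the datasets returned by image_dataset_from_directory expose a file_paths attribute, so the two subsets can be checked for overlap when both calls share the same seed and validation_split:

# Sanity check: no image should appear in both subsets
overlap = set(train1.file_paths) & set(val1.file_paths)
print(len(overlap))  # expect 0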
## METHOD 2 ##
data = tf.keras.utils.image_dataset_from_directory(train_dir)

# Scale the pixel values to between 0 and 1
data = data.map(lambda x, y: (x / 255., y))

# Split into train and validation batches
n_batches = len(data)
train_size = int(n_batches * 0.8)
val_size = int(n_batches * 0.2)

# If rounding makes the two sizes sum to less than the number of
# batches, add the spare batches to the training set
if train_size + val_size < n_batches:
    train_size += n_batches - (train_size + val_size)

train2 = data.take(train_size)
val2 = data.skip(train_size).take(val_size)
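
If the reshuffling really is the culprit, I’m guessing something along these lines would pin the split down (a sketch only, untested; batch_size=None needs a reasonably recent TF version):

# Read unbatched in a fixed (alphabetical) order, shuffle once, then split
data = tf.keras.utils.image_dataset_from_directory(
    train_dir, shuffle=False, batch_size=None)
data = data.map(lambda x, y: (x / 255., y))

n = int(data.cardinality().numpy())
# A full-size buffer with reshuffle_each_iteration=False gives one fixed
# permutation, but it holds every decoded image in memory, so this only
# suits small datasets
data = data.shuffle(n, seed=RANDOM_STATE, reshuffle_each_iteration=False)

train2 = data.take(int(n * 0.8)).batch(32)
val2 = data.skip(int(n * 0.8)).batch(32)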

Thank you so much for any help!

Hi @EmmaK, by default tf.keras.utils.image_dataset_from_directory shuffles the data (the shuffle argument is set to True), and the shuffled dataset is reshuffled every time you iterate over it. Because of this, a split made with skip()/take() is not fixed: across epochs the same image can end up in both the train and validation sets. That is exactly the leakage you suspected, and it also explains the inflated 0.95 validation accuracy in Method 2. It is recommended to use Method 1. Thank you.
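
For completeness, a sketch of the recommended setup (directory names and the seed are placeholders; pass image_size explicitly if your model expects something other than the default (256, 256)):

import tensorflow as tf

RANDOM_STATE = 42  # any fixed seed, reused in both calls below

# Train/validation come from one directory; using the same seed and
# validation_split in both calls keeps the two subsets disjoint
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",            # placeholder path
    validation_split=0.2,
    subset="training",
    seed=RANDOM_STATE)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",
    validation_split=0.2,
    subset="validation",
    seed=RANDOM_STATE)

# The held-out test set lives in its own directory and is not shuffled,
# so evaluation sees the same batches every time
test_ds = tf.keras.utils.image_dataset_from_directory(
    "data/test", shuffle=False)   # placeholder path

rescale = lambda x, y: (x / 255., y)
train_ds = train_ds.map(rescale)
val_ds = val_ds.map(rescale)
test_ds = test_ds.map(rescale)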