Confusion regarding how tf.keras.preprocessing.image_dataset_from_directory works

Hello TensorFlow developers,

I encountered some rather strange behavior from the tf.keras.preprocessing.image_dataset_from_directory function, and I was wondering if you could clarify it for me. The model I’m working with is based on this example.

In my code, I load the data like so:

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "MyDataset",
    validation_split=0.2,
    subset="training",
    seed=1337,
    image_size=image_size,
    batch_size=batch_size,
)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "MyDataset",
    validation_split=0.2,
    subset="validation",
    seed=1337,
    image_size=image_size,
    batch_size=batch_size,
)

I run the data loading cell above only once.

I then go on to train my model while saving the weights at each epoch. After training, I use the model’s training history to pick the weights which achieved the highest validation accuracy.
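
For context, picking the best epoch from the training history looks roughly like the sketch below. The metric key "val_accuracy", the made-up numbers, and the helper name best_checkpoint are assumptions for illustration only; the save_at_{epoch}.h5 pattern matches the file name I load further down, assuming the checkpoint epochs are numbered from 1.

```python
def best_checkpoint(history, metric="val_accuracy", pattern="save_at_{epoch}.h5"):
    """Return (epoch, filename) for the epoch with the highest value of `metric`.

    `history` is a dict shaped like the `history.history` attribute of the
    object returned by model.fit(). Epochs are assumed to be numbered from 1
    in the checkpoint file names.
    """
    values = history[metric]
    # Index of the largest value; ties resolve to the earliest epoch.
    best_index = max(range(len(values)), key=values.__getitem__)
    epoch = best_index + 1
    return epoch, pattern.format(epoch=epoch)


# Made-up validation accuracies, one entry per epoch (illustration only).
example_history = {"val_accuracy": [0.61, 0.74, 0.72, 0.78, 0.77]}
epoch, filename = best_checkpoint(example_history)
print(epoch, filename)
```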

Here’s where the strange behavior occurs: I want to load the best model weights and calculate the classification metrics (accuracy, F1, etc.) of the loaded model. Below are the relevant Jupyter Notebook cells and their outputs:

model = keras.models.load_model('save_at_1246.h5')

from sklearn.metrics import classification_report,confusion_matrix
import numpy as np

images_list = []
labels_list = []
for images, labels in val_ds.take(1):
    images_list.append(images)
    labels_list.append(labels.numpy())

predictions = np.argmax(model.predict(images_list), axis=-1)
predictions = predictions.reshape(1,-1)[0]
print("labels_list:")
print(labels_list)
print("numpy unique counts:")
print(np.unique(labels_list, return_counts=True))

print(classification_report(labels_list[0], predictions, target_names = ['1 (Class 0)','2 (Class 1)','3 (Class 2)','4 (Class 3)']))

labels_list:
[array([0, 3, 1, 1, 2, 3, 2, 2, 1, 0, 3, 1, 2, 0, 2, 1, 3, 2, 1, 1, 1, 1,
       1, 1, 3, 3, 2, 1, 2, 3, 1, 1])]
numpy unique counts:
(array([0, 1, 2, 3]), array([ 3, 14,  8,  7], dtype=int64))
              precision    recall  f1-score   support

 1 (Class 0)       1.00      0.67      0.80         3
 2 (Class 1)       0.82      1.00      0.90        14
 3 (Class 2)       0.71      0.62      0.67         8
 4 (Class 3)       0.67      0.57      0.62         7

    accuracy                           0.78        32
   macro avg       0.80      0.72      0.75        32
weighted avg       0.78      0.78      0.77        32

However, when I run the same cell a second time (the code is pasted again below), I get different results:

from sklearn.metrics import classification_report,confusion_matrix
import numpy as np

images_list = []
labels_list = []
for images, labels in val_ds.take(1):
    images_list.append(images)
    labels_list.append(labels.numpy())

predictions = np.argmax(model.predict(images_list), axis=-1)
predictions = predictions.reshape(1,-1)[0]
print("labels_list:")
print(labels_list)
print("numpy unique counts:")
print(np.unique(labels_list, return_counts=True))

print(classification_report(labels_list[0], predictions, target_names = ['1 (Class 0)','2 (Class 1)','3 (Class 2)','4 (Class 3)']))

labels_list:
[array([3, 2, 3, 1, 2, 1, 1, 3, 2, 1, 2, 1, 1, 3, 3, 2, 2, 1, 3, 2, 0, 1,
       2, 2, 2, 3, 1, 3, 0, 1, 2, 1])]
numpy unique counts:
(array([0, 1, 2, 3]), array([ 2, 11, 11,  8], dtype=int64))
              precision    recall  f1-score   support

 1 (Class 0)       0.67      1.00      0.80         2
 2 (Class 1)       0.83      0.91      0.87        11
 3 (Class 2)       0.67      0.73      0.70        11
 4 (Class 3)       0.60      0.38      0.46         8

    accuracy                           0.72        32
   macro avg       0.69      0.75      0.71        32
weighted avg       0.71      0.72      0.70        32

Notice the discrepancy between the numpy unique counts in the two outputs. The first run has [3, 14, 8, 7] as the label distribution, while the second has [2, 11, 11, 8]. I did not expect this. I did expect the samples in val_ds to be shuffled (because I didn’t pass shuffle=False to the function), but what bothers me is that the label counts differ when I re-run the same cell. Mind you, I ran the cell that creates val_ds only once.
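
To check my own understanding, here is a toy sketch in plain Python (not TensorFlow itself). It assumes that val_ds holds a fixed set of samples which is merely reshuffled on every pass, and that my cell only ever looks at the first batch of 32. The label list, the one_pass helper, and the class sizes are all made up for illustration.

```python
import random
from collections import Counter


def one_pass(labels, batch_size, rng):
    """Simulate one pass over a dataset: reshuffle a copy, then split into batches."""
    shuffled = labels[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


rng = random.Random(0)
# Hypothetical fixed validation set: 100 labels across 4 classes.
val_labels = [0] * 10 + [1] * 40 + [2] * 30 + [3] * 20

pass_a = one_pass(val_labels, 32, rng)
pass_b = one_pass(val_labels, 32, rng)

# Label counts in the FIRST batch of each pass (what take(1) would see).
print(sorted(Counter(pass_a[0]).items()))
print(sorted(Counter(pass_b[0]).items()))

# Label counts over the WHOLE pass: identical, because the sample set is fixed.
full_a = Counter(label for batch in pass_a for label in batch)
full_b = Counter(label for batch in pass_b for label in batch)
print(full_a == full_b)
```

If this is what is going on, then computing the metrics over every batch of val_ds rather than over val_ds.take(1) should give the same label counts on each run, but I would appreciate confirmation.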

I have two questions on this:

  1. Why is this happening, and is there a way to get my desired behavior, that is, to get the same data samples (albeit maybe not in the same order) from tf.keras.preprocessing.image_dataset_from_directory every time?
  2. If tf.keras.preprocessing.image_dataset_from_directory works the way I described here, does it mean that during training with model.fit() there’s an overlap between training and validation datasets?
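
For question 2, one check I could run is to compare the file paths behind the two splits. This is a minimal sketch with a hypothetical helper, split_overlap, and made-up paths:

```python
def split_overlap(train_paths, val_paths):
    """Return the sorted list of file paths that appear in both splits."""
    return sorted(set(train_paths) & set(val_paths))


# Made-up file paths, for illustration only.
train_paths = ["MyDataset/1/a.png", "MyDataset/2/b.png", "MyDataset/3/c.png"]
val_paths = ["MyDataset/1/d.png", "MyDataset/4/e.png"]

# An empty list means the two splits share no files.
print(split_overlap(train_paths, val_paths))
```

As far as I can tell, the datasets returned by image_dataset_from_directory expose a file_paths attribute, so split_overlap(train_ds.file_paths, val_ds.file_paths) could be run on my real data, though I have not confirmed this for every TensorFlow version.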

Thank you in advance!