Can't get TF Dataset to work with Keras ImageDataGenerator.flow_from_directory()

So far I was using a Keras ImageDataGenerator with flow_from_directory() to train my Keras model with all images from the image class input folders. Now I want to train on multiple GPUs, so it seems I need to use a TensorFlow Dataset object.

Thus I came up with this solution:

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

keras_model = build_model()
train_datagen = ImageDataGenerator()
training_img_generator = train_datagen.flow_from_directory(
    input_path,
    target_size=(image_size, image_size),
    batch_size=batch_size,
    class_mode="categorical",
)
# Wrap the Keras generator in a tf.data.Dataset
train_dataset = tf.data.Dataset.from_generator(
    lambda: training_img_generator,
    output_types=(tf.float32, tf.float32),
    output_shapes=([None, image_size, image_size, 3], [None, len(image_classes)])
)
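The step counts are taken from the generators themselves; a minimal sketch for the training side:

# the directory iterator implements __len__ as ceil(samples / batch_size)
train_steps_per_epoch = len(training_img_generator)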
# similar for validation_dataset = ...
keras_model.fit(
    train_dataset,
    steps_per_epoch=train_steps_per_epoch,
    epochs=epoch_count,
    validation_data=validation_dataset,
    validation_steps=validation_steps_per_epoch,
)

Now this seems to work: the model trains as usual. However, during training I get the following warning message when using a mirrored strategy:

AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Did not find a shardable source, walked to a node which is not a dataset
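For context, the strategy setup follows the standard pattern, with the model built inside the scope (a sketch, using build_model() from above):

strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)
with strategy.scope():
    # model variables are created (and mirrored across GPUs) under the strategy scope
    keras_model = build_model()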

So I added the following lines between creating the datasets and calling fit():

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_dataset.with_options(options)
validation_dataset.with_options(options)

However, I still get the same warning.
This leads me to these two questions:

  1. What do I need to do in order to get rid of this warning?
  2. More importantly: why is TF not able to shard the dataset with the default AutoShardPolicy.FILE policy, given that there are thousands of images per class in the input folder?

Use it like this:

import tensorflow as tf

AUTO = tf.data.experimental.AUTOTUNE
IMG_SIZE = 224    # example value; use your own image size
BATCH_SIZE = 32   # example value

# x_train/y_train and x_test/y_test are in-memory image arrays and labels
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
validation_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test))

data_augmentation = tf.keras.Sequential(
    [
        tf.keras.layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical"),
        tf.keras.layers.experimental.preprocessing.RandomRotation(factor=0.02),
        tf.keras.layers.experimental.preprocessing.RandomZoom(
            height_factor=0.2, width_factor=0.2
        ),
    ],
    name="data_augmentation",
)

def preprocess_image(image, label):
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    # scale pixel values to [0, 1] (assumes 0-255 input)
    image = tf.image.convert_image_dtype(image, tf.float32) / 255.0
    return image, label

# Training pipeline
pipeline_train = (
    train_ds
    .shuffle(BATCH_SIZE * 100)
    .map(preprocess_image, num_parallel_calls=AUTO)
    .batch(BATCH_SIZE)
    .map(lambda x, y: (data_augmentation(x), y), num_parallel_calls=AUTO)
    .prefetch(AUTO)
)

# Validation pipeline
pipeline_validation = (
    validation_ds
    .map(preprocess_image, num_parallel_calls=AUTO)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)
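With the pipelines in place, training is a plain fit() call; since the datasets are finite, no steps_per_epoch is needed (model and EPOCHS are placeholders here):

model.fit(
    pipeline_train,
    validation_data=pipeline_validation,
    epochs=EPOCHS,  # placeholder epoch count
)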

Thanks, but we don't use tensor slices; we read our images from a directory.
Can't we use a Dataset with the flow_from_directory() function?

I am not sure about that, but the preferred way is to use tensor slices. Follow this tutorial to get the overall insight.
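If you need to read straight from a class-per-subfolder directory, one alternative worth knowing about is image_dataset_from_directory, which returns a genuine tf.data.Dataset. A sketch, assuming TF >= 2.3 and reusing the variable names from your first post:

import tensorflow as tf

train_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    input_path,                           # one subfolder per class, as with flow_from_directory
    label_mode="categorical",             # one-hot labels, matching class_mode="categorical"
    image_size=(image_size, image_size),
    batch_size=batch_size,
)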

Ok, so I ended up using your notebook.
However, this leads to exactly the same warning when using a mirrored strategy :frowning:

What is the warning?

AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"

Even though I tried it with these options:

tfd = tf.data  # alias used below

options = tfd.Options()
options.experimental_distribute.auto_shard_policy = tfd.experimental.AutoShardPolicy.DATA
training_dataset.with_options(options)
validation_dataset.with_options(options)

The warning is still the same.

Here is the full source code:

Thanks, will look into this.
