Can't get TF Dataset to work with Keras ImageDataGenerator.flow_from_directory()

Matthias · November 22, 2021, 11:05am

So far I have been using a Keras ImageDataGenerator with flow_from_directory() to train my Keras model on all images from the per-class input folders. Now I want to train on multiple GPUs, so it seems I need to use a TensorFlow Dataset object.

Thus I came up with this solution:

keras_model = build_model()
train_datagen = ImageDataGenerator()
training_img_generator = train_datagen.flow_from_directory(
    input_path,
    target_size=(image_size, image_size),
    batch_size=batch_size,
    class_mode="categorical",
)
train_dataset = tf.data.Dataset.from_generator(
    lambda: training_img_generator,
    output_types=(tf.float32, tf.float32),
    output_shapes=([None, image_size, image_size, 3], [None, len(image_classes)])
)
# similar for validation_dataset = ...
keras_model.fit(
    train_dataset,
    steps_per_epoch=train_steps_per_epoch,
    epochs=epoch_count,
    validation_data=validation_dataset,
    validation_steps=validation_steps_per_epoch,
)

Now this seems to work: the model trains as usual. However, when using a mirrored strategy, I get the following warning message during training:

AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Did not find a shardable source, walked to a node which is not a dataset

So I added the following lines between creating the data sets and calling fit():

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_dataset.with_options(options)
validation_dataset.with_options(options)

However, I still get the same warning.
This leads me to these two questions:

1. What do I need to do in order to get rid of this warning?
2. Even more important: Why is TF not able to split the dataset with the default AutoShardPolicy.FILE policy, since I am using thousands of images per class in the input folder?
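
A note on both questions. As the warning text itself says, FILE sharding needs a source dataset that reads from files; a dataset built with from_generator() is an opaque Python generator as far as tf.data is concerned, so the number of images on disk does not help. As for the options, with_options() does not modify a dataset in place. It returns a new dataset, so the result has to be assigned back. A minimal sketch, assuming a small stand-in generator instead of the image generator above (toy_batches and the tensor shapes are hypothetical):

```python
import numpy as np
import tensorflow as tf


def toy_batches():
    # Hypothetical stand-in for the Keras directory iterator: yields (images, labels).
    for _ in range(3):
        yield np.zeros((2, 8, 8, 3), np.float32), np.zeros((2, 4), np.float32)


dataset = tf.data.Dataset.from_generator(
    toy_batches,
    output_signature=(
        tf.TensorSpec(shape=(None, 8, 8, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 4), dtype=tf.float32),
    ),
)

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA
)

# Calling with_options() without keeping the result leaves `dataset` unchanged.
dataset.with_options(options)
policy_before = dataset.options().experimental_distribute.auto_shard_policy

# Assigning the returned dataset is what actually applies the options.
dataset = dataset.with_options(options)
policy_after = dataset.options().experimental_distribute.auto_shard_policy

print(policy_before, policy_after)
```

With the policy actually set to DATA, TensorFlow should no longer need to fall back from FILE, which is what the warning reports. This is a sketch and has not been run against the training script from this post.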

anon26514083 · November 22, 2021, 12:36pm

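Use it like this:

import tensorflow as tf

AUTO = tf.data.AUTOTUNE  # assumed definition; AUTO is not defined in the original post

train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))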
validation_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test))

data_augmentation = tf.keras.Sequential(
    [
        tf.keras.layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical"),
        tf.keras.layers.experimental.preprocessing.RandomRotation(factor=0.02),
        tf.keras.layers.experimental.preprocessing.RandomZoom(
            height_factor=0.2, width_factor=0.2
        ),
    ],
    name="data_augmentation",
)

def preprocess_image(image, label):
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    image = tf.image.convert_image_dtype(image, tf.float32) / 255.0
    return image, label

# Training Pipeline
pipeline_train = (
    train_ds
    .shuffle(BATCH_SIZE*100)
    .map(preprocess_image, num_parallel_calls=AUTO)
    .batch(BATCH_SIZE)
    .map(lambda x, y: (data_augmentation(x), y), num_parallel_calls=AUTO)
    .prefetch(AUTO)
)

# Validation Pipeline
pipeline_validation = (
    validation_ds
    .map(preprocess_image, num_parallel_calls=AUTO)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

Matthias · November 22, 2021, 12:41pm

Thanks, but we don’t use tensor slices, but images from a directory.
Can’t we use a Dataset with the flow_from_directory() function?
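
For reference, images in class subfolders can also be read with tf.data alone, without flow_from_directory(). The following is a rough sketch, assuming JPEG files laid out as root/<class_name>/<file>.jpg; the helper names, IMAGE_SIZE, BATCH_SIZE and the demo folders are all hypothetical, and make_demo_directory exists only so the sketch can run on its own. It makes no claim about which sharding policy TensorFlow will pick for such a pipeline.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

IMAGE_SIZE = 32  # hypothetical target size
BATCH_SIZE = 4   # hypothetical batch size


def make_demo_directory(root, class_names, images_per_class=3):
    """Write a few random JPEGs into root/<class_name>/ so the sketch can run."""
    rng = np.random.default_rng(0)
    for class_name in class_names:
        class_dir = os.path.join(root, class_name)
        os.makedirs(class_dir, exist_ok=True)
        for index in range(images_per_class):
            pixels = rng.integers(0, 256, size=(40, 40, 3), dtype=np.uint8)
            tf.io.write_file(
                os.path.join(class_dir, f"{index}.jpg"),
                tf.io.encode_jpeg(pixels),
            )


def build_directory_dataset(root, class_names):
    """Build a batched (image, one-hot label) dataset from class subfolders."""
    class_table = tf.constant(class_names)

    def load_example(path):
        # The parent folder name is the class label.
        parts = tf.strings.split(path, os.sep)
        one_hot = tf.cast(parts[-2] == class_table, tf.float32)
        image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
        # resize() returns float32 in the 0..255 range, so scale to 0..1.
        image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE)) / 255.0
        return image, one_hot

    pattern = os.path.join(root, "*", "*.jpg")
    return (
        tf.data.Dataset.list_files(pattern, shuffle=True)
        .map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(BATCH_SIZE)
        .prefetch(tf.data.AUTOTUNE)
    )


with tempfile.TemporaryDirectory() as demo_root:
    demo_classes = ["cats", "dogs"]
    make_demo_directory(demo_root, demo_classes)
    images, labels = next(iter(build_directory_dataset(demo_root, demo_classes)))
    batch_image_shape = tuple(images.shape)
    batch_label_shape = tuple(labels.shape)
    label_row_sums = labels.numpy().sum(axis=1)
```

The one-hot labels correspond to class_mode="categorical" in flow_from_directory(), so a dataset like this could be passed to fit() in place of the generator.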

Sayan_Nath · November 22, 2021, 9:11pm

I am not sure about that, but the preferred way is to use tensor slices. Follow this tutorial notebook to get an overall idea:

github.com

sayannath/American-Sign-Language-Detection/blob/master/notebook/ASL_MobileNetV2.ipynb

Matthias · November 25, 2021, 11:18am

Ok, so I ended up using your notebook.
However, this leads to exactly the same warning when using a mirrored strategy.

Sayan_Nath · November 25, 2021, 11:45am

What is the warning?

Matthias · November 25, 2021, 12:38pm

AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"

Even though I tried it with these options:

options = tfd.Options()
options.experimental_distribute.auto_shard_policy = tfd.experimental.AutoShardPolicy.DATA
training_dataset.with_options(options)
validation_dataset.with_options(options)

The warning is still the same.
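
One thing worth checking in the snippet above: with_options() returns a new dataset instead of changing the one it is called on, so calling it without assigning the result has no effect. A small helper keeps that from being forgotten for either dataset. This is only a sketch; with_data_sharding and the zero-filled arrays are hypothetical stand-ins for the real pipeline.

```python
import numpy as np
import tensorflow as tf


def with_data_sharding(dataset):
    """Return a copy of `dataset` that requests DATA auto-sharding."""
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.DATA
    )
    return dataset.with_options(options)


# Hypothetical in-memory data standing in for the real images and labels.
features = np.zeros((10, 8, 8, 3), np.float32)
labels = np.zeros((10, 4), np.float32)

training_dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(4)
# Keep the returned dataset; this is the object to pass to fit().
training_dataset = with_data_sharding(training_dataset)

training_policy = training_dataset.options().experimental_distribute.auto_shard_policy
num_batches = sum(1 for _ in training_dataset)
```

A tensor-slice dataset has no files to shard by, so asking for DATA explicitly is expected to avoid the fallback that the warning describes, though this has not been run against the gist below.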

Matthias · November 25, 2021, 12:51pm

Here is the full source code:

gist.github.com

https://gist.github.com/haimat/d5f179b23e61c2b80ba424f988b90c9e

keras-multi-gpu.py

# Train a Keras model on multiple GPUs in parallel, using TF Dataset slices.

from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications import vgg16
from tensorflow.keras import models
from tensorflow.keras import layers

from sklearn.preprocessing import LabelEncoder
from collections import Counter
from imutils import paths


anon26514083 · November 29, 2021, 1:32pm

Thanks, I will look into this.