Train a model built on a custom data loader with multi-GPU support

I am interested in scaling an existing model that uses a custom data loader built on tensorflow.keras.utils.Sequence to multiple GPUs. Can anybody share a few thoughts?
The custom data loader is built on tensorflow.keras.utils.Sequence rather than tf.data.Dataset because of the nature of the dataset.

The following code is a minimal example.
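A simplified sketch of the setup (the exact preprocessing in __getitem__ is illustrative; model, batch_size, num_classes, and epochs are defined elsewhere):

import numpy as np
import tensorflow as tf

class MnistSequence(tf.keras.utils.Sequence):
    def __init__(self, x, y, batch_size, mode):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.mode = mode  # 'TRAIN' or 'VAL'

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        # Normalize the images and add a channel axis; one-hot encode the labels.
        x = self.x[lo:lo + self.batch_size].astype("float32")[..., None] / 255.0
        y = tf.keras.utils.to_categorical(self.y[lo:lo + self.batch_size], num_classes)
        return x, y

model.fit(MnistSequence(x_train, y_train, batch_size, 'TRAIN'),
          validation_data=MnistSequence(x_test, y_test, batch_size, 'VAL'),
          epochs=epochs,
          workers=4,                 # multiple CPU worker processes
          use_multiprocessing=True)  # process-based loading on a single node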

The above example uses multiprocessing with a custom data loader on a single node with multiple CPUs. Is there a way I can scale it to a multi-GPU mirrored strategy with a custom data loader like the one in the example?

I dug around a bit, but most of the examples in the official documentation use tf.data for multi-GPU training, which makes them a little complicated to adapt.

Hi,

I didn’t understand exactly what you want, but let me add my 2 cents.
If you want to train the model on multiple GPUs, you should look into distribution strategies rather than the data loader.
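The usual MirroredStrategy pattern is to create and compile the model inside the strategy scope; a minimal sketch (build_model is just a placeholder for your own model code):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Model creation and compilation must happen inside the scope
    # so that the variables are mirrored across the GPUs.
    model = build_model()  # placeholder for your model definition
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(...) is then called outside the scope as usual.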

On the data side of things, you might want to be as efficient as possible with your CPU and have a very good input pipeline, like you can see here: Better performance with the tf.data API  |  TensorFlow Core
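The main ideas from that guide, in a nutshell (parse_fn is a placeholder for whatever per-example work you do):

import tensorflow as tf

dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(10_000)
           .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)  # parallelize CPU preprocessing
           .batch(batch_size)
           .prefetch(tf.data.AUTOTUNE))  # overlap preprocessing with training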

(sorry if I misunderstood your question)

To use a distribution strategy, the data must be pipelined in a distributed way. Most of the examples use the tf.data API together with well-known datasets from TensorFlow Datasets. But if the dataset is built with a custom loader like the one above (using tensorflow.keras.utils.Sequence), then things may change when distributing data across multiple GPUs. I just want to know the right way to do this.
One way to do this is tf.data.Dataset.from_generator, but something is not working out:

seq_iter_tr = lambda: (s for s in MnistSequence(x_train, y_train, batch_size, 'TRAIN'))
seq_iter_ts = lambda: (s for s in MnistSequence(x_test, y_test, batch_size, 'VAL'))

seq_train = tf.data.Dataset.from_generator(seq_iter_tr, output_signature=(
    tf.TensorSpec(shape=(batch_size, 28, 28, 1), dtype=tf.string),
    tf.TensorSpec(shape=(batch_size, num_classes), dtype=tf.string)))
seq_test = tf.data.Dataset.from_generator(seq_iter_ts, output_signature=(
    tf.TensorSpec(shape=(batch_size, 28, 28, 1), dtype=tf.string),
    tf.TensorSpec(shape=(batch_size, num_classes), dtype=tf.string)))

I'm getting a shape error.
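Two things in that signature look like likely culprits: the dtype is tf.string even though the Sequence yields numeric arrays, and the fixed batch_size in the shape will fail on a smaller final batch. A sketch of a corrected signature, assuming the Sequence yields float32 images and one-hot labels:

output_sig = (
    # None for the batch dimension, since the last batch may be smaller,
    # and tf.float32 rather than tf.string for numeric data.
    tf.TensorSpec(shape=(None, 28, 28, 1), dtype=tf.float32),
    tf.TensorSpec(shape=(None, num_classes), dtype=tf.float32))

seq_train = tf.data.Dataset.from_generator(seq_iter_tr, output_signature=output_sig)
seq_test = tf.data.Dataset.from_generator(seq_iter_ts, output_signature=output_sig)

# These datasets can then be passed to model.fit with the model built
# under a MirroredStrategy scope, as in the earlier sketch.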