MultiWorkerMirroredStrategy with distributed dataset question

Hi! I’m trying to train a model on ImageNet (imagenet2012) using a cluster of three workers. The cluster has three nodes, all with at least one GPU, but two of the nodes have limited disk and memory (about 100 GB of storage in total), which makes it impossible to store the ImageNet dataset there; it is therefore stored on the remaining node, which has 2 TB of storage.

My problem is the following: I would like to train a model with MultiWorkerMirroredStrategy across the three nodes I mentioned, but the dataset is only available to one of the workers, which is a problem for the other two nodes, since the training function requires a dataset on every worker. Is it possible to run this training strategy with this setup? Maybe I’m missing something in the configuration of the dataset distribution or the model-fitting function? Thank you all in advance!
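For reference, each node is launched with a TF_CONFIG environment variable along these lines (hostnames, ports, and the node ordering are placeholders, not my actual cluster addresses):

```python
import json
import os

# Hypothetical TF_CONFIG for a three-node cluster. Every node sets the same
# "cluster" block but its own "index" before launching the training script.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": [
            "node0.example.com:12345",  # node with the 2 TB disk (chief, index 0)
            "node1.example.com:12345",
            "node2.example.com:12345",
        ],
    },
    "task": {"type": "worker", "index": 0},  # 0, 1, or 2 depending on the node
})
```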

Hi @alopezia97, as far as I know, you can use distributed training even if the dataset is present only on one worker, which is referred to as the first worker or chief. When you use tf.distribute.experimental.MultiWorkerMirroredStrategy, auto-sharding of the dataset takes place, meaning each worker is assigned a subset of the entire dataset during training, so each worker trains on its own portion of the data. Thank you.

Hi @Kiran_Sai_Ramineni, thank you for your answer! Maybe I’m not understanding correctly how MultiWorkerMirroredStrategy works: I thought all nodes had to execute the same code, including the training part of the script. Maybe that’s not the case?
Thank you again!