How does MultiWorkerMirroredStrategy work?

We are trying to run distributed training on a cluster with only CPUs. After reading the tutorials we chose tf.distribute.MultiWorkerMirroredStrategy, but some things are confusing us. The docs say I need to prepare the same code on every worker and that this strategy will distribute the model, checkpoint, and dataset to every worker. But does it send the actual sample data, or just the index of each sample? Do I need to prepare the model, checkpoint, and the whole dataset on every worker? I was hoping the chief worker could load all the data itself and send each worker only what it needs, so the other workers don't have to prepare the training data. Because of our business rules for using the cluster, it's not easy for us to put all the data on every worker.
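
For reference, this is roughly the per-worker script we run (the cluster addresses, the tiny model, and the synthetic dataset are just placeholders for our real job):

```python
import json
import os
import tensorflow as tf

# The same script runs on every worker; only the task index in TF_CONFIG differs.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},  # placeholder addresses
    "task": {"type": "worker", "index": 0},                 # index 0 acts as the chief
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

def make_dataset():
    # In the real job this reads our training files from local disk on each worker;
    # here it is just random data so the sketch runs. We assumed the strategy
    # decides which samples each worker actually trains on.
    x = tf.random.uniform((1000, 10))
    y = tf.random.uniform((1000, 1))
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(64).repeat()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

model.fit(make_dataset(), epochs=5, steps_per_epoch=100)
```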
We tried loading the checkpoint only on the chief worker, and the program doesn't work. We also tried loading the whole dataset on the chief worker and only part of the dataset on the other workers, and that doesn't work either.
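
The checkpoint variation we tried looked roughly like this (it reuses the strategy and model from the sketch above; the checkpoint path is a placeholder):

```python
# Variation that failed: restore the checkpoint only on the chief worker.
task = json.loads(os.environ["TF_CONFIG"])["task"]
is_chief = task["type"] == "worker" and task["index"] == 0

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    if is_chief:
        # Only the chief restores; the other workers start from scratch.
        model.load_weights("/checkpoints/latest")  # placeholder path
```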