Which AutoShardPolicy should I use for distributed training with multiple workers?

I would like to train my model with multiple workers. However, I only have very limited memory (RAM) to work with. Which AutoShardPolicy is recommended here?

I have tried AutoShardPolicy.DATA, but I ran into memory issues.
The documentation says:
"Each worker will process the whole dataset and discard the portion that is not for itself."

As I understand it, the whole dataset needs to be small enough to fit in memory in the first place. Is this correct?
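
The quoted sentence can be illustrated with a plain-Python sketch. This is not TensorFlow's implementation, and `shard_by_data` and `counting_source` are hypothetical names used only here. It suggests that element-level sharding costs each worker a full pass of reading, not necessarily a full copy in RAM; whether memory fills up probably depends on how the dataset is built, for example whether it starts from arrays that are already loaded into memory.

```python
def shard_by_data(elements, num_workers, worker_index):
    """Yield only this worker's share, but iterate over every element."""
    for position, element in enumerate(elements):
        if position % num_workers == worker_index:
            yield element
        # Elements that belong to other workers are read and then dropped.


def counting_source(n, counter):
    """Stand-in data source that records how many elements were read."""
    for i in range(n):
        counter["read"] += 1
        yield i


counter = {"read": 0}
worker0 = list(shard_by_data(counting_source(10, counter),
                             num_workers=2, worker_index=0))
print(worker0)          # [0, 2, 4, 6, 8]
print(counter["read"])  # 10: the worker still read all ten elements
```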

Then I would have to use AutoShardPolicy.FILE, but I am not sure how to use it.
Assuming I have multiple files, what would they have to look like (I am working with the CIFAR-10 dataset)? And how would I create a file-based dataset?

Any help is appreciated!
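
One possible way to set up a file-based dataset, as a minimal sketch: write the images and labels into several TFRecord files, then build the dataset from the file names and request AutoShardPolicy.FILE. The helper names (`write_shards`, `parse_example`, `make_file_dataset`), the file naming scheme, and the feature keys "image" and "label" are assumptions made for this illustration. The arrays are assumed to be uint8 images of shape (32, 32, 3), as returned by `tf.keras.datasets.cifar10.load_data()`.

```python
import os

import numpy as np
import tensorflow as tf


def write_shards(images, labels, out_dir, num_shards):
    """Write (image, label) pairs into num_shards TFRecord files."""
    os.makedirs(out_dir, exist_ok=True)
    labels = np.asarray(labels).reshape(-1)
    paths = []
    for shard in range(num_shards):
        path = os.path.join(out_dir, f"train-{shard:03d}.tfrecord")
        with tf.io.TFRecordWriter(path) as writer:
            # Every num_shards-th example goes into this file.
            for i in range(shard, len(images), num_shards):
                image_bytes = images[i].astype(np.uint8).tobytes()
                feature = {
                    "image": tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[image_bytes])),
                    "label": tf.train.Feature(
                        int64_list=tf.train.Int64List(value=[int(labels[i])])),
                }
                example = tf.train.Example(
                    features=tf.train.Features(feature=feature))
                writer.write(example.SerializeToString())
        paths.append(path)
    return paths


def parse_example(serialized):
    """Decode one serialized example into a float image and an int label."""
    parsed = tf.io.parse_single_example(serialized, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_raw(parsed["image"], tf.uint8)
    image = tf.reshape(image, [32, 32, 3])
    return tf.cast(image, tf.float32) / 255.0, parsed["label"]


def make_file_dataset(file_pattern, batch_size):
    """Build a dataset whose source is a list of files, sharded by file."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    dataset = files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,
        num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.shuffle(1000).batch(batch_size).prefetch(tf.data.AUTOTUNE)

    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.FILE)
    return dataset.with_options(options)
```

With FILE sharding there should be at least as many files as workers, so `num_shards` would normally be chosen as a multiple of the worker count. The pattern passed to `make_file_dataset` would then be something like `os.path.join(out_dir, "train-*.tfrecord")`.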

You can do multi-worker distributed training with a Keras model and the Model.fit API using the tf.distribute.MultiWorkerMirroredStrategy API. With this strategy, a Keras model that was designed to run on a single worker can seamlessly work on multiple workers with minimal code changes.

I am using Model.fit and tf.distribute.MultiWorkerMirroredStrategy!
However, I have a different problem. I want to distribute my data across the workers, and I’m not sure that it works the way I imagine.

As I understand it, tf.data.experimental.AutoShardPolicy is responsible for distributing the data. How come I don’t see any difference in training time and memory usage between the settings AutoShardPolicy.DATA and AutoShardPolicy.OFF? Am I not supposed to notice a difference?
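
For reference, a minimal sketch of how the two settings are attached to a dataset, with a hand-made imitation of what one worker would receive. The helper `with_auto_shard_policy` and the toy `Dataset.range(8)` are assumptions for illustration, and the `Dataset.shard` call only imitates element-level sharding for worker 0 of 2; it is not what the strategy does internally.

```python
import tensorflow as tf


def with_auto_shard_policy(dataset, policy):
    """Return the dataset with the given AutoShardPolicy attached."""
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = policy
    return dataset.with_options(options)


# Toy stand-in for the real training data.
dataset = tf.data.Dataset.range(8)

sharded = with_auto_shard_policy(
    dataset, tf.data.experimental.AutoShardPolicy.DATA)
unsharded = with_auto_shard_policy(
    dataset, tf.data.experimental.AutoShardPolicy.OFF)

# Roughly what worker 0 of 2 would keep under DATA: every second element.
worker0_view = [int(x) for x in dataset.shard(num_shards=2, index=0).as_numpy_iterator()]
# Under OFF nothing is sharded, so each worker iterates over everything.
off_view = [int(x) for x in dataset.as_numpy_iterator()]

print(worker0_view)  # [0, 2, 4, 6]
print(off_view)      # [0, 1, 2, 3, 4, 5, 6, 7]
```

If the data is built from arrays that are already in RAM, each worker holds the full arrays under either setting, which may be one reason memory usage looks the same. The visible difference would then be in which elements each worker trains on, not in how much is loaded.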