MirroredStrategy to distribute training and related computation over multiple GPUs. My code gets stuck at a strategy.gather() call. Upon inspection I noticed that the input PerReplica tensors to the gather, returned by an earlier strategy.run() call, are all placed on GPU:0 (while I am using 4 GPUs). However, the backing_device of each PerReplica component shows the correct device (GPU:0 through GPU:3).

How can this occur? Why does strategy.run() place all of its output tensors on GPU:0? I'm not using any special device placement or cross-device ops configuration.
Could the fact that all PerReplica tensors are placed on the first device be the reason that strategy.gather() gets stuck?
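For reference, a minimal sketch of the pattern I am describing. To keep it runnable on a machine without GPUs, it splits the CPU into two logical devices (an assumption for illustration only; on my machine the strategy spans the 4 physical GPUs instead):

```python
import tensorflow as tf

# Simulate two devices on CPU so the sketch runs anywhere; on a
# multi-GPU machine you would create MirroredStrategy() directly.
cpus = tf.config.list_physical_devices("CPU")
tf.config.set_logical_device_configuration(
    cpus[0],
    [tf.config.LogicalDeviceConfiguration(),
     tf.config.LogicalDeviceConfiguration()])

strategy = tf.distribute.MirroredStrategy(["CPU:0", "CPU:1"])

@tf.function
def step():
    # Each replica produces a [2, 3] tensor; strategy.run returns
    # a PerReplica wrapping one tensor per replica.
    replica_id = tf.distribute.get_replica_context().replica_id_in_sync_group
    return tf.ones([2, 3]) * tf.cast(replica_id, tf.float32)

per_replica = strategy.run(step)

# Inspect where each component claims to live (.device) versus where
# its memory actually is (.backing_device).
for t in strategy.experimental_local_results(per_replica):
    print(t.device, t.backing_device)

# Concatenate the per-replica tensors along axis 0; this is the call
# that hangs in my real code.
gathered = strategy.gather(per_replica, axis=0)
print(gathered.shape)  # (4, 3): two replicas of [2, 3] concatenated
```

In my real code the loop over experimental_local_results is where I see .device report GPU:0 for every component while .backing_device reports GPU:0 through GPU:3.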
Update: I verified that the CollectiveAllReduce cross-device ops are used with the MirroredStrategy, with NCCL as the communication implementation. Here is a log line that backs this up:

    INFO : tensorflow::_batch_all_gather : Collective batch_all_gather: 1 all-gathers, num_devices = 4, group_size = 4, implementation = CommunicationImplementation.NCCL,
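One debugging step I am considering (my own idea, not something confirmed to fix this): constructing the strategy with an explicit cross_device_ops to take NCCL out of the picture. ReductionToOneDevice copies all per-replica values to a single device instead of using collectives. The single-CPU device list below is only so the snippet runs anywhere; on the 4-GPU machine the GPUs would be listed instead:

```python
import tensorflow as tf

# Hypothetical debugging variant: bypass NCCL collectives entirely by
# reducing/gathering on one device. Device list is illustrative.
strategy = tf.distribute.MirroredStrategy(
    devices=["CPU:0"],
    cross_device_ops=tf.distribute.ReductionToOneDevice())
print(strategy.num_replicas_in_sync)  # 1 replica with this device list
```

If the gather completes with ReductionToOneDevice but hangs with the default NCCL path, that would point at the NCCL collective rather than the device placement.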