All PerReplica Tensors on device GPU:0, backing_device is correct

I'm using MirroredStrategy to distribute training and related computation over multiple GPUs. My code gets stuck at a strategy.gather() call. Upon inspection I noticed that the PerReplica tensors passed to the gather call, returned from an earlier strategy.run() call, all report device GPU:0 (while I am using 4 GPUs). However, the backing_device of each PerReplica value shows the correct device (GPU:0 to GPU:3).

How can this happen? Why are all tensors returned by strategy.run() placed on GPU:0? I'm not using any special cross_device_ops.

Is the fact that all PerReplica tensors end up on the first device the reason that strategy.gather() gets stuck?
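For context, here is a stripped-down sketch of the pattern (the real step function is more involved; replica_step and its return values are simplified stand-ins that merely reproduce the kind of per-replica outputs shown further down):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # picks up all 4 visible GPUs, default cross_device_ops

def replica_step():
    # Stand-in for the real per-replica computation; each replica returns a
    # couple of small tensors, similar to the metric inputs shown below.
    ctx = tf.distribute.get_replica_context()
    count = tf.ones([1], dtype=tf.int32)
    label = tf.reshape(tf.cast(ctx.replica_id_in_sync_group % 2, tf.int64), [1])
    return count, label

metric_inputs = strategy.run(replica_step)            # tuple of PerReplica values
gathered = strategy.gather(metric_inputs[1], axis=0)  # this is the call that gets stuck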

Update: I verified that the MirroredStrategy uses the CollectiveAllReduce cross-device ops with NCCL. Here is a log statement that backs this up:

INFO : tensorflow::_batch_all_gather : Collective batch_all_gather: 1 all-gathers, num_devices = 4, group_size = 4, implementation = CommunicationImplementation.NCCL,

Some further information:

>>> self._distribution_strategy
<tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7f68ae375220>

>>> self._distribution_strategy._extended._cross_device_ops
<tensorflow.python.distribute.cross_device_ops.CollectiveAllReduce object at 0x7f68ae3754c0>

>>> str(self._distribution_strategy._extended._cross_device_ops._options)
'Options(bytes_per_pack=0, timeout_seconds=None, implementation=CommunicationImplementation.NCCL)'
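For completeness, this is roughly how the strategy is created and how the values above were read out (no devices or cross_device_ops are passed explicitly; strategy.extended is the public handle, while _cross_device_ops and _options are private attributes, touched only for debugging):

import tensorflow as tf

# Strategy created with defaults: all 4 visible GPUs, no cross_device_ops argument.
strategy = tf.distribute.MirroredStrategy()

print(strategy)                                           # MirroredStrategy object
print(strategy.extended._cross_device_ops)                # CollectiveAllReduce object
print(str(strategy.extended._cross_device_ops._options))  # NCCL implementation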

Example of the PerReplica input tensors to strategy.gather, as returned by strategy.run:

>>> metric_inputs
(PerReplica:{
0: <tf.Tensor: shape=(1,), dtype=int32, numpy=array([1], dtype=int32)>,
1: <tf.Tensor: shape=(1,), dtype=int32, numpy=array([1], dtype=int32)>,
2: <tf.Tensor: shape=(1,), dtype=int32, numpy=array([1], dtype=int32)>,
3: <tf.Tensor: shape=(1,), dtype=int32, numpy=array([1], dtype=int32)>
}, PerReplica:{
0: <tf.Tensor: shape=(1,), dtype=int64, numpy=array([0])>,
1: <tf.Tensor: shape=(1,), dtype=int64, numpy=array([0])>,
2: <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>,
3: <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>
})

The value for replica 2 of the second PerReplica input: note how device is GPU:0 while backing_device is GPU:2 (the correct device):

>>> metric_inputs[1].values[2].device
'/job:localhost/replica:0/task:0/device:GPU:0'

>>> metric_inputs[1].values[2].backing_device
'/job:localhost/replica:0/task:0/device:GPU:2'
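The same holds for every value of both PerReplica inputs, which a quick loop confirms: device is GPU:0 everywhere, while backing_device runs from GPU:0 to GPU:3:

# Every replica value reports device GPU:0, while backing_device is the
# expected per-replica GPU (GPU:0 through GPU:3).
for i, per_replica in enumerate(metric_inputs):
    for replica_id, value in enumerate(per_replica.values):
        print(i, replica_id, value.device, value.backing_device)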

Why are the tensors moved to GPU:0 after strategy.run when using MirroredStrategy in combination with CollectiveAllReduce over NCCL?

Further, is this why strategy.gather gets stuck?