Distributed Training

I started doing distributed training (using 2 GPUs) in TensorFlow, following the distributed training tutorial online. The code runs and computes the loss for a batch on each GPU, but the model update step takes a very long time. What could be going wrong?

Hi @Kavya_Saxena,

Could you please elaborate on the issue and, if possible, describe the steps you are taking (please provide the code or a reference link)?

There could be various reasons for a slow model update, such as communication overhead between the GPUs, inefficient GPU utilization, hardware limitations, batch size, frequency of synchronization, etc.
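In case it helps while you prepare your code: below is a minimal sketch of a custom training loop with `tf.distribute.MirroredStrategy`, roughly following the standard tutorial. The model, dataset, and batch sizes are placeholders, not your actual setup. Two things that commonly make the update step slow are illustrated: forgetting the `@tf.function` decorator on the train step (so every step runs eagerly) and not scaling the global batch size with the number of replicas.

```python
# Minimal sketch of synchronous data-parallel training with tf.distribute.MirroredStrategy.
# Model, dataset, and hyperparameters are illustrative placeholders only.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs (2 in your case)
print("Number of replicas:", strategy.num_replicas_in_sync)

# Scale the global batch size with the number of replicas so each GPU gets a full batch.
PER_REPLICA_BATCH = 64
GLOBAL_BATCH = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

# Dummy dataset just to make the sketch runnable.
x = tf.random.normal([1024, 32])
y = tf.random.uniform([1024], maxval=10, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(GLOBAL_BATCH).prefetch(tf.data.AUTOTUNE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    optimizer = tf.keras.optimizers.Adam()

def compute_loss(labels, logits):
    per_example_loss = loss_fn(labels, logits)
    # Average over the *global* batch so gradients are scaled correctly across replicas.
    return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH)

@tf.function  # without this decorator every step runs eagerly and can be very slow
def train_step(dist_inputs):
    def step_fn(inputs):
        features, labels = inputs
        with tf.GradientTape() as tape:
            logits = model(features, training=True)
            loss = compute_loss(labels, logits)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    per_replica_losses = strategy.run(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for epoch in range(2):
    for batch in dist_dataset:
        loss = train_step(batch)
    print("epoch", epoch, "last step loss:", float(loss))
```

If your code already looks similar, profiling a few steps with the TensorFlow Profiler (`tf.profiler.experimental.start`/`stop`) and viewing the trace in TensorBoard should show whether the time is spent in the all-reduce communication, the input pipeline, or the compute itself.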

Thanks.