Distributed Training with different GPU models

Hi,

We are looking to buy new GPUs for our lab.
Suppose we purchase an A6000 today and an A6000 Ada later; would I be able to utilize both GPUs efficiently?
I guess the core of the question is what the implications are of distributed training with one slower and one faster GPU.

Much appreciated.

Hi @Curtis_To, you can use different GPUs for distributed training. During distributed training, each GPU receives a portion of the data and computes its gradients independently. The gradients are then averaged across GPUs to update the model parameters. In this process, if the slower GPU takes more time to complete its computation, it delays the overall training progress because the faster GPU has to wait until the slower GPU finishes. A minimal sketch of this is below. Thank you!
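Here is a rough sketch of synchronous data-parallel training with PyTorch `DistributedDataParallel`, just to show where the waiting happens; the model, tensor sizes, and script name are placeholders, not anything specific to your setup:

```python
# Minimal sketch of synchronous data-parallel training, assuming one process
# per CUDA device (e.g. cuda:0 = A6000, cuda:1 = A6000 Ada).
# Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = nn.Linear(1024, 10).cuda(rank)       # placeholder model
    ddp_model = DDP(model, device_ids=[rank])    # wraps gradient all-reduce
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        # Each rank works on its own shard of the batch (random data here).
        inputs = torch.randn(32, 1024, device=rank)
        targets = torch.randint(0, 10, (32,), device=rank)

        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        # backward() triggers the all-reduce of gradients: both GPUs must
        # reach this point, so the faster card idles until the slower one
        # finishes its backward pass.
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because the gradient all-reduce is synchronous, every step takes as long as the slowest rank, which is why pairing an A6000 with a faster A6000 Ada leaves the Ada partly idle.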

Thank you for your prompt response.
So would you recommend getting exactly the same model of GPU for distributed training?

I am also guessing using SLI on different GPU models is not advised.
On a tangent, does that mean federated learning with different nodes also requires exactly the same GPU models, since the whole learning process would be bottlenecked by the slowest GPU?

Hi @Curtis_To, as far as I know, in synchronous federated learning the server waits in each round for the device that finishes last, and then aggregates all the updates into the shared central model.

In asynchronous federated learning, by contrast, the nodes do not wait for other nodes to complete their computations before proceeding with their local updates. Each node computes its local gradients and can immediately send them to the central server.

So in the synchronous case, the whole learning process is bottlenecked by the slowest GPU. A toy illustration is below. Thank you.
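This is not a real federated learning framework, just a toy comparison of the two scheduling styles; the node names and per-step times are made up for illustration:

```python
# Toy illustration of why a synchronous round is paced by its slowest node,
# while an asynchronous server keeps accepting updates from faster nodes.
node_step_times = {"node_A": 1.0, "node_B": 1.1, "node_C": 2.5}  # seconds, hypothetical

def synchronous_round(times):
    # The server waits for every node before aggregating once,
    # so the round time equals the slowest node's time.
    return max(times.values())

def asynchronous_updates(times, horizon=10.0):
    # Each node pushes an update as soon as it finishes, so faster nodes
    # contribute more updates within the same wall-clock window.
    return {node: int(horizon // t) for node, t in times.items()}

print("sync round time:", synchronous_round(node_step_times))        # 2.5 s
print("async updates in 10 s:", asynchronous_updates(node_step_times))
```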
