Effective learning rate when using tf.distribute.MirroredStrategy (one host, multi-GPU)

marcocintra · March 18, 2024, 3:21pm

Hi!

When using tf.distribute.MirroredStrategy (one host, multi-GPU), is the effective learning rate the desired learning rate scaled by the number of GPUs (that is, multiplied by the number of GPUs), or is it just the learning rate I would use with a single GPU?

For example, if I want a learning rate of 1E-3 on 1 GPU, I just set learning rate = 1E-3 (without using tf.distribute.MirroredStrategy). If I use tf.distribute.MirroredStrategy with 8 GPUs, should I set learning rate = 8E-3 (8 * 1E-3), the same way I multiply the batch size by 8 when scaling to 8 GPUs, or should I keep 1E-3 as the learning rate?

Thanks in advance!

Tim_Wolfe · March 18, 2024, 6:08pm

No. With tf.distribute.MirroredStrategy on multiple GPUs, the learning rate is not scaled automatically by the number of GPUs; the optimizer uses whatever value you set. Start with the same learning rate as for a single GPU and adjust based on what you observe. Scaling the learning rate with the number of GPUs is a heuristic that may help, but it requires experimentation.
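
To make the two options concrete, here is a minimal sketch in plain Python (no TensorFlow required) that computes the global batch size and the learning rate for both choices. The helper name `scale_for_replicas`, the `ScaledHyperparams` container, and the per-GPU batch size of 64 are assumptions for illustration only, not part of any TensorFlow API. In a real script the replica count would typically come from `strategy.num_replicas_in_sync`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledHyperparams:
    """Hypothetical container for the values you would pass to training."""
    global_batch_size: int
    learning_rate: float


def scale_for_replicas(base_lr, per_replica_batch_size, num_replicas, scale_lr=False):
    """Return the global batch size and learning rate for num_replicas GPUs.

    The batch size is always multiplied by the replica count. The learning
    rate is multiplied only when scale_lr=True (the linear-scaling heuristic);
    otherwise the single-GPU value is kept unchanged.
    """
    if num_replicas < 1:
        raise ValueError("num_replicas must be at least 1")
    global_batch_size = per_replica_batch_size * num_replicas
    learning_rate = base_lr * num_replicas if scale_lr else base_lr
    return ScaledHyperparams(global_batch_size, learning_rate)


# Assumed values: 1e-3 single-GPU learning rate, 64 examples per GPU, 8 GPUs.
baseline = scale_for_replicas(1e-3, 64, 8)               # keep the single-GPU rate
scaled = scale_for_replicas(1e-3, 64, 8, scale_lr=True)  # linear-scaling heuristic

print(baseline)
print(scaled)
```

Neither result is the "correct" one in general. A reasonable approach is to train with the baseline value first, then try the scaled value (or something in between) and compare validation metrics.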

marcocintra · March 18, 2024, 7:00pm

OK, I think this is why the TensorFlow guide (https://www.tensorflow.org/guide/distributed_training#use_tfdistributestrategy_with_keras_modelfit, see the end of the section) says that when scaling to N GPUs you will need to tune the learning rate, but DEPENDING ON THE MODEL. I think this matches what you’ve said: scaling may or may not be necessary, so it is not a rule, correct? Thanks!