Adding GPU mid-training

affectsai · May 12, 2022, 1:26pm

I am in the middle of training a model that takes about 2 hours per epoch. At 100 epochs, I’m looking at about 8 days in total. To speed things along, I will be adding a second GPU to my workstation later today. I am relatively new to this, so I have a question about doing so.

My model is compiled using a MirroredStrategy. My read of the documentation leads me to believe that I will need to recompile the model with an updated devices list after I add the second GPU. Is this correct, or will the model use the second GPU without recompiling?
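
For reference, here is a minimal sketch of the kind of setup described above. The tiny model and the `build_model` helper are hypothetical placeholders, not the actual training code. As far as I understand, `tf.distribute.MirroredStrategy()` with no `devices` argument uses whatever GPUs TensorFlow can see at the moment the strategy object is created, which is why I suspect the model has to be rebuilt under a new strategy once the second GPU is installed.

```python
import tensorflow as tf


def build_model():
    # Placeholder model (hypothetical); stands in for the real architecture.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")
    return model


# With no arguments, the strategy uses every GPU visible to TensorFlow when
# it is created (falling back to the CPU if none is found). An explicit list
# such as devices=["/gpu:0", "/gpu:1"] can be passed instead.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables have to be created inside the scope to be mirrored,
    # so the model is built and compiled here.
    model = build_model()

print("Replicas in sync:", strategy.num_replicas_in_sync)
```

On a single-GPU machine this should report one replica; the open question is whether anything beyond recreating the strategy and model is needed once a second GPU is present.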

Assuming I need to recompile the model, will the training still resume from the most recent checkpoint? Or will recompiling with additional GPUs invalidate the existing checkpoints?
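
To make the resume flow concrete, this is roughly what I have in mind: save weights under one strategy, then create a fresh strategy and model and load the weights back. Everything here is an assumption for illustration (the placeholder model, the `demo.weights.h5` file name, a weights-only file rather than a full training checkpoint). It runs both halves on the same machine, so it does not by itself show what happens when the replica count changes, and it says nothing about optimizer state.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def build_model():
    # Placeholder model (hypothetical); stands in for the real architecture.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")
    return model


# Hypothetical weights file in a temporary directory.
weights_path = os.path.join(tempfile.mkdtemp(), "demo.weights.h5")

# "Before": the model lives under the original strategy and saves its weights.
old_strategy = tf.distribute.MirroredStrategy()
with old_strategy.scope():
    old_model = build_model()
old_model.save_weights(weights_path)

# "After": a new strategy object is created (for example, after a restart
# with more GPUs) and the model is rebuilt and recompiled under it.
new_strategy = tf.distribute.MirroredStrategy()
with new_strategy.scope():
    new_model = build_model()
    new_model.load_weights(weights_path)

# The restored weights should match what was saved.
for old_w, new_w in zip(old_model.get_weights(), new_model.get_weights()):
    np.testing.assert_allclose(old_w, new_w)
```

If training is driven by `model.fit`, I assume the `initial_epoch` argument would then be used to continue the epoch count from where the checkpoint left off.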

Obviously I can find out the answers to both of these through experimentation after I add the GPU, but I have stakeholders who want to know the impact as soon as possible.