Is there a way to restart model.fit automatically?

I am training a relatively volatile model and the loss often becomes NaN. I know that if I were better at ML I could tune it so the loss never becomes NaN, but for now I just watch the output and restart training whenever it breaks. Is there a way to restart fitting automatically if loss or val_loss becomes NaN (or some other value), so I don't have to watch and restart manually every time something breaks? If this doesn't already exist, I think it would be a very useful addition for quickly training a model, or for setting up a loop that repeatedly trains multiple models.
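
Here is a minimal sketch of the kind of behaviour I have in mind, assuming a hypothetical `build_model()` that returns a freshly compiled model; `TerminateOnNaN` is a built-in Keras callback that stops `fit` when a batch loss becomes NaN or inf:

```python
import math
import tensorflow as tf

def fit_with_restarts(build_model, x, y, max_attempts=5, **fit_kwargs):
    """Re-run model.fit from scratch until it finishes without a NaN loss."""
    for attempt in range(max_attempts):
        model = build_model()  # fresh weights on every attempt
        history = model.fit(
            x, y,
            # Stop the current fit as soon as a batch loss becomes NaN or inf.
            callbacks=[tf.keras.callbacks.TerminateOnNaN()],
            **fit_kwargs,
        )
        losses = history.history.get("loss", [])
        if losses and all(math.isfinite(l) for l in losses):
            return model, history  # training finished cleanly
        print(f"Attempt {attempt + 1} diverged, restarting with fresh weights...")
    raise RuntimeError("Training diverged on every attempt.")
```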

There are a number of reasons why your loss may turn into NaN. A few things you could try:

  1. Try normalizing your input data (see the sketch after this list).
  2. Check the validity of your data to ensure there are no missing values.
  3. Use larger batch sizes if possible.
  4. Use a less complex model architecture.
  5. Reduce the dropout rate if you are using dropout, or add regularization.
  6. etc.
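
For the first two items, a rough sketch of what I mean, assuming your training inputs live in a NumPy array called `x_train` (a placeholder name):

```python
import numpy as np

# Check for missing or non-finite values before training (item 2).
assert np.all(np.isfinite(x_train)), "Input contains NaN or inf values"

# Normalize inputs to zero mean and unit variance (item 1).
mean = x_train.mean(axis=0)
std = x_train.std(axis=0) + 1e-7  # avoid division by zero
x_train = (x_train - mean) / std
```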

The point is, you would normally want to tweak your model, model parameters, and input data when you hit a stumbling block (like the loss becoming NaN), and it is in your best interest to make those changes manually. What use would restarting training be if you are just going to use the same build (model and hyperparameters), which will eventually cause the model to break all over again? I believe this is why people have not invested in building something like that.

Regardless, if you are looking for ways to do what you are suggesting, you could consider writing your own training loop, following this link. That way you have full control over how and when training is done. However, if the problem stems from your input data, implementing something like that will prove to be a huge task, and you'll be better off restarting your model manually.
Essentially, there are too many factors at play that can lead to your training breaking.
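
As a rough illustration of that idea, here is a minimal sketch of a custom training loop that aborts as soon as the loss stops being finite, so the caller can decide whether to restart. The `build_model()` function and the `train_ds` dataset are placeholders; the loop itself is standard TF 2.x:

```python
import tensorflow as tf

def train_once(build_model, train_ds, epochs=10):
    """Train a fresh model; return it, or None if the loss diverged."""
    model = build_model()
    optimizer = tf.keras.optimizers.Adam()
    loss_fn = tf.keras.losses.MeanSquaredError()
    for _ in range(epochs):
        for x_batch, y_batch in train_ds:
            with tf.GradientTape() as tape:
                preds = model(x_batch, training=True)
                loss = loss_fn(y_batch, preds)
            if not tf.math.is_finite(loss):
                return None  # signal divergence so the caller can restart
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return model

# Keep restarting from scratch until one run finishes without diverging.
model = None
while model is None:
    model = train_once(build_model, train_ds)
```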

I strongly suggest you work on improving your model and training parameters. You can use KerasTuner for that. There are other tools, such as Optuna, that help you tune your hyperparameters as well.
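
For example, a small KerasTuner sketch; the layer sizes, learning-rate range, and the `x_train`/`y_train` names are placeholders, not a recommendation for your specific model:

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    # Search over the width of one hidden layer and the learning rate.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int("units", 16, 128, step=16), activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            hp.Float("learning_rate", 1e-5, 1e-2, sampling="log")
        ),
        loss="mse",
    )
    return model

tuner = kt.RandomSearch(build_model, objective="val_loss", max_trials=20)
tuner.search(x_train, y_train, validation_split=0.2, epochs=5)
best_model = tuner.get_best_models(num_models=1)[0]
```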

The reason restarting the model would be useful for me is that about a third of the time it trains completely correctly with no NaN loss, but when the loss does fail, it fails within the first half of the first epoch. I agree that I should just tune my parameters so that my model doesn't crash. I still think that, due to the stochastic nature of ML, it would be useful to be able to restart model training quickly and automatically.
Thank you for the advice!

You're welcome, and I wish you well in your advancements in AI/ML.