Keras reports NaN loss during training for multiclass segmentation

Hi.

I am following this Keras tutorial to train a multiclass segmentation model. Instead of the CIHP dataset, which has 20 classes (28,280 training and 5,000 validation images), I am using mut1ny’s face/head segmentation dataset, which has 14 classes (5,620 training and 1,140 validation images).

Most images in mut1ny’s dataset are synthetic 3D human heads, but there are real, in-the-wild human shots as well. I have also incorporated a distractor set into the dataset.

During training, this is what I get:

Epoch 1/50
 415/1404 [=======>......................] - ETA: 14:15 - loss: nan - accuracy: 0.5197

The loss is always NaN, and accuracy never gets better than 0.5197. The training and validation curves are flat lines parallel to the x-axis.

What am I doing wrong? How can I fix this?

@Onur_Ozbek1 Welcome to the TensorFlow Forum!

NaN loss during training usually arises from numerical instability in the computations. This instability can have several causes, a common one being an excessive learning rate: when the learning rate is too high, the model can overshoot the optimum during gradient descent, destabilizing training and producing NaN values.

To fix the TensorFlow loss NaN error, you can try the following steps:

  1. Check your input data: make sure your input data is clean and free of missing or invalid values. You can use numpy.isnan() to check for NaN values in your data (see the sketch after this list).
  2. Use appropriate activation functions: choose an output activation that matches the problem. For multiclass segmentation, the final layer should apply softmax over the class channels; sigmoid is only appropriate for binary or multi-label outputs.
  3. Reduce the learning rate: If the learning rate is too high, reduce it. You can experiment with different learning rates to find the optimal value.
  4. Use gradient clipping: gradient clipping caps the gradient norm and can help prevent exploding gradients.
  5. Use a different loss function: if the loss does not match your labels (for example, categorical vs. sparse categorical cross-entropy for integer masks), switch to one appropriate for your problem.
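
Here is a minimal sketch that ties steps 1–5 together. It assumes your images and integer-valued masks are already loaded as NumPy arrays named `images` and `masks`, and that `backbone` is a placeholder for whatever feature extractor your model uses; the exact learning rate and clipping values are illustrative starting points, not settings from the tutorial.

```python
import numpy as np
import tensorflow as tf

# Step 1: sanity-check the input data for NaN/inf and label range
# (`images` and `masks` are assumed to be NumPy arrays you have loaded).
assert not np.isnan(images).any(), "NaN values found in images"
assert np.isfinite(images).all(), "Non-finite values found in images"
assert masks.min() >= 0 and masks.max() < 14, "Mask labels outside [0, 13]"

# Step 2: make sure the output layer applies softmax over the 14 classes
# (shown here on a hypothetical `backbone` model).
outputs = tf.keras.layers.Conv2D(14, kernel_size=1, activation="softmax")(backbone.output)
model = tf.keras.Model(backbone.input, outputs)

# Steps 3 and 4: lower the learning rate and clip the gradient norm.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)

# Step 5: match the loss to the labels; integer class masks pair with
# sparse categorical cross-entropy.
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"],
)
```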

Here are some additional tips for avoiding the TensorFlow loss NaN error:

  • Use a normalized dataset. Scaling the inputs (for example, pixel values into [0, 1]) helps prevent extreme values in the data from driving the loss to NaN (see the sketch below).
  • Use a regularizer. Regularizers can help to prevent overfitting and improve the stability of the model.
  • Use a validation set. A validation set is data held out from training; it lets you monitor how well the model generalizes and catch problems such as instability or overfitting early.
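
For the normalization and validation-set tips, a sketch along these lines could be used (assuming 8-bit images and masks stored in NumPy arrays named `images` and `masks`, and a simple 90/10 split):

```python
import numpy as np

# Scale 8-bit pixel values into [0, 1] so extreme inputs don't destabilize the loss
# (assumes `images` is a uint8 NumPy array of shape (N, H, W, 3)).
images = images.astype("float32") / 255.0

# Hold out a validation split to monitor generalization during training.
split = int(0.9 * len(images))
x_train, x_val = images[:split], images[split:]
y_train, y_val = masks[:split], masks[split:]

# Later:
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50)
```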

I hope this helps! Let us know if you have any further questions.