Custom loss producing NaNs after a few epochs

I am working on a regression model to extract some parameters from measured curves. The loss function I want to use is a combination of the MSE between the regressed and true parameters AND the MSE between the true input curve and the curve produced by the regressed parameters. It runs for a few epochs but eventually produces loss = NaN. Here’s the model I use:


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import backend as K


def create_model(input_shape, num_classes):

    input_layer = keras.Input(shape=input_shape)
    x = layers.Conv1D(
        filters=32, kernel_size=3, strides=2, activation="relu", padding="same"
    )(input_layer)
    x = layers.BatchNormalization()(x)

    x = layers.Conv1D(
        filters=64, kernel_size=3, strides=2, activation="relu", padding="same"
    )(x)
    x = layers.BatchNormalization()(x)

    x = layers.Conv1D(
        filters=128, kernel_size=5, strides=2, activation="relu", padding="same"
    )(x)
    x = layers.BatchNormalization()(x)

    x = layers.Conv1D(
        filters=256, kernel_size=5, strides=2, activation="relu", padding="same"
    )(x)
    x = layers.BatchNormalization()(x)


    x = layers.Flatten()(x)

    x = layers.Dense(
        2048, activation="relu", kernel_regularizer=keras.regularizers.L2()
    )(x)
    x = layers.Dropout(0.2)(x)

    x = layers.Dense(
        1024, activation="relu", kernel_regularizer=keras.regularizers.L2()
    )(x)
    x = layers.Dropout(0.2)(x)
    #output_layer = layers.Dense(num_classes, activation="softmax")(x)
    output_layer = layers.Dense(num_classes, activation="linear")(x)

    return keras.Model(inputs=input_layer, outputs=output_layer)

The model is then created and compiled via:

model = create_model(input_shape=input_shape, num_classes=num_classes)  # takes input layer for now
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=max_learning_rate, clipnorm=1),
    loss=custom_loss(x_data, t_final, lam, batch_size),
    metrics=['mse'],
)

The custom loss is:


def custom_coef(y_true, y_pred, x, t, lam, batch_size):

    # x holds the measured curves: channel 0 is the input curve Cp,
    # channel 1 is the target curve Ct
    Cp = x[:,:,0]
    Ct = x[:,:,1]

    # reconstruct the curve from the regressed parameters
    Ct_hat = func_2cfm_reformulated_keras(y_pred, t, Cp, batch_size, model='2cfm')

    # first term is the physical (curve) loss, second is the plain parameter MSE
    final_loss = K.mean(K.square(Ct_hat - Ct))*lam + K.mean(K.square(y_true - y_pred))

    print(f'The final loss is {final_loss}')
    return final_loss


def custom_loss(x, t, lam, batch_size):
    # close over the extra arguments so Keras only sees (y_true, y_pred)
    def phys_loss(y_true, y_pred):
        return custom_coef(y_true, y_pred, x, t, lam, batch_size)
    return phys_loss

Finally the forward model that maps the parameters to curves used in the loss is given by:

def func_2cfm_reformulated_keras(x0, t, Cp, batch_size, model):
    # x0 holds the regressed parameters (Fp, PS, ve, vp) for each curve in the batch

    output_list = []
    for b in range(batch_size):
        Fp     = x0[b,0] #changed to Fp
        PS     = x0[b,1]
        ve     = x0[b,2]
        vp     = x0[b,3]

        Te = ve/PS #1
        #Fp = Ktrans*PS/(PS-Ktrans) #2
        T = (vp+ve)/Fp #3
        Tp = vp/Fp #4

        #now we convert based on the 2CFM or 2CXM model

        #2CFM
        if model == '2cfm':
            Tplus  = Te
            Tminus = Tp
        elif model == '2cxm':
            Tplus  = 0.5*(T+Te+K.sqrt((T+Te)**2-4*Tp*Te))
            Tminus = 0.5*(T+Te-K.sqrt((T+Te)**2-4*Tp*Te))

        # build up the two exponential components recursively, point by point
        f_Tminus = [K.constant(0)]
        f_Tplus = [K.constant(0)]
        for ii in range(0, len(t)-1):
            xi   = (t[ii+1]-t[ii])/Tplus
            a    = Cp[b,:]*Fp*Tplus*(T-Tminus)/(Tplus-Tminus)
            aip  = (a[ii+1]-a[ii])/(t[ii+1]-t[ii])
            E0   = 1-K.exp(-xi)
            E1   = xi-E0

            new_val_Tplus = K.exp(-xi)*f_Tplus[ii]+a[ii]*E0+aip*Tplus*E1
            f_Tplus.append(new_val_Tplus)

            xi_2    = (t[ii+1]-t[ii])/Tminus
            a_2     = Cp[b,:]*Fp*Tminus*(Tplus-T)/(Tplus-Tminus)
            aip_2   = (a_2[ii+1]-a_2[ii])/(t[ii+1]-t[ii])
            E0_2    = 1-K.exp(-xi_2)
            E1_2    = xi_2-E0_2

            new_val_Tminus = K.exp(-xi_2)*f_Tminus[ii]+a_2[ii]*E0_2+aip_2*Tminus*E1_2
            f_Tminus.append(new_val_Tminus)

        integral_tensor = tf.stack(f_Tminus) + tf.stack(f_Tplus)
        output_list.append(integral_tensor)

    output_stack = tf.stack(output_list)
    
    return output_stack

The output is a tensor of size (b, Npoints), where b is batch_size and Npoints is the number of points in my curves (600). I achieve this by looping over the batch and over the individual points, then stacking everything up with tf.stack. I’m not sure this is the right way to do it, but I was able to get this method working with a simpler forward model. That model also had the loss = NaN issue, which I solved by reducing the size of the model and the batch_size. No such luck with this model, however.

I’ve tried batch_size = 1, lowering the learning rate to as low as 1e-8, using clipnorm, etc. I am certain that my input x and y values do not contain NaN, and x is scaled from 0 to 1 in all cases. Any other ideas? If I remove the first term in final_loss it trains without a problem, but if I leave that term in, even with lam = 0, it will eventually spit out a NaN. That sounds like the forward model produces NaNs, but it doesn’t. I assume the gradients are exploding, but I’m not sure how to check that, why it happens, or how to fix it. I’ve tried other optimizers as well (Adam was my first choice, no luck).

Hi Karl,

Whenever I’ve had problems with NaN losses after a few steps/epochs, it has nearly always boiled down to one of two possible causes:

  1. Using BatchNormalization: this is highly model dependent, but I’ve had cases where disabling BN made the NaN loss go away.

  2. Using an MSE without a small epsilon to guard against a zero result. I don’t know exactly why this fails, but my assumption is that during the gradient calculation an (almost-)zero loss leads to a division by (almost-)zero, causing backprop to skyrocket your weights. I’ve had this problem when using tf.linalg.norm; I’m not sure whether it applies to the Keras backend, but it might be a useful thing to try. Just add a small term like 1e-8 to the first term of your loss function (see the sketch after this list).
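For item 2, a minimal sketch of what that guard could look like inside custom_coef (the 1e-8 value is just a guess; any small constant should do):

    # add a small epsilon so the physical term never becomes exactly zero
    eps = 1e-8
    physical_term = K.mean(K.square(Ct_hat - Ct)) + eps
    final_loss = physical_term*lam + K.mean(K.square(y_true - y_pred))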

Another thing you can try to debug this is tf.debugging.enable_check_numerics together with running your training in eager mode. You might catch the point where a NaN is introduced that way.
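For example, something along these lines, reusing the compile call from your post (run_eagerly=True is what forces eager mode):

import tensorflow as tf

# raise an error as soon as any op produces an Inf or NaN
tf.debugging.enable_check_numerics()

model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=max_learning_rate, clipnorm=1),
    loss=custom_loss(x_data, t_final, lam, batch_size),
    metrics=['mse'],
    run_eagerly=True,  # eager mode, so the check points at the offending Python line
)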

Good luck!


I think this might be the issue. In the first loss term, if the predicted curve Ct_hat contains NaN, then the mean squared error will be NaN as well.

Could you please try changing the formula to use K.maximum(Ct_hat - Ct, 0), so that the residual going into the square is clamped at zero even when the prediction goes far off?
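For concreteness, a minimal sketch of that change inside custom_coef (only the first term is touched; whether this helps will depend on where the NaN actually originates):

    # clamp the residual at zero before squaring, as suggested above
    residual = K.maximum(Ct_hat - Ct, 0.0)
    final_loss = K.mean(K.square(residual))*lam + K.mean(K.square(y_true - y_pred))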

Just a thought; there might be other possibilities or challenges not mentioned above.

Thanks.

Hello, @karl_landheer.

I couldn’t completely understand from the problem description why you are using ‘softmax’ with an MSE loss. Softmax is mostly used with categorical cross-entropy (CCE) type losses, for classification problems with one-hot labels. If this is a classification problem, it is not a good idea to use MSE; use a CCE-type loss instead. If this is a regression problem, I am not sure you need the softmax; maybe use a sigmoid, or just remove it.

Anyway, I recently made this post - it shows how TensorFlow GradientTape can be used to study what exactly is going on during the gradient calculations in model.fit() and to experiment with custom losses. It should be possible to check with this method where the NaN comes from (this can happen, for example, when an error-gradient buffer very close to zero, coming from incorrect labels, is backpropagated).
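As a rough sketch of that approach (x_batch, y_batch and loss_fn are placeholders here - they would come from your own data pipeline and from custom_loss(x_batch, t_final, lam, batch_size)):

import tensorflow as tf

with tf.GradientTape() as tape:
    y_pred = model(x_batch, training=True)   # forward pass on one batch
    loss_value = loss_fn(y_batch, y_pred)    # the custom loss defined above

# look for exploding or non-finite gradients, layer by layer
grads = tape.gradient(loss_value, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    if grad is None:
        continue
    if not tf.reduce_all(tf.math.is_finite(grad)):
        print('non-finite gradient in', var.name)
    else:
        print(var.name, 'grad norm:', float(tf.norm(grad)))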