Backpropagation for all-zero label vectors in softmax/cross-entropy

We use a ResNet to predict several classification tasks (multitask learning). Some of the labels are not available for every data point, so we decided to give those samples an all-zero label vector, which should yield a zero derivative for that particular loss. However, this does not seem to be the case for every combination of the softmax layer and the categorical cross-entropy loss: the first variant below yields a non-zero gradient, while the second implementation yields a zero gradient. The latter is what I expect.
I assumed that both implementations do the same thing, so I tried to look into the function definitions but could not understand the difference. Is this expected behaviour? How do the implementations differ?

import tensorflow as tf
from tensorflow import keras

# Variant 1: Dense layer with a built-in softmax activation
x = tf.constant([[0.1, 0.90]])
y = tf.constant([[0.0, 0.0]])
with tf.GradientTape() as g:
  g.watch(x)
  z = keras.Sequential(
         keras.layers.Dense(2, activation = "softmax", use_bias = False)
      )(x)
  z = tf.losses.categorical_crossentropy(y, z)
dz_dx = g.gradient(z, x)
print(dz_dx)  # non-zero gradient
# Variant 2: linear Dense layer followed by a separate Softmax layer
x = tf.constant([[0.1, 0.90]])
y = tf.constant([[0.0, 0.0]])
with tf.GradientTape() as g:
  g.watch(x)
  z = keras.Sequential(
         [keras.layers.Dense(2, activation = "linear", use_bias = False),
          keras.layers.Softmax(axis = -1)]
      )(x)
  z = tf.losses.categorical_crossentropy(y, z)
dz_dx = g.gradient(z, x)
print(dz_dx)  # zero gradient

@Janushki,

Welcome to the TensorFlow Forum!

with tf.GradientTape() as g:
  g.watch(x)
  z = keras.Sequential(
         keras.layers.Dense(2, activation = "softmax", use_bias = False)
      )(x)
  z = tf.losses.categorical_crossentropy(y, z)
dz_dx = g.gradient(z, x)

Here the gradients are being computed with respect to the softmax outputs.

with tf.GradientTape() as g:
  g.watch(x)
  z = keras.Sequential(
         [keras.layers.Dense(2, activation = "linear", use_bias = False),
          keras.layers.Softmax(axis = -1)]
      )(x)
  z = tf.losses.categorical_crossentropy(y, z)
dz_dx = g.gradient(z, x)

Whereas here the gradients are being computed with respect to the raw logits.

When using categorical cross-entropy loss, if the label vector is all zeros, the loss and gradient with respect to the logits should be zero. However, the non-linear nature of softmax may cause non-zero gradients with respect to the softmax output.
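
As a quick check, the loss value itself is indeed zero for an all-zero target. A minimal sketch, using made-up probabilities rather than the model above:

import tensorflow as tf

# all-zero target and some arbitrary predicted probabilities
y = tf.constant([[0.0, 0.0]])
p = tf.constant([[0.3, 0.7]])

# -sum(y * log(p)) vanishes term by term when every entry of y is zero
loss = tf.losses.categorical_crossentropy(y, p)
print(loss)  # a zero-valued loss (may be displayed as -0.)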

For the second implementation to work as expected, use from_logits=True in tf.losses.categorical_crossentropy().
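
For example, something along these lines (a sketch that drops the Softmax layer so the loss really does receive raw logits):

import tensorflow as tf
from tensorflow import keras

x = tf.constant([[0.1, 0.90]])
y = tf.constant([[0.0, 0.0]])
with tf.GradientTape() as g:
  g.watch(x)
  # linear Dense layer only, so z holds the raw logits
  z = keras.Sequential(
         [keras.layers.Dense(2, activation = "linear", use_bias = False)]
      )(x)
  # from_logits=True tells the loss to apply the softmax internally
  z = tf.losses.categorical_crossentropy(y, z, from_logits=True)
dz_dx = g.gradient(z, x)
print(dz_dx)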

Thank you!

Thank you :slight_smile:

Honestly, now I am a bit confused. Did you see that the second block has a Softmax layer after the linear layer? Sorry, my code formatting was bad…
I did add from_logits=True, however, and the gradient became non-zero. That confuses me as well, as it should not change anything? Or maybe it changes “y”.

I am quite sure that categorical cross-entropy in combination with a softmax layer should have zero gradients for an all-zero target vector. Here is the maths as I see it, maybe I made a mistake…
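
Writing $p = \mathrm{softmax}(Wx)$ for the network output, the categorical cross-entropy is

$$L(y, p) = -\sum_k y_k \log p_k.$$

With an all-zero target, every term of the sum is zero, so $L \equiv 0$ as a function of $x$ and of the weights, and hence

$$\frac{\partial L}{\partial x} = -\sum_k y_k \frac{\partial \log p_k}{\partial x} = 0,$$

independently of the softmax output (softmax guarantees $p_k > 0$, so the logarithms stay finite).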

@Janushki,

Sorry for the confusion. The issue is fixed in tf-nightly.

import tensorflow as tf
from tensorflow import keras
tf.keras.utils.set_random_seed(42)

x = tf.constant([[0.1,0.90]])
y = tf.constant([[0.0,.0]])
with tf.GradientTape() as g:
  g.watch(x)
  z = keras.Sequential(
         keras.layers.Dense(2, activation = "softmax", use_bias = False)
      )(x)
  z = tf.losses.categorical_crossentropy(y, z)
dz_dx = g.gradient(z, x)
print(dz_dx)

Output:

tf.Tensor([[0.5045723 0.78110194]], shape=(1, 2), dtype=float32)

import tensorflow as tf
from tensorflow import keras
tf.keras.utils.set_random_seed(42)

x = tf.constant([[0.1,0.90]])
y = tf.constant([[0.0,.0]])
with tf.GradientTape() as g:
  g.watch(x)
  z = keras.Sequential(
         [keras.layers.Dense(2, activation = "linear", use_bias = False),
          keras.layers.Softmax(axis = -1)]
      )(x)
  z = tf.losses.categorical_crossentropy(y, z)
dz_dx = g.gradient(z, x)
print(dz_dx)

Output:

tf.Tensor([[0.5045723 0.78110194]], shape=(1, 2), dtype=float32)

For more details, please refer to the keras-team/keras@304bb3d commit and the gist.

Thank you!