Basic text classification sample: why is from_logits=True needed?

Hi, I’m running a basic text classification sample from TensorFlow here: Text classification with movie reviews | TensorFlow Core

One thing I don’t understand is why we need to use from_logits=True with the BinaryCrossentropy loss. When I remove it and add activation="sigmoid" to the last Dense layer, binary_accuracy does not move at all during training.

Changed code:

model = tf.keras.Sequential([
  layers.Embedding(max_features + 1, embedding_dim),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(1, activation="sigmoid")]) # <-- Add activation = sigmoid here

model.compile(loss=losses.BinaryCrossentropy(), # Remove from_logits=True here
              optimizer='adam',
              metrics=tf.metrics.BinaryAccuracy(threshold=0.0))

epochs = 10
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs)

Training output:

Epoch 1/10
625/625 [==============================] - 4s 4ms/step - loss: 0.6635 - binary_accuracy: 0.4981 - val_loss: 0.6149 - val_binary_accuracy: 0.5076
Epoch 2/10
625/625 [==============================] - 2s 4ms/step - loss: 0.5492 - binary_accuracy: 0.4981 - val_loss: 0.4990 - val_binary_accuracy: 0.5076
Epoch 3/10
625/625 [==============================] - 2s 4ms/step - loss: 0.4453 - binary_accuracy: 0.4981 - val_loss: 0.4208 - val_binary_accuracy: 0.5076
Epoch 4/10
625/625 [==============================] - 2s 4ms/step - loss: 0.3792 - binary_accuracy: 0.4981 - val_loss: 0.3741 - val_binary_accuracy: 0.5076
Epoch 5/10
625/625 [==============================] - 3s 4ms/step - loss: 0.3360 - binary_accuracy: 0.4981 - val_loss: 0.3454 - val_binary_accuracy: 0.5076
Epoch 6/10
625/625 [==============================] - 3s 4ms/step - loss: 0.3054 - binary_accuracy: 0.4981 - val_loss: 0.3262 - val_binary_accuracy: 0.5076
Epoch 7/10
625/625 [==============================] - 3s 4ms/step - loss: 0.2813 - binary_accuracy: 0.4981 - val_loss: 0.3126 - val_binary_accuracy: 0.5076
Epoch 8/10
625/625 [==============================] - 3s 4ms/step - loss: 0.2616 - binary_accuracy: 0.4981 - val_loss: 0.3033 - val_binary_accuracy: 0.5076
Epoch 9/10
625/625 [==============================] - 3s 4ms/step - loss: 0.2456 - binary_accuracy: 0.4981 - val_loss: 0.2967 - val_binary_accuracy: 0.5076
Epoch 10/10
625/625 [==============================] - 2s 4ms/step - loss: 0.2306 - binary_accuracy: 0.4981 - val_loss: 0.2920 - val_binary_accuracy: 0.5076

Hi @yaiba ,

I think there is a misunderstanding about from_logits=True with the BinaryCrossentropy loss. Your issue is not related to adding or removing activation="sigmoid" on the last Dense layer; it’s related to metrics=tf.metrics.BinaryAccuracy(threshold=0.0).

First, to clarify from_logits=True with the BinaryCrossentropy loss:

When you use the BinaryCrossentropy loss function with from_logits=True, the loss function expects the output of the last layer to be a linear combination of the weights and biases, without any additional activation function. This is because the BinaryCrossentropy loss function is designed to work with logits.

When you add an activation function to the last layer, such as sigmoid, the output of the layer is no longer a linear combination of the weights and biases. Instead, the output is a probability, where the probability of a positive label is given by the output of the sigmoid function.

If you need to use sigmoid in the last layer, then use the BinaryCrossentropy loss function with from_logits=False. This tells the loss function to expect the output of the last layer to be a probability.
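
For example, these two setups are equivalent as far as the loss is concerned (a minimal sketch with a placeholder Dense(16) body instead of the tutorial’s embedding stack; only the last layer, the loss, and the metric threshold matter here):

import tensorflow as tf
from tensorflow.keras import layers, losses

# Option A: last layer outputs raw logits, the loss applies the sigmoid internally
logits_model = tf.keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(1)])                        # no activation -> raw logit
logits_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer="adam",
    metrics=[tf.metrics.BinaryAccuracy(threshold=0.0)])    # logits: decision boundary at 0.0

# Option B: last layer outputs probabilities via sigmoid, the loss takes them as-is
prob_model = tf.keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid")])  # output in (0, 1)
prob_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False),
    optimizer="adam",
    metrics=[tf.metrics.BinaryAccuracy(threshold=0.5)])     # probabilities: decision boundary at 0.5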

Now, coming back to your issue: with the sigmoid activation the predicted probabilities are always greater than 0.0, so setting threshold=0.0 means every prediction is counted as positive. The binary accuracy therefore stays constant during training; it simply reflects the fraction of positive labels in your data (about 0.5, since the classes are roughly balanced).

Here is the reference link for how BinaryAccuracy works.
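
You can also see the threshold effect directly with the metric on its own (hypothetical predictions, just to illustrate):

import tensorflow as tf

# Sigmoid outputs are always in (0, 1), so with threshold=0.0 every prediction
# counts as the positive class and the accuracy only reflects the label balance.
y_true = [0., 1., 0., 1.]
y_prob = [0.1, 0.6, 0.4, 0.9]    # hypothetical sigmoid outputs

m = tf.keras.metrics.BinaryAccuracy(threshold=0.0)
m.update_state(y_true, y_prob)
print(m.result().numpy())        # 0.5 -- everything predicted positive

m = tf.keras.metrics.BinaryAccuracy()   # default threshold=0.5
m.update_state(y_true, y_prob)
print(m.result().numpy())        # 1.0 -- correct once the threshold matches probabilities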

I hope this helps!

Thanks.


Thanks @Laxma_Reddy_Patlolla for the explanation. I got it now.

When you use the BinaryCrossentropy loss function with from_logits=True, the loss function expects the output of the last layer to be a linear combination of the weights and biases, without any additional activation function.

IMHO a logit is any output that is not a probability (i.e. not normalized), rather than specifically the “raw output” from output = input * w + b?

A bit like here: machine learning - What is the meaning of the word logits in TensorFlow? - Stack Overflow

I also see another answer stating what you just said: machine learning - What is the meaning of the word logits in TensorFlow? - Stack Overflow
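
If it helps, here is a quick numeric check of that reading (just an illustration, not the tutorial’s code): the logit is the raw pre-activation score, sigmoid turns it into a probability, and BinaryCrossentropy gives the same loss either way as long as from_logits matches what you feed it:

import tensorflow as tf

logit = tf.constant([2.0])               # raw score from Dense(1) with no activation
prob = tf.sigmoid(logit)                 # ~0.88, what Dense(1, activation="sigmoid") would output

y_true = tf.constant([1.0])
loss_from_logit = tf.keras.losses.BinaryCrossentropy(from_logits=True)(y_true, logit)
loss_from_prob = tf.keras.losses.BinaryCrossentropy(from_logits=False)(y_true, prob)
print(loss_from_logit.numpy(), loss_from_prob.numpy())   # both ~= 0.1269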