Model.evaluate() does not yield the same accuracy as computing it manually using a for-loop

After following the transfer learning tutorial on TensorFlow's site, I have a question about how model.evaluate() works compared with calculating accuracy by hand.

At the very end, after fine-tuning, in the Evaluation and prediction section, we use model.evaluate() to calculate the accuracy on the test set as follows:

loss, accuracy = model.evaluate(test_dataset)
print('Test accuracy :', accuracy)
6/6 [==============================] - 2s 217ms/step - loss: 0.0516 - accuracy: 0.9740
Test accuracy : 0.9739583134651184

Next, we generate predictions manually from one batch of images from the test set as part of a visualization exercise:

# Retrieve a batch of images and labels from the test set and run it through the model
image_batch, label_batch = test_dataset.as_numpy_iterator().next()
predictions = model.predict_on_batch(image_batch).flatten()
# Apply a sigmoid since our model returns logits
predictions = tf.nn.sigmoid(predictions)
predictions = tf.where(predictions < 0.5, 0, 1)

However, it is also possible to extend this approach to generate predictions across the entire test set and compare them with the actual labels to yield an average accuracy:

all_acc = tf.zeros([], tf.int32)  # placeholder scalar; per-example accuracy indicators get appended to it
for image_batch, label_batch in test_dataset.as_numpy_iterator():
    predictions = model.predict_on_batch(image_batch).flatten()  # run the batch through the model and return logits
    predictions = tf.nn.sigmoid(predictions)  # apply the sigmoid to map logits to [0, 1]
    predictions = tf.where(predictions < 0.5, 0, 1)  # round down or up accordingly since it's a binary classifier
    accuracy = tf.where(tf.equal(predictions, label_batch), 1, 0)  # 1 for correct, 0 for incorrect
    all_acc = tf.experimental.numpy.append(all_acc, accuracy)
all_acc = all_acc[1:]  # drop the first placeholder element
avg_acc = tf.reduce_mean(tf.dtypes.cast(all_acc, tf.float16))
print('My Accuracy:', avg_acc.numpy())
My Accuracy: 0.974
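
For reference, the same statistic can also be accumulated with Keras's own metric object rather than a hand-rolled loop. This is only a sketch, assuming the model and test_dataset from the tutorial and a single logit per example:

metric = tf.keras.metrics.BinaryAccuracy(threshold=0.5)  # same 0.5 cut-off as the manual loop
for image_batch, label_batch in test_dataset.as_numpy_iterator():
    logits = model.predict_on_batch(image_batch).flatten()  # raw logits from the model
    metric.update_state(label_batch, tf.nn.sigmoid(logits))  # sigmoid first, then threshold at 0.5
print('Metric accuracy:', metric.result().numpy())

If model.evaluate() really does apply a sigmoid and then threshold at 0.5, this should print the same 0.974.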

Now, if model.evaluate() generates predictions by applying a sigmoid to the model's logit outputs and using a threshold of 0.5, as the tutorial suggests, my manually calculated accuracy should equal the accuracy reported by TensorFlow's model.evaluate() function. That is indeed the case for the tutorial: my accuracy of 0.974 matches the accuracy from model.evaluate(). However, when I run the same code with a model trained on the same convolutional base as the tutorial but on different images (Gabor patches rather than cats and dogs), my accuracy no longer equals the model.evaluate() accuracy:

current_set = set17  # define the set to process
all_acc = tf.zeros([], tf.float64)  # placeholder scalar; per-example accuracy indicators get appended to it
loss, acc = model.evaluate(current_set)  # now test the model's performance on the test set
for image_batch, label_batch in current_set.as_numpy_iterator():
    predictions = model.predict_on_batch(image_batch).flatten()  # run the batch through the model and return logits
    predictions = tf.nn.sigmoid(predictions)  # apply the sigmoid to map logits to [0, 1]
    predictions = tf.where(predictions < 0.5, 0, 1)  # round down or up accordingly since it's a binary classifier
    accuracy = tf.where(tf.equal(predictions, label_batch), 1, 0)  # 1 for correct, 0 for incorrect
    all_acc = tf.experimental.numpy.append(all_acc, accuracy)
all_acc = all_acc[1:]  # drop the first placeholder element
avg_acc = tf.reduce_mean(all_acc)
print('My Accuracy:', avg_acc.numpy())
print('Tf Accuracy:', acc)
My Accuracy: 0.832
Tf Accuracy: 0.675000011920929

Does anyone know why there would be a discrepancy? Does model.evaluate() not apply a sigmoid? Does it use a threshold other than 0.5? Or is it something else I'm not considering? Please note that my new model was trained on Gabor images, which are different from the cats and dogs in the tutorial, but the code was the same.
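
One way I thought of probing the threshold question (just a sketch, reusing the same model and current_set as above) is to compare a 0.5 cut-off applied to the sigmoid outputs against a 0.5 cut-off applied directly to the raw logits, since those are two different decision boundaries:

acc_on_probs = tf.keras.metrics.BinaryAccuracy(threshold=0.5)
acc_on_logits = tf.keras.metrics.BinaryAccuracy(threshold=0.5)
for image_batch, label_batch in current_set.as_numpy_iterator():
    logits = model.predict_on_batch(image_batch).flatten()  # raw logits
    acc_on_probs.update_state(label_batch, tf.nn.sigmoid(logits))  # sigmoid, then threshold at 0.5
    acc_on_logits.update_state(label_batch, logits)  # 0.5 threshold applied directly to the logits
print('Accuracy (sigmoid then 0.5):', acc_on_probs.result().numpy())
print('Accuracy (0.5 on raw logits):', acc_on_logits.result().numpy())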

Thank you in advance for any insight!

Hi @bryan

Welcome to the TensorFlow Forum!

Yes, this could be a possible reason why your model's evaluate() accuracy differs from your manually calculated accuracy: you are using completely different images from the ones the model was trained on.

To overcome this issue, you can use a larger and more diverse training dataset, or apply data augmentation techniques to increase the variety of the training data.
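
For example, a small augmentation block similar to the one used in the transfer learning tutorial could be placed in front of the base model. This is only a sketch; the specific layers and parameters are illustrative and should be tuned to your Gabor images:

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),  # random horizontal flips
    tf.keras.layers.RandomRotation(0.2),       # random rotations of up to 20% of a full turn
])
# e.g. x = data_augmentation(inputs) right after the tf.keras.Input layer,
# so the augmentation is only active during training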