Different Results for model.evaluate() compared to model()

Hi. I have trained a MobileNet model and, in the same script, used model.evaluate() on a set of test data to measure its performance. This test indicates nearly 97% accuracy. Here is the code that does this.

import os
import tensorflow.keras as keras
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import ModelCheckpoint

image_size_y = 1056 # The height of one input image
image_size_x = 1920 # The width of one input image

# Choose a width multiplier which changes the number of filters per layer
depth_mul = 1.0/8.0

# Set input shape for color images
shape = (image_size_y, image_size_x, 3)

# Import the MobileNet model and set input dimensions and hyperparameters.
model = MobileNet(input_shape=shape, alpha=depth_mul, weights=None, classes=2)

# Setting up the data directory paths
BaseDir = os.path.join('path','to','directory','containing','data')

train_dir = os.path.join(BaseDir,'train')
val_dir = os.path.join(BaseDir,'val')
test_dir = os.path.join(BaseDir,'test')

train_positive_dir = os.path.join(train_dir,'positive')
train_negative_dir = os.path.join(train_dir,'negative')

val_positive_dir = os.path.join(val_dir,'positive')
val_negative_dir = os.path.join(val_dir,'negative')

test_positive_dir = os.path.join(test_dir,'positive')
test_negative_dir = os.path.join(test_dir,'negative')

# Define desired Batch Size
batchsize = 32

# Only use data augmentation that generates images that could reasonably occur in a real-world situation (just scale brightness a bit)

train_datagen = ImageDataGenerator(
rescale= 1./255,
valid_datagen = ImageDataGenerator(rescale = 1./255)
test_datagen = ImageDataGenerator(rescale = 1./255)

# Create Data Generators for each group of data

train_generator = train_datagen.flow_from_directory(

validation_generator = valid_datagen.flow_from_directory(

test_generator = test_datagen.flow_from_directory(

# Compile the model for training
metrics = ['accuracy']

# Save the model at every epoch, overwriting each time, so the final version after the last epoch will remain and can be tested
finalNetwork = os.path.join('path','to','MobileNetsModel.h5')
mcf = ModelCheckpoint(finalNetwork)

# Train the network

history = model.fit(
steps_per_epoch = 40646 // batchsize,
epochs = 20,
validation_data = validation_generator,
validation_steps = 5080 // batchsize,
callbacks = [mcf]

# Evaluation on test data of the model after the final epoch of training
saved_model = load_model(finalNetwork)
_, test_acc = saved_model.evaluate(test_generator, verbose=0)
print("Final Model Accuracy = %.1f%%" % (100.0 * test_acc))


And then I created another piece of code to actually use the trained model, but it doesn’t seem to be working. I’m getting nearly 50% true positives and 50% false positives, so only 50% accuracy. Here is that code. Am I performing the inferences wrong in this code? Am I not saving or loading my model properly? Please help!

import os
from matplotlib import image
import tensorflow as tf
from tensorflow.keras.models import load_model

# Load a model that was trained and saved
model = load_model(os.path.join('path','to','MobileNetsModel.h5'))

# Set the directory containing the test images
datadir = os.path.join('directory','containing','jpgs')

# Get the filenames of all the test images
imgNames = os.listdir(datadir)

# Make inferences using the provided model

for imgName in imgNames:

    # Get the image
    img = image.imread(os.path.join(datadir, imgName))

    # Make an inference
    input = tf.convert_to_tensor(img)
    input = tf.image.resize(input, (1056, 1920))  # size is (height, width)
    input = input[None, :, :, :]  # add a batch dimension
    input = input / 255.0
    output = model(input)
    prob_pos = output.numpy()[0, 0] * 100
    prob_neg = output.numpy()[0, 1] * 100

    # Categorize inferences and output to console
    if prob_pos >= prob_neg:
        print(imgName, ' is positive')
    else:
        print(imgName, ' is negative')


I tried to read all the code but I got lost (maybe I need to sleep a little bit more :) )

Can you try your data by adapting this Colab: Retraining an Image Classifier | TensorFlow Hub

I modified the post getting rid of any extraneous code. Could you maybe look through it again? I checked out that link, and as far as I could tell I’m doing the same thing. I feel like I’m missing something.

Is it possibly because I have used jpg file format for my images?

One thing you could do is try to visualize some of the images from the train/evaluate/test data pipeline.

You’re using some very big images with a network that usually works on smaller images. The resize might be changing the image too much.
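One rough way to check how much a resize distorts an image is to compare the horizontal and vertical scale factors. This is only a sketch: `resize_distortion` is a hypothetical helper, and the 1080x1920 source size is an assumed example, since the thread does not state the original image dimensions. The 224x224 target is MobileNet's default input size.

```python
def resize_distortion(src_hw, dst_hw):
    """Return horizontal scale divided by vertical scale for a resize.

    Both arguments are (height, width) tuples. A result of 1.0 means the
    aspect ratio is preserved; values far from 1.0 mean the image is
    stretched or squished.
    """
    src_h, src_w = src_hw
    dst_h, dst_w = dst_hw
    scale_y = dst_h / src_h
    scale_x = dst_w / src_w
    return scale_x / scale_y

# Assumed source size (height, width); not taken from the thread.
source = (1080, 1920)

# Target used in the post: height 1056, width 1920.
landscape = resize_distortion(source, (1056, 1920))

# MobileNet's default square input, for comparison.
square = resize_distortion(source, (224, 224))

print(f"landscape target: {landscape:.3f}")
print(f"square target:    {square:.3f}")
```

A value close to 1.0 suggests the resize itself is probably not the problem, in which case it is worth looking at the order of the dimensions instead.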


I didn’t look into your code but a major difference between model.evaluate() and model() is that if you don’t run model(..., training=False) (where ... refers to the inputs) then the layers are not going to run in inference mode which is not ideal for layers like Dropout, BatchNorm, etc.
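To illustrate why the mode matters for a layer like Dropout, here is a minimal pure-Python sketch of inverted dropout. The `dropout` function is a hypothetical stand-in written for this post, not the Keras implementation; it only shows that training mode randomly zeroes and rescales activations while inference mode passes them through unchanged.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout on a list of floats (illustrative, not Keras code)."""
    if not training:
        # Inference mode: activations pass through untouched.
        return list(values)
    # Training mode: drop each value with probability `rate` and scale the
    # survivors so the expected sum stays the same.
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)
activations = [1.0] * 8

inference_out = dropout(activations, rate=0.5, training=False, rng=rng)
training_out = dropout(activations, rate=0.5, training=True, rng=rng)

print("inference:", inference_out)
print("training: ", training_out)
```

In the inference script from the question, the corresponding change would be calling the model as `model(input, training=False)`.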


Also, @fchollet explains the difference between model.predict() and model(...) in his book.




I visualized the data generator images and the resolutions were inverted (squished into portrait instead of landscape). I think the fit function automatically then rotated them to match the defined input size for the network. But then the model() operation doesn’t automatically rotate an input for you. So I swapped the x and y dimensions of the data generators. I will update this post after training and trying model() again after this change.


I don’t think this was the issue, but this was helpful. I will include training=False in my code. Thank you.

I have confirmed that the dimensions of images in my data generators were flipped. It appears that the fit() and evaluate() functions will automatically rotate images to fit the input of a model for you, however calling the model directly on an input will not. After fixing the order of my dimensions and retraining, calling the model directly gives me the same accuracy as using evaluate(). Thank you, everyone, for your help.
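For anyone hitting the same problem: both `target_size` in `flow_from_directory` and the `size` argument of `tf.image.resize` are given as (height, width), so a quick shape comparison before training can catch a swap early. Below is a small sketch of such a check. `check_batch_shape` is a hypothetical helper, not part of Keras, and the shapes passed to it here are written out by hand rather than read from a real model.

```python
def check_batch_shape(batch_shape, model_input_shape):
    """Raise ValueError if a batch does not match the model's input shape.

    Both shapes include the leading batch dimension, which is ignored,
    e.g. batch_shape=(32, 1056, 1920, 3) and
    model_input_shape=(None, 1056, 1920, 3).
    """
    expected = tuple(model_input_shape[1:])
    actual = tuple(batch_shape[1:])
    if actual != expected:
        hint = ""
        # Detect the specific case of height and width being exchanged.
        if len(actual) == 3 and (actual[1], actual[0], actual[2]) == expected:
            hint = " (height and width look swapped)"
        raise ValueError(
            f"batch shape {actual} != model input {expected}{hint}"
        )

# Matching shapes pass silently.
check_batch_shape((32, 1056, 1920, 3), (None, 1056, 1920, 3))

# Swapped height and width are reported.
try:
    check_batch_shape((32, 1920, 1056, 3), (None, 1056, 1920, 3))
except ValueError as err:
    message = str(err)
    print(message)
```

With a real Keras model, the two arguments could come from `next(test_generator)[0].shape` and `model.input_shape`.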
