Discrepancy between results reported by TensorFlow model.evaluate and model.predict

Hi everyone. I’m a relative newcomer to ML and have encountered a problem using TF’s model.evaluate and model.predict methods that other posts on the web don’t seem to answer.

So, I have a HuggingFace model (‘bert-base-cased’) that I’m using with TensorFlow and a custom dataset. I’ve: (1) tokenized my data; (2) split the data; (3) converted the data to TF dataset format; (4) instantiated, compiled and fit the model.

During training, the model behaves as expected: training and validation accuracy both go up. But when I evaluate the model on the test dataset using TF’s model.evaluate and model.predict, the results are very different. The accuracy reported by model.evaluate is higher (0.9152, roughly in line with the validation accuracy); the accuracy I compute from model.predict is about 10 percentage points lower (0.8187). (Maybe it’s just a coincidence, but the lower figure is similar to the training accuracy reported after the single epoch of fine-tuning.)

Can anyone figure out what’s causing this? I include snippets of my code below.

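# imports used by the snippets below
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, DefaultDataCollator, TFAutoModelForSequenceClassification

# tokenize the dataset
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-cased", use_fast=False)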

def tokenize_function(examples):
  return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# splitting dataset
trainSize = 0.7
valTestSize = 1 - trainSize
train_testvalid = tokenized_datasets.train_test_split(test_size=valTestSize, stratify_by_column='class')
valid_test = train_testvalid['test'].train_test_split(test_size=0.5, stratify_by_column='class')

# renaming each of the datasets for convenience
train_set = train_testvalid['train']
val_set = valid_test['train']
test_set = valid_test['test']

# converting the tokenized datasets to TensorFlow datasets
data_collator = DefaultDataCollator(return_tensors="tf")
tf_train_dataset = train_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8)
tf_validation_dataset = val_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8)
tf_test_dataset = test_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8)

# loading tensorflow model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=1)

# compiling the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.metrics.BinaryAccuracy()])

# fitting model
history = model.fit(tf_train_dataset,
          validation_data=tf_validation_dataset,
          epochs=1)

# Evaluating the model on the test data using `evaluate`
results = model.evaluate(x=tf_test_dataset, verbose=2) # reports binary_accuracy: 0.9152

# first attempt: calling the model directly on each batch (not model.predict) and thresholding the logits at 0
hits = 0
misses = 0
for x, y in tf_test_dataset:
  logits = tf.keras.backend.get_value(model(x, training=False).logits)
  labels = tf.keras.backend.get_value(y)
  for i in range(len(logits)):
    if logits[i][0] < 0:
      z = 0
    else:
      z = 1
    if z == labels[i]:
      hits += 1
    else:
      misses += 1
print(hits/(hits+misses)) # reports binary_accuracy: 0.8187

# second attempt at using model.predict method
modelPredictions = model.predict(tf_test_dataset).logits
testDataLabels = np.concatenate([y for x, y in tf_test_dataset], axis=0)
hits = 0
misses = 0
for i in range(len(modelPredictions)):
  if modelPredictions[i][0] >= 0:
    z = 1
  else:
    z = 0
  if z == testDataLabels[i]:
    hits += 1
  else:
    misses += 1

print(hits/(hits+misses)) # reports binary_accuracy: 0.8187
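
The two counting loops above can be collapsed into one vectorized check. The following is a minimal sketch using NumPy only; `accuracy_from_logits` is a hypothetical helper name (not part of TensorFlow or transformers), and it assumes logits of shape (N, 1) and 0/1 labels of shape (N,) or (N, 1). Flattening both arrays first avoids accidental broadcasting if the label array keeps a trailing dimension.

```python
import numpy as np


def accuracy_from_logits(logits, labels, threshold=0.0):
    """Fraction of examples where (logit >= threshold) matches the 0/1 label.

    Hypothetical helper for illustration; both inputs are flattened to 1-D.
    """
    logits = np.asarray(logits, dtype=float).reshape(-1)
    labels = np.asarray(labels).reshape(-1).astype(int)
    if logits.shape != labels.shape:
        raise ValueError("logits and labels must contain the same number of items")
    predictions = (logits >= threshold).astype(int)
    return float(np.mean(predictions == labels))


# Toy values, made up for illustration only (not from the real test set)
toy_logits = np.array([[-1.2], [0.3], [2.0], [-0.1]])
toy_labels = np.array([0, 1, 1, 1])
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```

With the variables from the snippets above, the call would presumably look like accuracy_from_logits(modelPredictions, testDataLabels).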

Things I’ve tried include: (1) different loss functions (it’s a binary classification problem, and each row of the dataset’s label column holds either 0 or 1); (2) different ways of unpacking the test dataset and feeding it to model.predict; (3) switching the ‘num_labels’ parameter between 1 and 2.
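
One possible source of the gap, offered as a hypothesis rather than a confirmed diagnosis: tf.metrics.BinaryAccuracy() uses a default threshold of 0.5, while the manual loops threshold the logits at 0. If the metric inside model.evaluate is being fed the raw logits, every example whose logit falls between 0 and 0.5 is counted differently by the two methods. The sketch below reproduces that effect with NumPy and made-up numbers; it does not call TensorFlow, and the arrays and the `accuracy_at` helper are illustrative assumptions, not values or code from the real test run.

```python
import numpy as np

# Made-up logits and labels, for illustration only
logits = np.array([-2.0, -0.4, 0.1, 0.3, 0.45, 0.9, 1.7, 2.5])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])


def accuracy_at(threshold):
    """Accuracy when a logit at or above `threshold` is predicted as class 1."""
    predictions = (logits >= threshold).astype(int)
    return float(np.mean(predictions == labels))


acc_zero = accuracy_at(0.0)  # what the manual loops compute
acc_half = accuracy_at(0.5)  # what a 0.5 threshold applied to raw logits would give

# Examples that the two thresholds classify differently
disputed = int(np.sum((logits >= 0.0) & (logits < 0.5)))

print(acc_zero, acc_half, disputed)
```

On the real test set, counting how many logits fall between 0 and 0.5 would show whether this accounts for the difference between 0.9152 and 0.8187.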

A big thank you to anyone who manages to solve this!