Model.evaluate() not returning expected loss in certain contexts

In the following snippet, I construct a very simple neural network and evaluate it on some synthetic data. I don’t train the network, just evaluate it with the initial weights. I’m using binary cross entropy, and I compute the loss in two ways:

  1. By calling model.evaluate().
  2. By calling model.predict() and manually computing the loss.

I expect to get the same result in each case, but I consistently get very different values.

import keras
import numpy as np

input_dimension = 100

# Synthetic data: a single sample and a single target; the network is never trained.
rng = np.random.default_rng(0)
train_xs = rng.uniform(0, 255, (1, input_dimension))
train_ys = rng.uniform(0, 1, (1, 1))

keras_model = keras.Sequential([
    keras.layers.Dense(1000, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])

keras_model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False),
    loss=keras.losses.binary_crossentropy,
    metrics=["accuracy"]
)

keras_model.build(input_shape=(1, input_dimension))

# (2) predictions from predict(), used for the manual loss computation below
y = keras_model.predict(train_xs)

# (1) loss as reported by evaluate(); the second return value is the accuracy metric
loss, _ = keras_model.evaluate(train_xs, train_ys, verbose=0)

print(loss)
print(keras.losses.binary_crossentropy(train_ys, y))

If the activation in the hidden layer is changed from ReLU to sigmoid, the discrepancy goes away. What’s going on?
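
For reference, this is the only change I mean (everything else in the script above stays the same):

# Hidden layer with sigmoid instead of ReLU: in this configuration the two
# loss values agree for me.
keras_model = keras.Sequential([
    keras.layers.Dense(1000, activation="sigmoid"),
    keras.layers.Dense(1, activation="sigmoid")
])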

I believe I’m closer to figuring it out. Here’s a simpler case:

import keras
import numpy as np
import tensorflow as tf

# Manual BCE on probabilities; the clipping (and the extra epsilon inside the
# logs) is only there to avoid log(0).
def binary_cross_entropy(y_hat, y):
    epsilon = 1e-7
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
    bce = y * np.log(y_hat + epsilon)
    bce += (1 - y) * np.log(1 - y_hat + epsilon)
    return -np.mean(bce)

# Constant inputs; the multiplier (10000 here) is the constant I vary below.
train_xs = np.ones((1, 1000)) * 10000
train_ys = np.zeros((1, 1))

tf.random.set_seed(2)
keras_model = keras.Sequential([
    keras.layers.Dense(1000, activation="linear"),
    keras.layers.Dense(1, activation="sigmoid")
])

keras_model.compile(
    loss=keras.losses.binary_crossentropy,
    metrics=[]
)

y = keras_model.predict(train_xs)
assert y.shape == train_ys.shape

metrics = keras_model.evaluate(train_xs, train_ys, verbose=0, return_dict=True)

print("Outputs:", y)
print(metrics["loss"])
print(keras.losses.binary_crossentropy(train_ys, y))
print(binary_cross_entropy(y, train_ys))

So now I’m using non-random, constant inputs, and the target is zero. I also changed the hidden layer’s activation to linear (the discrepancy also shows up with ReLU, just not with sigmoid). I’m computing the BCE in three ways:

  1. model.evaluate()
  2. model.predict() and Keras’s implementation of BCE
  3. model.predict() and my own implementation of BCE

(2) and (3) match, but (1) gives a very different value. What surprises me is that the loss should depend only on the outputs of the model, but if I change the multiplier on the train_xs line (10000 in the snippet above) to 1000 or 100000, the value of (1) scales with it, even though the output of the model stays constant (equal to 1) the whole time. For example, when the multiplier is 1000:

Outputs: [[1.]]
909.5416870117188
tf.Tensor([15.333239], shape=(1,), dtype=float32)
15.33323860168457

when it’s 10000:

Outputs: [[1.]]
9095.416015625
tf.Tensor([15.333239], shape=(1,), dtype=float32)
15.33323860168457

Notice that the output of model.evaluate() was scaled by a factor of 10, just like the inputs to the model. This leads me to the following hypothesis. Presumably, inside model.evaluate(), the model’s predictions are computed at a higher precision than what I get by running model.predict() myself. The sigmoid approaches 1 with an exponentially decaying gap, and scaling the inputs by 10 scales the inputs to the sigmoid by 10 (because the hidden layer is linear or ReLU), so the sigmoid’s output becomes, essentially, 1 - exp(-10t) instead of 1 - exp(-t). Because BCE takes a logarithm, that undoes the exponential and the loss comes out ten times higher. When I do the same calculation in (2) or (3), however, the output of the model is already saturated at 1, so nothing changes.
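
To make the arithmetic behind this hypothesis concrete, here is a small sketch in plain NumPy (my own code, not anything taken from Keras internals). If the loss were computed in log space from the pre-sigmoid value z rather than from the saturated float32 probability, then for a target of 0 the BCE reduces to softplus(z) = log(1 + exp(z)), which is approximately z for large z and therefore scales linearly with the input, just like the evaluate() numbers:

import numpy as np

# Target is 0, so BCE is -log(1 - sigmoid(z)).
for z in [1e3, 1e4, 1e5]:
    # Route A: through the saturated float32 probability, as in (2) and (3).
    p = np.float32(1.0) / (np.float32(1.0) + np.exp(np.float32(-z)))  # exactly 1.0 here
    p = np.clip(p, 1e-7, 1 - 1e-7)
    loss_from_probability = -np.log(1 - p)   # stuck near -log(1e-7), independent of z

    # Route B: directly from z in log space; softplus(z) = log(1 + exp(z)) ~= z.
    loss_from_logit = np.logaddexp(0.0, z)

    print(z, loss_from_probability, loss_from_logit)

Route A won’t reproduce the exact 15.33 that Keras prints (the clipping details differ), but the point is that it no longer depends on z, while route B scales linearly with z the way the evaluate() output does.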

This difference might seem trivial, but I think it has dramatic consequences. When I train certain networks using model.fit(), which apparently uses the same higher-precision calculation that model.evaluate() uses, I get 90% test accuracy. With the lower-precision manual approach (predictions plus my own loss computation, implemented in another library, namely JAX), the same scenario gives me 50% accuracy! I believe this discrepancy is ultimately behind issue #7171, which I raised on the JAX issue tracker (sorry, I’m not allowed to post a link).
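
For reference, the manual path in my JAX code looks roughly like this (a simplified sketch with a made-up input, not my actual training script): the float32 sigmoid saturates to exactly 1.0, so the clipped log can no longer tell how far into the saturated region the pre-sigmoid values are.

import jax.numpy as jnp
from jax.nn import sigmoid

def manual_bce(p, y, eps=1e-7):
    # Same idea as the NumPy version above: BCE on clipped probabilities.
    p = jnp.clip(p, eps, 1 - eps)
    return -jnp.mean(y * jnp.log(p) + (1 - y) * jnp.log(1 - p))

logits = jnp.array([[1000.0]])               # float32 by default in JAX
probs = sigmoid(logits)                      # saturates to exactly 1.0
print(manual_bce(probs, jnp.zeros((1, 1))))  # constant, no matter how large the logits get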

What I would like to know is: how does model.evaluate() compute the model’s predictions and the BCE loss? I tried stepping through it in a debugger, but the source code is opaque to me, jumping through layer after layer of wrappers and compilation steps. Concretely, how can I compute my model’s predictions, and then the loss, in exactly the same way that model.evaluate() does?
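
The kind of probe I have in mind is something like the following (a sketch only, reusing keras_model, train_xs and train_ys from the second snippet; I haven’t verified that what the loss function receives is exactly what evaluate() uses internally). It wraps the compiled loss so that it prints the y_pred it is given:

import tensorflow as tf

# Assumes the TensorFlow backend, where tf.print works inside the compiled loss.
def logging_bce(y_true, y_pred):
    tf.print("y_pred as seen by the loss:", y_pred)
    return keras.losses.binary_crossentropy(y_true, y_pred)

keras_model.compile(loss=logging_bce, metrics=[])
keras_model.evaluate(train_xs, train_ys, verbose=0)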