I believe I’m closer to figuring it out. Here’s a simpler case:
```python
import keras
import numpy as np
import tensorflow as tf

def binary_cross_entropy(y_hat, y):
    epsilon = 1e-7
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
    bce = y * np.log(y_hat + epsilon)
    bce += (1 - y) * np.log(1 - y_hat + epsilon)
    return -np.mean(bce)

train_xs = np.ones((1, 1000)) * 10000
train_ys = np.zeros((1, 1))

tf.random.set_seed(2)
keras_model = keras.Sequential([
    keras.layers.Dense(1000, activation="linear"),
    keras.layers.Dense(1, activation="sigmoid")
])
keras_model.compile(
    loss=keras.losses.binary_crossentropy,
    metrics=[]
)

y = keras_model.predict(train_xs)
assert y.shape == train_ys.shape
metrics = keras_model.evaluate(train_xs, train_ys, verbose=0, return_dict=True)
print("Outputs:", y)
print(metrics["loss"])
print(keras.losses.binary_crossentropy(train_ys, y))
print(binary_cross_entropy(y, train_ys))
```
So now I’m using non-random, constant inputs, and my target is zero. I also changed the activation in the hidden layer to linear (it also works with ReLU, just not with sigmoid). I’m computing the BCE in three ways:

1. `model.evaluate()`
2. `model.predict()` followed by Keras’s implementation of BCE
3. `model.predict()` followed by my own implementation of BCE
(2) and (3) match, but (1) gives a very different value. What surprises me is that the loss should depend only on the outputs of the model, yet if I change the constant in the definition of `train_xs` from 1000 to 10000 or 100000, the value of (1) keeps increasing, even though the output of the model stays constant the whole time (equal to 1). For example, when the constant is 1000:
```
Outputs: [[1.]]
909.5416870117188
tf.Tensor([15.333239], shape=(1,), dtype=float32)
15.33323860168457
```

when it’s 10000:

```
Outputs: [[1.]]
9095.416015625
tf.Tensor([15.333239], shape=(1,), dtype=float32)
15.33323860168457
```
Notice the output of `model.evaluate()` was scaled by a factor of 10, just like the inputs to the model. This leads me to the following hypothesis. Presumably, inside `model.evaluate()`, the model’s predictions are computed at a higher precision than what I get by running `model.predict()` myself. Because the sigmoid approaches 1 exponentially fast, scaling the inputs by 10 scales the inputs to the sigmoid by 10 (since I’m using linear or ReLU activations in the hidden layer), which makes the sigmoid’s output behave like 1 - exp(-10t) instead of 1 - exp(-t). Because BCE takes a logarithm, this undoes the exponential and the loss comes out ten times higher. However, when I do the same calculation in (2) or (3), the output of the model is saturated at 1, and so nothing changes.
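If this hypothesis is on the right track, the effect should be reproducible without any model at all: in float32, the sigmoid of a large logit rounds to exactly 1.0, so any BCE computed from the probability is capped by the internal epsilon clip, whereas a BCE computed directly from the logit keeps growing with the logit. Whether `evaluate()` achieves this via higher precision or by working with logits directly, I can’t tell; this is just a sketch of the magnitude of the effect (the `from_logits=True` path here is my stand-in for whatever avoids the saturation, not something I’ve confirmed in the Keras source):

```python
import tensorflow as tf

for z in [35.0, 350.0, 3500.0]:
    p = tf.sigmoid(z)  # rounds to exactly 1.0 in float32 for large z
    # BCE computed from the saturated probability: capped by the epsilon clip
    from_prob = tf.keras.losses.binary_crossentropy([0.0], [p])
    # BCE computed directly from the logit: grows linearly with z
    from_logit = tf.keras.losses.binary_crossentropy([0.0], [z], from_logits=True)
    print(float(p), float(from_prob), float(from_logit))
```

The probability-based loss stays pinned near the same capped value for all three logits, while the logit-based loss scales with `z`, which matches the factor-of-10 scaling I see in `model.evaluate()`.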
This difference might seem trivial, but I think it actually has dramatic consequences. When I train certain networks using `model.fit()`, which apparently uses the same higher-precision calculations that `model.evaluate()` uses, I get 90% test accuracy. At the lower precision of the manual `model.predict()` approach (implemented using another library, namely JAX), the same scenario gives 50% accuracy! I believe this discrepancy is what’s ultimately behind issue #7171 I raised on the JAX issue tracker (sorry, I’m not allowed to post a link).
What I would like to know is: how does `model.evaluate()` compute the model’s predictions and the BCE loss? I tried debugging it, but the source code is completely opaque to me, jumping through layer after layer of wrappers and compilation steps. How can I, for example, compute my model’s predictions in the same way that `model.evaluate()` does, and then compute the loss in the same way that `model.evaluate()` does?
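For reference, this is the kind of cross-check I’ve been attempting: redo the forward pass by hand in float64 using the layer weights, and compute the BCE from the pre-sigmoid logit `z` directly, using the identity `-log(1 - sigmoid(z)) = log(1 + exp(z))` for a zero target (evaluated stably with `np.logaddexp`). I’m not claiming this is what `evaluate()` actually does internally; it’s only my attempt at a higher-precision reference value:

```python
import numpy as np
import tensorflow as tf
import keras

tf.random.set_seed(2)
model = keras.Sequential([
    keras.layers.Dense(1000, activation="linear"),
    keras.layers.Dense(1, activation="sigmoid"),
])
train_xs = np.ones((1, 1000)) * 10000
model.predict(train_xs, verbose=0)  # run once so the weights get built

# Forward pass by hand in float64 (the hidden activation is linear, so no nonlinearity).
w1, b1 = model.layers[0].get_weights()
w2, b2 = model.layers[1].get_weights()
h = train_xs @ w1.astype(np.float64) + b1.astype(np.float64)
z = (h @ w2.astype(np.float64) + b2.astype(np.float64)).item()  # pre-sigmoid logit

# For target 0: BCE = -log(1 - sigmoid(z)) = log(1 + exp(z)), via logaddexp for stability.
loss64 = np.logaddexp(0.0, z)
print("logit:", z, "float64 BCE:", loss64)
```

Is something along these lines equivalent to what `model.evaluate()` computes, or does it take a different path entirely?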