MeanMetricWrapper produces inconsistent results on multiple runs

Hi, I have noticed an inconsistency where wrapping the metric with MeanMetricWrapper and passing the plain function produce different results some of the time (see the Google Colab notebook).
If you run the last cell multiple times, you will see instances where mean_squared_wrapped and mean_squared_error_fn are not equal to each other.
How can we explain this?

Thanks

I think you need to fix your colab:

NameError: name 'custom_mean_squared_error' is not defined

Thanks for pointing that out. I just updated the colab with the missing function:

def custom_mean_squared_error(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))

I suppose that you need to use something like:

def mean_squared_error_fn(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))

def squared_error_fn(y_true, y_pred):
    return tf.square(y_true - y_pred)
    
mean_squared_wrapped = tf.keras.metrics.MeanMetricWrapper(fn=squared_error_fn, name='mean_squared_wrapped')
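As a quick sanity check (just a sketch, assuming TF 2.x eager execution; the y_true/y_pred values are made up), the wrapper should agree with the reduced version on a single batch, because MeanMetricWrapper simply keeps a running mean of whatever squared_error_fn returns:

y_true = tf.constant([[0.0, 1.0], [1.0, 0.0]])
y_pred = tf.constant([[0.1, 0.8], [0.6, 0.2]])

mean_squared_wrapped.update_state(y_true, y_pred)
print(mean_squared_wrapped.result().numpy())                # mean of the per-element squared errors
print(tf.reduce_mean(tf.square(y_true - y_pred)).numpy())   # same value for a single batch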

I just tried using squared_error_fn (and updated the colab as well). It still gives inconsistent results on the first eval (after compile) sometimes.

I think that you need to keep both if you want to compare the wrapped one with mean_squared_error_fn.

Try to run this:

import numpy as np

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)   

tf.random.set_seed(0)
np.random.seed(0)

def squared_error_fn(y_true, y_pred):
    return tf.square(y_true - y_pred)

def mean_squared_error_fn(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))

mean_squared_wrapped = tf.keras.metrics.MeanMetricWrapper(fn=squared_error_fn, name='mean_squared_wrapped')

def custom_mean_squared_error(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))

def get_compiled_model():
    
    inputs = keras.Input(shape=(784,), name="digits")
    x = layers.Dense(64, activation="relu", name="dense_1")(inputs)
    x = layers.Dense(64, activation="relu", name="dense_2")(x)
    outputs = layers.Dense(10, activation="softmax", name="predictions")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam',
              loss=custom_mean_squared_error,
              metrics=['accuracy', mean_squared_wrapped, mean_squared_error_fn])
    return model

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

x_train = x_train.reshape(60000, 784).astype("float32") / 255
x_test = x_test.reshape(10000, 784).astype("float32") / 255

y_train = y_train.astype("float32")
y_test = y_test.astype("float32")

print("TF version: ",tf.__version__)
compiled_model = get_compiled_model()
one_hot_y_train = tf.one_hot(y_train, depth=10)
print(compiled_model.evaluate(x_train, one_hot_y_train, verbose=2))
print(compiled_model.evaluate(x_train, one_hot_y_train, verbose=2))
print(compiled_model.evaluate(x_train, one_hot_y_train, verbose=2))

Thanks for taking a look. So the issue happens only on repeated compiles, not on the first compile of the model.
I added a for loop in the new colab so that we can run it once and see the mismatch. I am not sure if this is expected.

Pasted the same here

import numpy as np

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)   

tf.random.set_seed(0)
np.random.seed(0)

def squared_error_fn(y_true, y_pred):
    return tf.square(y_true - y_pred)

def mean_squared_error_fn(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))
    
mean_squared_wrapped = tf.keras.metrics.MeanMetricWrapper(fn=squared_error_fn, name='mean_squared_wrapped')

def custom_mean_squared_error(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))

def get_compiled_model():
    inputs = keras.Input(shape=(784,), name="digits")
    x = layers.Dense(64, activation="relu", name="dense_1")(inputs)
    x = layers.Dense(64, activation="relu", name="dense_2")(x)
    outputs = layers.Dense(10, activation="softmax", name="predictions")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam',
              loss=custom_mean_squared_error,
              metrics=['accuracy', mean_squared_wrapped, mean_squared_error_fn])
    return model

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

x_train = x_train.reshape(60000, 784).astype("float32") / 255
x_test = x_test.reshape(10000, 784).astype("float32") / 255

y_train = y_train.astype("float32")
y_test = y_test.astype("float32")

print("TF version: ",tf.__version__)

one_hot_y_train = tf.one_hot(y_train, depth=10)

for i in range(3):
    compiled_model = get_compiled_model()
    eval1 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    eval2 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    
    metric_index = 2 # mean_squared_wrapped
    if abs(eval1[metric_index] - eval2[metric_index]) > 1e-5:
        print(f"mismatch found in compile: {i}")
        print("eval1: ", eval1)
        print("eval2: ", eval2)

Have you tried moving:
mean_squared_wrapped = tf.keras.metrics.MeanMetricWrapper(fn=squared_error_fn, name='mean_squared_wrapped')

inside the get_compiled_model function's scope?
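i.e. something like this (just a sketch of the same model-building function as above; the only change is that the metric is created inside the function, so every compile gets a fresh metric object):

def get_compiled_model():
    inputs = keras.Input(shape=(784,), name="digits")
    x = layers.Dense(64, activation="relu", name="dense_1")(inputs)
    x = layers.Dense(64, activation="relu", name="dense_2")(x)
    outputs = layers.Dense(10, activation="softmax", name="predictions")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    # fresh metric instance per model, so no state is shared across compiles
    mean_squared_wrapped = tf.keras.metrics.MeanMetricWrapper(fn=squared_error_fn, name='mean_squared_wrapped')
    model.compile(optimizer='adam',
              loss=custom_mean_squared_error,
              metrics=['accuracy', mean_squared_wrapped, mean_squared_error_fn])
    return model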


I just tried that and it fixed the issue. No more mismatch. Thanks.

So what is the conclusion for this issue? That MeanMetricWrapper has side effects?

I don’t know if something is cached internally.

/cc @Scott_Zhu What do you think?


"that MeanMetricWrapper has side effects?"

"something is cached internally."

The MeanMetricWrapper does have state (the running mean); is it possible that it's just not getting reset correctly?
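For example (a minimal standalone sketch reusing squared_error_fn from above; the numbers and the 'demo' name are just illustrative), the wrapped metric keeps accumulating across update_state calls until it is explicitly reset:

m = tf.keras.metrics.MeanMetricWrapper(fn=squared_error_fn, name='demo')
m.update_state([[1.0]], [[0.0]])   # squared error = 1.0
print(m.result().numpy())          # 1.0
m.update_state([[1.0]], [[1.0]])   # squared error = 0.0
print(m.result().numpy())          # 0.5 -- the running mean still carries the earlier batch
m.reset_state()                    # reset_states() on older TF versions
print(m.result().numpy())          # 0.0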

I have bumped into a related issue before, where compile edits the metric object and multiple compile calls stack up the modifications. This feels a little similar.

I think it also works with:

for i in range(3):
    compiled_model = get_compiled_model()
    eval1 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    eval2 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    compiled_model.reset_metrics()

Yes, this works:

for i in range(3):
    compiled_model = get_compiled_model()
    eval1 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    eval2 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    compiled_model.reset_metrics()

but this doesn’t:

for i in range(3):
    compiled_model = get_compiled_model()
    compiled_model.reset_metrics()
    eval1 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)
    eval2 = compiled_model.evaluate(x_train, one_hot_y_train, verbose=2)

I suppose that evaluate doesn’t reset the metrics at the end of the run, only at the beginning: