ModelCheckpoint fails to format filename if save_freq is used

Hi everyone.

I’m trying to save a Keras model with ModelCheckpoint callback every two epochs.

If I save the model every epoch with save_freq="epoch", everything is fine and I can use val_mean_absolute_error to format the filename. However, if I use save_freq=2 * int(ceil(train_size / batch_size)), which equals two epochs' worth of batches, Keras raises an error:

KeyError: 'Failed to format this callback filepath: "saved-model_{epoch:02d}_{val_mean_absolute_error:.2f}.h5". Reason: \'val_mean_absolute_error\'' 
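For context, the KeyError ultimately comes from Python's str.format: when the checkpoint is written at a batch boundary, the logs dict only contains training metrics, so the val_mean_absolute_error placeholder has nothing to bind to. A minimal pure-Python reproduction of the formatting failure (the logs contents here are illustrative):

```python
filepath = "saved-model_{epoch:02d}_{val_mean_absolute_error:.2f}.h5"

# Mid-epoch, the logs dict holds only training metrics -- no val_* keys yet.
logs = {"loss": 0.5, "mean_absolute_error": 0.4}

try:
    name = filepath.format(epoch=1, **logs)
except KeyError as err:
    print(err)  # -> 'val_mean_absolute_error'
```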

Below is the code; I got it from here:

import tensorflow as tf
from tensorflow import keras

def get_model():
    model = keras.Sequential()
    model.add(keras.layers.Dense(1, input_dim=784))
    model.compile(
        optimizer=keras.optimizers.RMSprop(learning_rate=0.1),
        loss="mean_squared_error",
        metrics=["mean_absolute_error"],
    )
    return model

# Load example MNIST data and pre-process it
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# Limit the data to 1000 samples
x_train = x_train[:1000]
y_train = y_train[:1000]
x_test = x_test[:1000]
y_test = y_test[:1000]


nSteps = int(tf.math.ceil(len(x_train) / 128))  # batches per epoch
filepath = "saved-model_{epoch:02d}_{val_mean_absolute_error:.2f}.h5"

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath=filepath, monitor='val_mean_absolute_error', verbose=1, 
            save_best_only=False, mode='min', save_freq=2*nSteps)
]

model = get_model()
history = model.fit(
    x_train,
    y_train,
    validation_data=(x_test,y_test),
    batch_size=128,
    epochs=4,
    verbose=1,
    callbacks=callbacks,
)

I’m not sure if it’s a bug, but something is not right!

Thank you.

==================
Edited:

After a bit of debugging, I found this code in callbacks.py:

  # ModelCheckpoint's override:
  def _implements_train_batch_hooks(self):
    # Only call batch hooks when saving on batch
    return self.save_freq != 'epoch'

  # The generic implementations in the base Callback class:
  def _implements_train_batch_hooks(self):
    """Determines if this Callback should be called for each train batch."""
    return (not generic_utils.is_default(self.on_batch_begin) or
            not generic_utils.is_default(self.on_batch_end) or
            not generic_utils.is_default(self.on_train_batch_begin) or
            not generic_utils.is_default(self.on_train_batch_end))

  def _implements_test_batch_hooks(self):
    """Determines if this Callback should be called for each test batch."""
    return (not generic_utils.is_default(self.on_test_batch_begin) or
            not generic_utils.is_default(self.on_test_batch_end))

Accordingly, when save_freq='epoch', ModelCheckpoint skips the train-batch hooks and saves from on_epoch_end, where the filename can be formatted correctly because the validation metrics are already in logs. With an integer save_freq, the save happens in on_train_batch_end, before validation has run. So I think this is a bug: the code should either detect when save_freq corresponds to a whole number of epochs, or provide another parameter to say whether the frequency is in epochs or steps.
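If the goal is simply "save every two epochs", one workaround is to do the saving from on_epoch_end, where the val_* metrics are already available in logs. A minimal sketch (the class name PeriodicCheckpoint and its parameters are made up for illustration; it lacks save_best_only and the other ModelCheckpoint options):

```python
import tensorflow as tf

class PeriodicCheckpoint(tf.keras.callbacks.Callback):
    """Save the full model every `n_epochs` epochs from on_epoch_end,
    where `logs` already contains the val_* metrics."""

    def __init__(self, filepath, n_epochs=2):
        super().__init__()
        self.filepath = filepath
        self.n_epochs = n_epochs

    def _format_name(self, epoch, logs):
        # Keras passes a 0-based epoch; filenames are 1-based,
        # matching ModelCheckpoint's convention.
        return self.filepath.format(epoch=epoch + 1, **(logs or {}))

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.n_epochs == 0:
            self.model.save(self._format_name(epoch, logs))
```

It can then be passed in the callbacks list in place of ModelCheckpoint, e.g. callbacks=[PeriodicCheckpoint(filepath, n_epochs=2)].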

Hi @msat59

Welcome to the TensorFlow Forum!

In the ModelCheckpoint callback, save_freq accepts either 'epoch' or an integer. With 'epoch', the callback saves the model after every epoch; with an integer, it saves at the end of that many batches.

If filepath is weights.{epoch:02d}-{val_loss:.2f}.hdf5, the checkpoints are saved with the epoch number and the validation loss in the filename. For that you need save_freq='epoch' (the default when not specified). If you use save_freq = 2 * nSteps, the model is saved at the end of that many batches, i.e. inside the training loop before validation has run, so the val_* metrics are not yet available and this is not compatible with the filepath format you defined.
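To illustrate with plain Python: a filepath that references only the epoch number (no val_* placeholders) can be formatted at a batch boundary, since epoch is always known:

```python
# Safe with a batch-based save_freq: only `epoch` is needed to format it.
batch_safe = "saved-model_{epoch:02d}.h5"
print(batch_safe.format(epoch=3))  # -> saved-model_03.h5
```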

Please refer to this gist. Thank you.