Checkpoint not saved when using ModelCheckpoint with save_freq = 5000

Hello,
I was not sure whether I should report this as a bug. I am practicing using the save_freq argument as an integer, following an online course (the course uses TensorFlow 2.0.0, while I have the latest, 2.5.0).

Here is the relevant documentation, but it does not include an example that uses an integer.

Here is my code:

import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D

Using a CIFAR-10 dataset sample:

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0

Use a smaller subset to speed things up:

x_train = x_train[:10000]
y_train = y_train[:10000]
x_test = x_test[:1000]
y_test = y_test[:1000]

Define a function that creates a new instance of a simple CNN:

def get_new_model():
    model = Sequential([
        Conv2D(filters=16, input_shape=(32, 32, 3), kernel_size=(3, 3),
               activation='relu', name='conv_1'),
        Conv2D(filters=8, kernel_size=(3, 3), activation='relu', name='conv_2'),
        MaxPooling2D(pool_size=(4, 4), name='pool_1'),
        Flatten(name='flatten'),
        Dense(units=32, activation='relu', name='dense_1'),
        Dense(units=10, activation='softmax', name='dense_2')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

Create a TensorFlow checkpoint callback with epoch and batch details:

checkpoint_5000_path = '/model_checkpoints_5000/checkpoint_{epoch:02d}-{batch:04d}'
checkpoint_5000 = ModelCheckpoint(filepath=checkpoint_5000_path,
                                  save_weights_only=True,
                                  save_freq=5000,
                                  verbose=1)

Create and fit the model with the checkpoint callback:

model = get_new_model()
model.fit(x=x_train,
          y=y_train,
          epochs=3,
          validation_data=(x_test, y_test),
          batch_size=10,
          callbacks=[checkpoint_5000])

This is meant to save checkpoint files with names that include the epoch and batch number.
However, the files are not created. Even after I create the model_checkpoints_5000 directory manually, no files are added to it.
(We can check the contents by running '! dir -a model_checkpoints_5000' on Windows
or 'ls -lh model_checkpoints_5000' on Linux.)

I have also tried changing the path to '/model_checkpoints_5000/checkpoint_{epoch:02d}', but it still does not save a file with each epoch's number.
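
(A platform-independent way to do the same directory check as above, as a small convenience sketch; the path matches the filepath used in the callback:)

import os

# List whatever has been written to the checkpoint directory (if anything)
checkpoint_dir = '/model_checkpoints_5000'
if os.path.isdir(checkpoint_dir):
    print(os.listdir(checkpoint_dir))
else:
    print('directory does not exist yet')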

Hi @Maria_B

Try changing the save_freq parameter in ModelCheckpoint to something smaller, like 1000. An integer save_freq counts batches, not samples: with 10,000 training samples and batch_size=10 you run only 1,000 batches per epoch, i.e. 3,000 batches over the 3 epochs here, so a save_freq of 5000 is never reached.
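
The quick arithmetic behind that suggestion, using the values from your code above (just an illustration):

# save_freq is counted in batches, so check how many batches this run actually sees
num_samples = 10000      # size of the x_train subset
batch_size = 10
epochs = 3

batches_per_epoch = num_samples // batch_size    # 1000 batches per epoch
total_batches = batches_per_epoch * epochs       # 3000 batches in the whole run
print(batches_per_epoch, total_batches)          # save_freq=5000 is never reached; 1000 fires once per epoch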

Full code:

import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0

x_train = x_train[:10000]
y_train = y_train[:10000]
x_test = x_test[:1000]
y_test = y_test[:1000]

def get_new_model():
  model = Sequential([
                      Conv2D(filters=16,
                             input_shape=(32, 32, 3),
                             kernel_size=(3, 3),
                             activation='relu',
                             name='conv_1'),
                      Conv2D(filters=8,
                             kernel_size=(3, 3),
                             activation='relu',
                             name='conv_2'),
                      MaxPooling2D(pool_size=(4, 4),
                                   name='pool_1'),
                      Flatten(name='flatten'),
                      Dense(units=32, activation='relu', name='dense_1'),
                      Dense(units=10, activation='softmax', name='dense_2')
                      ])
  model.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
  return model

checkpoint_5000_path = '/model_checkpoints_5000/checkpoint_{epoch:02d}-{batch:04d}'
checkpoint_5000 = ModelCheckpoint(filepath=checkpoint_5000_path,
                                  save_weights_only=True,
                                  save_freq=1000,
                                  verbose=1)

model = get_new_model()

model.fit(x=x_train,
          y=y_train,
          epochs=3,
          validation_data=(x_test, y_test),
          batch_size=10,
          callbacks=[checkpoint_5000])

Your output should hopefully include the following during training:

Epoch 00001: saving model to /model_checkpoints_5000/checkpoint_01-1000
1000/1000 [==============================] ......
Epoch 2/3
 999/1000 [============================>.] ......
Epoch 00002: saving model to /model_checkpoints_5000/checkpoint_02-1000
1000/1000 [==============================] ......
Epoch 3/3
 997/1000 [============================>.] ......
Epoch 00003: saving model to /model_checkpoints_5000/checkpoint_03-1000

Then, check the /model_checkpoints_5000/ folder.

Let us know if this helps.

Maybe this could also help:

save_freq='epoch'
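
Applied to the code above it would look something like this (just a sketch; with save_freq='epoch' the filepath should only use {epoch}, not {batch}):

# Save weights at the end of every epoch instead of every N batches
checkpoint_epoch_path = 'model_checkpoints_epoch/checkpoint_{epoch:02d}'  # hypothetical path
checkpoint_epoch = ModelCheckpoint(filepath=checkpoint_epoch_path,
                                   save_weights_only=True,
                                   save_freq='epoch',
                                   verbose=1)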

There is another good example in this tutorial: Save and load models | TensorFlow Core

...
batch_size = 32

# Create a callback that saves the model's weights every 5 epochs
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path, 
    verbose=1, 
    save_weights_only=True,
    save_freq=5*batch_size)
...
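
Note that save_freq there is still measured in batches; the "every 5 epochs" only works out because of that tutorial's setup. A rough check, assuming (as in the tutorial) about 1000 training examples:

# Assumed values from that tutorial's setup
num_samples = 1000
batch_size = 32

batches_per_epoch = -(-num_samples // batch_size)   # ceil division -> 32 batches per epoch
save_freq = 5 * batch_size                           # 160 batches
print(save_freq / batches_per_epoch)                 # ~5, i.e. a checkpoint roughly every 5 epochs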