Checkpoint not saved when using ModelCheckpoint with save_freq = 5000

Maria_B · December 16, 2021, 3:46am

Hello,
I wasn't sure whether to report this as a bug. I'm practicing with the save_freq argument set to an integer, following an online course (they use TensorFlow 2.0.0, while I have the latest TensorFlow, 2.5.0).

Here's the relevant documentation, but it has no example that uses an integer.

Here is my code:

import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D

# Use a sample of the CIFAR-10 dataset

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0

# Use a smaller subset to speed things up

x_train = x_train[:10000]
y_train = y_train[:10000]
x_test = x_test[:1000]
y_test = y_test[:1000]

# Define a function that creates a new instance of a simple CNN

def get_new_model():
    model = Sequential([
        Conv2D(filters=16, input_shape=(32, 32, 3), kernel_size=(3, 3),
               activation='relu', name='conv_1'),
        Conv2D(filters=8, kernel_size=(3, 3), activation='relu', name='conv_2'),
        MaxPooling2D(pool_size=(4, 4), name='pool_1'),
        Flatten(name='flatten'),
        Dense(units=32, activation='relu', name='dense_1'),
        Dense(units=10, activation='softmax', name='dense_2')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Create a ModelCheckpoint callback whose filenames include the epoch and batch numbers

checkpoint_5000_path = '/model_checkpoints_5000/checkpoint_{epoch:02d}-{batch:04d}'
checkpoint_5000 = ModelCheckpoint(filepath=checkpoint_5000_path,
                                  save_weights_only=True,
                                  save_freq=5000,
                                  verbose=1)

# Create and fit the model with the checkpoint callback

model = get_new_model()
model.fit(x=x_train,
          y=y_train,
          epochs=3,
          validation_data=(x_test, y_test),
          batch_size=10,
          callbacks=[checkpoint_5000])

It is meant to save checkpoint files whose names include the epoch and batch number.
However, no files are created. Even after I manually create the directory, model_checkpoints_5000, nothing is added to it.
(You can check the contents by running '! dir -a model_checkpoints_5000' on Windows
or 'ls -lh model_checkpoints_5000' on Linux.)

I have also tried changing the path to '/model_checkpoints_5000/checkpoint_{epoch:02d}', but it still does not save a file for each epoch.
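
For reference, the placeholders in the filepath follow Python's str.format syntax, so the intended filenames can be previewed without training anything. A minimal sketch; the epoch and batch values below are made up for illustration:

```python
# Preview how the filename template expands for given epoch/batch values.
# The values passed to format() are hypothetical, chosen only for illustration.
template = 'checkpoint_{epoch:02d}-{batch:04d}'

first = template.format(epoch=1, batch=1000)
second = template.format(epoch=3, batch=25)

print(first)   # checkpoint_01-1000
print(second)  # checkpoint_03-0025
```

The `:02d` and `:04d` specifiers zero-pad the numbers to at least two and four digits respectively.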

8bitmp3 · January 13, 2022, 1:23am

Hi @Maria_B

Try changing the save_freq parameter in ModelCheckpoint to something smaller, like 1000, since you’re only training over 3 epochs here.

Full code:

import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0

x_train = x_train[:10000]
y_train = y_train[:10000]
x_test = x_test[:1000]
y_test = y_test[:1000]

def get_new_model():
  model = Sequential([
                      Conv2D(filters=16,
                             input_shape=(32, 32, 3),
                             kernel_size=(3, 3),
                             activation='relu',
                             name='conv_1'),
                      Conv2D(filters=8,
                             kernel_size=(3, 3),
                             activation='relu',
                             name='conv_2'),
                      MaxPooling2D(pool_size=(4, 4),
                                   name='pool_1'),
                      Flatten(name='flatten'),
                      Dense(units=32, activation='relu', name='dense_1'),
                      Dense(units=10, activation='softmax', name='dense_2')
                      ])
  model.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
  return model

checkpoint_5000_path = '/model_checkpoints_5000/checkpoint_{epoch:02d}-{batch:04d}'
checkpoint_5000 = ModelCheckpoint(filepath=checkpoint_5000_path,
                                  save_weights_only=True,
                                  save_freq=1000,
                                  verbose=1)

model = get_new_model()

model.fit(x=x_train,
          y=y_train,
          epochs=3,
          validation_data=(x_test, y_test),
          batch_size=10,
          callbacks=[checkpoint_5000])

Your output should hopefully include the following during training:

Epoch 00001: saving model to /model_checkpoints_5000/checkpoint_01-1000
1000/1000 [==============================] ......
Epoch 2/3
 999/1000 [============================>.] ......
Epoch 00002: saving model to /model_checkpoints_5000/checkpoint_02-1000
1000/1000 [==============================] ......
Epoch 3/3
 997/1000 [============================>.] ......
Epoch 00003: saving model to /model_checkpoints_5000/checkpoint_03-1000

Then, check the /model_checkpoints_5000/ folder.
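
If you prefer to check the folder from Python rather than the shell, something like the following should work on both Windows and Linux. This is a small sketch: list_checkpoint_files is a hypothetical helper name, not part of TensorFlow.

```python
from pathlib import Path


def list_checkpoint_files(directory):
    """Return the sorted file names in `directory`, or [] if it does not exist."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir() if p.is_file())


# Path taken from the code in this thread; adjust it if you saved elsewhere.
print(list_checkpoint_files('/model_checkpoints_5000'))
```

With save_weights_only=True and the default TensorFlow checkpoint format, you would typically expect a file named checkpoint plus .index and .data-* files for each saved step.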

Let us know if this helps.

8bitmp3 · January 13, 2022, 1:26am

Maybe this could also help:

save_freq='epoch'

There is another good example in this tutorial: Save and load models | TensorFlow Core

...
batch_size = 32

# Create a callback that saves the model's weights every 5 epochs
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path, 
    verbose=1, 
    save_weights_only=True,
    save_freq=5*batch_size)
...
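
One caveat worth keeping in mind: an integer save_freq is counted in batches, so 5*batch_size corresponds to 5 epochs only when an epoch happens to contain batch_size batches. A more general way to get "every N epochs" is to multiply N by the number of batches per epoch. A minimal sketch, assuming fit() keeps the last partial batch (hence the ceil); save_freq_for_every_n_epochs is a hypothetical helper, not a Keras function:

```python
import math


def save_freq_for_every_n_epochs(num_samples, batch_size, n_epochs):
    """Return an integer save_freq (in batches) that spans n_epochs epochs."""
    # Assumption: the final, smaller batch of an epoch still counts as a batch.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return n_epochs * steps_per_epoch


# Numbers from the original post: 10,000 samples, batch_size=10.
print(save_freq_for_every_n_epochs(10000, 10, 1))  # 1000
# 1,000 samples with batch_size=32 gives 32 batches per epoch.
print(save_freq_for_every_n_epochs(1000, 32, 5))   # 160
```

The result can then be passed as save_freq to ModelCheckpoint.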

Robert_McKean · April 25, 2022, 3:52am

I had the same problem with TensorFlow 2.8. This was the fix (from the TensorFlow site):

save_freq 'epoch' or integer. When using 'epoch', the callback saves the model after each epoch. When using integer, the callback saves the model at end of this many batches.

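So if you set save_freq=5000, the directory won't be created because you never reach the checkpoint value (which is 5000 BATCHES, not 5000 samples). The code above keeps 10,000 training samples, so with batch_size=10 each epoch has 1,000 batches, and 3 epochs give only 3,000 batches in total. Reduce save_freq to a value below that: save_freq=1000 gives one checkpoint per epoch, and save_freq=100 gives 10 checkpoints per epoch.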

That will fix your problem. Good luck!
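
To make the arithmetic concrete, here is a small pure-Python simulation of an integer save_freq, using the numbers from the original post (10,000 samples, batch_size=10, 3 epochs). It is only a sketch of the counting logic, not the Keras implementation. It assumes the batch counter carries over from one epoch to the next; if the counter reset each epoch instead, 5000 would still never be reached. The function name is hypothetical.

```python
import math


def simulate_checkpoint_saves(num_samples, batch_size, epochs, save_freq):
    """Return the (epoch, batch) pairs at which a save would be triggered."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    saves = []
    batches_since_last_save = 0  # assumed to persist across epochs
    for epoch in range(1, epochs + 1):
        for batch in range(1, steps_per_epoch + 1):
            batches_since_last_save += 1
            if batches_since_last_save >= save_freq:
                saves.append((epoch, batch))
                batches_since_last_save = 0
    return saves


# Only 3,000 batches are processed in total, so 5000 is never reached.
print(simulate_checkpoint_saves(10000, 10, 3, 5000))  # []
# With save_freq=1000 a save happens at the end of each epoch.
print(simulate_checkpoint_saves(10000, 10, 3, 1000))  # [(1, 1000), (2, 1000), (3, 1000)]
```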