Is there any difference between training with model.fit and tf.GradientTape?

When training a keypoint detection model, I first tried tf.GradientTape(). Data loading uses tf.keras.utils.Sequence; the code is shown below:

import tensorflow as tf

epoch = 20
learning_rate = 0.001
model = MyModel()
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

for i in range(epoch):
    for j in range(len(my_training_batch_generator)):  # iterate over all (images, labels) batches
        images, labels = my_training_batch_generator[j]  # images (4,224,224,3), labels (4,56,56,17)
        with tf.GradientTape() as tape:
            y_pred = model(images)  # model output (4,56,56,17)
            loss = tf.square(labels - y_pred)
            loss = tf.reduce_mean(loss)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(grads_and_vars=zip(grads, model.trainable_variables))
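
For reference, my_training_batch_generator is a tf.keras.utils.Sequence roughly like the sketch below (simplified; the class name KeypointSequence and the load_and_resize helper are placeholders for my real loading code, and only the batch shapes come from the comments above):

import math
import numpy as np

class KeypointSequence(tf.keras.utils.Sequence):
    def __init__(self, image_paths, heatmaps, batch_size=4):
        self.image_paths = image_paths  # list of training image paths (placeholder)
        self.heatmaps = heatmaps        # per-image (56, 56, 17) target heatmaps (placeholder)
        self.batch_size = batch_size    # 4 matches the shapes shown in the training loop

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.image_paths) / self.batch_size)

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        # load_and_resize is a placeholder for the real decoding/resizing logic
        images = np.stack([load_and_resize(p) for p in self.image_paths[lo:hi]])  # (4, 224, 224, 3)
        labels = np.stack(self.heatmaps[lo:hi])                                   # (4, 56, 56, 17)
        return images, labels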

This training loop works very well and achieves good detection results. But I want to make use of callbacks, so I tried training with model.fit instead:

def loss_function(y_true, y_pred):
    loss = tf.square(y_true - y_pred)
    loss = tf.reduce_mean(loss)
    return loss

model.compile(
    loss=loss_function,
    # loss = tf.keras.losses.MSE,
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
)

model.fit(
    x=my_training_batch_generator,  # tf.keras.utils.Sequence
    epochs=epoch,
)

No other settings were changed, but the accuracy after training is very bad, and I observe that the loss value stays the same in every epoch.
Is there any difference between the two methods of training? This question has really bothered me for days. Thanks for your answer.

Hi @SuSei

Yes, there are differences between training with model.fit and training with tf.GradientTape. The key differences are how the gradient computation is performed and how the batch size is handled. With GradientTape you compute the gradients and apply the update step yourself, iterating over the batches manually, whereas model.fit() handles the gradient calculation and the updates automatically, optimizes for efficiency, and uses a default batch size of 32 unless one is specified (when x is a tf.keras.utils.Sequence, the batches come from the generator itself). Another difference can be how the data is iterated over and shuffled in the two cases.
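
To make the comparison concrete, what model.fit() does for each batch pulled from the Sequence is roughly equivalent to the step below (a simplified sketch, not the actual Keras source; model, loss_function and the optimizer are the ones passed to model.compile above):

# Simplified sketch of one training step as performed by model.fit
images, labels = my_training_batch_generator[0]  # fit pulls batches straight from the Sequence

with tf.GradientTape() as tape:
    y_pred = model(images, training=True)  # fit calls the model with training=True
    loss = loss_function(labels, y_pred)   # the compiled loss

grads = tape.gradient(loss, model.trainable_variables)
model.optimizer.apply_gradients(zip(grads, model.trainable_variables))

One practical difference worth noting: fit runs the model with training=True, so layers such as BatchNormalization or Dropout behave differently than in the manual loop, where model(images) defaults to training=False.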

So, to get results closer to the manual computation, you can try matching the batch size you used in the GradientTape loop, or experiment with the batch size, data shuffling, and the learning rate; you can also try the 'relu' activation function to prevent exploding gradients. Thank you.
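
For example, assuming the Sequence exposes a batch_size argument as in the KeypointSequence sketch earlier in the thread (a hypothetical class, not the asker's actual generator), the experiment could look like this:

# Re-create the generator with the same batch size used in the GradientTape loop
my_training_batch_generator = KeypointSequence(image_paths, heatmaps, batch_size=4)

model.compile(
    loss=loss_function,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # try a smaller learning rate
)

model.fit(
    x=my_training_batch_generator,
    epochs=epoch,
    shuffle=True,  # with a Sequence this shuffles the order in which batches are drawn
)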