Is there any difference between training with model.fit() and tf.GradientTape?

When training a keypoint detection model, I first tried tf.GradientTape(). Data loading uses tf.keras.utils.Sequence; the code is shown below:

import tensorflow as tf

epoch = 20
learning_rate = 0.001
model = MyModel()
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

for i in range(epoch):
    for j in range(len(my_training_batch_generator)):  # iterate over all (image, label) batches
        images, labels = my_training_batch_generator[j]  # images (4,224,224,3), labels (4,56,56,17)
        with tf.GradientTape() as tape:
            y_pred = model(images, training=True)  # model output (4,56,56,17)
            loss = tf.square(labels - y_pred)  # elementwise squared error
            loss = tf.reduce_mean(loss)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(grads_and_vars=zip(grads, model.trainable_variables))
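For reference, the same per-batch update can be wrapped in tf.function, which compiles it into a graph much like model.fit() does with its own train step. This is a minimal, self-contained sketch: the tiny stand-in model, reduced input shape, and random data here are assumptions for illustration, not the real MyModel or generator.

```python
import tensorflow as tf

# Stand-in model (assumption): a single conv mapping 3 channels to 17,
# mirroring the question's label depth but with a smaller spatial size.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(56, 56, 3)),
    tf.keras.layers.Conv2D(17, 3, padding="same"),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

@tf.function  # compiles the step into a graph, as model.fit() does internally
def train_step(images, labels):
    with tf.GradientTape() as tape:
        y_pred = model(images, training=True)
        loss = tf.reduce_mean(tf.square(labels - y_pred))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Dummy batch (assumption) just to show the step running and the loss moving.
images = tf.random.normal((4, 56, 56, 3))
labels = tf.random.normal((4, 56, 56, 17))
first = train_step(images, labels)
for _ in range(20):
    last = train_step(images, labels)
```

Overfitting a single fixed batch like this is a quick sanity check that gradients actually flow: the loss should drop over repeated steps.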

It works very well and achieves good detection results. But I want to make use of Keras callbacks, so I then tried training with model.fit():

def loss_function(y_true, y_pred):
    loss = tf.square(y_true - y_pred)
    loss = tf.reduce_mean(loss)
    return loss

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=loss_function,
    # loss=tf.keras.losses.MSE,
)
model.fit(
    x=my_training_batch_generator,  # tf.keras.utils.Sequence
    epochs=20,
)
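One thing worth checking is how the two loss options reduce: the custom loss_function returns a single scalar (mean over every axis), while tf.keras.losses.MSE averages only over the last axis and leaves the remaining reduction to Keras. A quick comparison on dummy tensors with the question's label shape (the random data is a stand-in):

```python
import tensorflow as tf

def loss_function(y_true, y_pred):
    # custom loss from the question: mean of squared error over ALL axes
    return tf.reduce_mean(tf.square(y_true - y_pred))

# dummy tensors with the question's label shape (4,56,56,17)
y_true = tf.random.normal((4, 56, 56, 17))
y_pred = tf.random.normal((4, 56, 56, 17))

custom = loss_function(y_true, y_pred)           # scalar
keras_mse = tf.keras.losses.MSE(y_true, y_pred)  # shape (4,56,56): mean over last axis only
```

The mean of the Keras per-element losses equals the custom scalar, so the two should be numerically equivalent once Keras applies its own mean reduction.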

There is no change in any other settings, but the accuracy after training is very bad, and I observe that the loss value stays the same across epochs.
Is there any difference between these two training methods? This question has bothered me for days; thanks for your answer.
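For context, the kind of callback use I'm after with model.fit() can be sketched like this (the tiny model, random data, and LambdaCallback choice here are assumptions for illustration only):

```python
import tensorflow as tf

# Stand-in model and data (assumptions) just to demonstrate the callback hook.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam",
              loss=lambda yt, yp: tf.reduce_mean(tf.square(yt - yp)))

# Record the loss at the end of every epoch via a callback.
losses = []
log_cb = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: losses.append(logs["loss"]))

x = tf.random.normal((32, 3))
y = tf.random.normal((32, 1))
model.fit(x, y, epochs=5, verbose=0, callbacks=[log_cb])
```

This per-epoch hook is exactly what the plain tf.GradientTape loop above lacks without extra bookkeeping.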