I have an aerial-imagery dataset (around 250k images for training) and am trying to fine-tune the EfficientNetV2-XL architecture pretrained on the ImageNet-21k dataset. Since this architecture is not directly available in Keras, I am using the pretrained backbone from TensorFlow Hub. Below is my model implementation:
```python
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers

inputs = layers.Input(shape=(512, 512, 3))

# Pretrained EfficientNetV2-XL feature extractor from TF Hub, unfrozen for fine-tuning
backbone = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/efficientnet_v2_imagenet21k_xl/feature_vector/2",
    trainable=True,
)
x = backbone(inputs)

# Multi-task learning: three classification heads sharing the same backbone
output1 = layers.Dense(3, activation='softmax')(x)
output2 = layers.Dense(6, activation='softmax')(x)
output3 = layers.Dense(5, activation='softmax')(x)

model = tf.keras.Model(inputs, [output1, output2, output3])

lr_schedule = tf.keras.optimizers.schedules.CosineDecay(0.001, 10000)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule, clipvalue=2)
```
`sparse_categorical_crossentropy` is used as the loss for each of the three outputs.
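For completeness, this is roughly how the multi-output model is compiled and fed, shown here with a tiny stand-in backbone so the snippet runs on its own (the dummy conv backbone, metric choice, and random data are assumptions for illustration, not my exact script):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Tiny stand-in backbone (the real one is the TF-Hub EfficientNetV2-XL feature vector)
inputs = layers.Input(shape=(8, 8, 3))
x = layers.GlobalAveragePooling2D()(layers.Conv2D(4, 3, padding='same')(inputs))

# Three classification heads, mirroring the real model
out1 = layers.Dense(3, activation='softmax')(x)
out2 = layers.Dense(6, activation='softmax')(x)
out3 = layers.Dense(5, activation='softmax')(x)
model = tf.keras.Model(inputs, [out1, out2, out3])

# One sparse_categorical_crossentropy per head; labels are integer class ids
model.compile(
    optimizer='adam',
    loss=['sparse_categorical_crossentropy'] * 3,
)

# Dummy batch: images plus one integer label vector per head
x_batch = np.random.rand(4, 8, 8, 3).astype('float32')
y_batch = [np.random.randint(0, n, size=(4,)) for n in (3, 6, 5)]
history = model.fit(x_batch, y_batch, epochs=1, verbose=0)
```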
I am training on 220k images with a batch size of 8. By the end of the first epoch the weights become NaN, and shortly afterwards the loss becomes NaN as well.
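To narrow down where things blow up, I stop training as soon as a non-finite loss appears and then check which layers contain bad weights (a minimal sketch; `find_bad_layers` is a helper I wrote for debugging, and the tiny demo model at the bottom is only there to make the snippet self-contained):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Built-in callback: stops training the moment a batch loss becomes NaN or Inf
nan_guard = tf.keras.callbacks.TerminateOnNaN()
# model.fit(..., callbacks=[nan_guard])

def find_bad_layers(model):
    """Return the names of layers whose weights contain NaN or Inf."""
    bad = []
    for layer in model.layers:
        for w in layer.get_weights():
            if not np.isfinite(w).all():
                bad.append(layer.name)
                break
    return bad

# Tiny demo model (not the real backbone) to show the check in action
m = tf.keras.Sequential([tf.keras.Input(shape=(4,)), layers.Dense(2, name='head')])
print(find_bad_layers(m))  # healthy weights -> []
m.set_weights([np.full((4, 2), np.nan), np.zeros(2)])
print(find_bad_layers(m))  # -> ['head']
```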
I suspect a numerical-stability issue such as exploding gradients (NaN weights typically point to exploding rather than vanishing gradients), and any help preventing it is highly appreciated.