Manual updating of weights using GradientTape and an optimizer

I have some doubts… I need someone who has experience working with TensorFlow and Keras.

My problem is: I'm working on a reinforcement learning system where I collect experiences in batches and then backpropagate using the rewards. I use my own custom loss function, and I use an epsilon-greedy strategy for exploration.
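For concreteness, here is a minimal sketch of the kind of epsilon-greedy selection I mean. The helper name `select_action` and the use of NumPy are just illustrative assumptions, not my actual code; the point is that the function returns the action that was actually executed so it can be stored with the experience:

```python
import numpy as np

def select_action(probs, epsilon, rng=None):
    # Hypothetical helper (illustrative only): with probability epsilon pick a
    # uniformly random action, otherwise take the argmax of the predicted
    # probabilities. The return value is the action that was ACTUALLY executed,
    # which must be stored in the experience batch for the loss computation.
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(probs)))  # random exploratory action
    return int(np.argmax(probs))              # greedy action
```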
Now, the problem is that my model predicts one class with high probability all the time, and I suspect the cause is that it isn't accounting for the random actions at all. I calculate the loss using the probability of the action taken. Say the prediction is [0.1, 0.3, 0.6] and my random action is 0, which has probability 0.1. Assume that action is actually correct for that state, so I give a high reward and a low loss. Now my model ends up behaving as if it got that high reward because of its argmax output (the 0.6 probability, i.e. action 2), and it never learns that the reward actually came from the random action 0.
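If that diagnosis is right, the usual fix is to compute the loss on the probability of the action that was actually executed (stored in the batch), never on the argmax of the fresh prediction. A minimal sketch with GradientTape, assuming a small 3-action softmax policy and a REINFORCE-style loss (the network shape, names, and loss form are illustrative assumptions, not the asker's actual setup):

```python
import tensorflow as tf

# Hypothetical tiny policy network over 3 actions (shapes are illustrative).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

def train_step(states, actions_taken, rewards):
    # actions_taken must be the actions that were actually executed during
    # collection - including epsilon-greedy random ones - NOT the argmax
    # of the current prediction.
    actions_taken = tf.cast(actions_taken, tf.int32)
    with tf.GradientTape() as tape:
        probs = model(states, training=True)                # (batch, 3)
        idx = tf.stack(
            [tf.range(tf.shape(actions_taken)[0]), actions_taken], axis=1)
        p_taken = tf.gather_nd(probs, idx)                  # prob of executed action
        # REINFORCE-style surrogate: push up log-prob of rewarded actions.
        loss = -tf.reduce_mean(rewards * tf.math.log(p_taken + 1e-8))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

The key line is the `tf.gather_nd` on the stored `actions_taken`: with this, a high reward for random action 0 increases the gradient on action 0's probability, not on the argmax action.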

I just want to know how to solve this, as it has been happening for the past week. I've tried all the methods suggested by ChatGPT and Gemini and they aren't working, so now I need someone who is good at Keras and TensorFlow.