I know that RNN(LSTMCell(units)) and LSTM(units) are mathematically equivalent, but after a few epochs I get NaNs in the recurrent layer with LSTM, while this does not happen with RNN(LSTMCell). Is this due to the use of cuDNN? CUDA on my server was recently updated from 11.4 to 11.6. Should one disable automatic CUDA updates when this kind of issue arises?

Below are the versions I am using.
tensorflow 2.9.1
cudatoolkit 11.2.2
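
One way to test the equivalence claim directly is to give both layers the same weights and compare their outputs on the same input. The snippet below is a minimal sketch of that check, not the poster's model: the sizes (`units`, `timesteps`, `features`), the batch of random data, and the tolerance are all assumptions chosen for illustration. If the difference is tiny here but NaNs still appear during training, that would hint at the training setup rather than the layer itself.

```python
import numpy as np
import tensorflow as tf

# Hypothetical sizes, chosen only for this illustration.
units, timesteps, features = 8, 5, 3

lstm = tf.keras.layers.LSTM(units)
rnn = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(units))

# Random input batch of shape (batch, timesteps, features).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, timesteps, features)).astype("float32")

# Calling each layer once creates its weights.
y_lstm = lstm(x)
_ = rnn(x)

# Both layers store [kernel, recurrent_kernel, bias], so the weights
# of one can be copied into the other.
rnn.set_weights(lstm.get_weights())
y_rnn = rnn(x)

# Largest element-wise difference between the two outputs.
max_diff = float(np.max(np.abs(y_lstm.numpy() - y_rnn.numpy())))
print("max abs difference:", max_diff)
```

On a GPU the LSTM layer may take the cuDNN path, so a small floating-point difference is expected; a large one, or a NaN, would be worth reporting.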

I don't think it is due to cuDNN or the CUDA libraries. It is more likely due to the architecture you have built, or there may be an issue with your dataset. You can share the architecture and code here so that I can look into it.

Thanks for the reply! It's an RL simulation environment (RecSim NG), and I am not sure it would be appropriate to share the code, since there are a lot of dependencies between objects.

I have two separate gradient flows, and I think that may be what is causing the gradients to explode. I am wondering whether using layer normalization would help.

Alright. Yes, you should try normalization.
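
For readers unfamiliar with what layer normalization computes, here is a minimal pure-Python sketch, assuming a single hidden-state vector given as a list of floats. The function name `layer_norm` and the parameters `gain`, `bias`, and `eps` are illustrative, not taken from the poster's code. It shows that the output scale stays bounded even when the input grows by several orders of magnitude, which is why normalization may help with exploding activations. It is not a guaranteed fix for NaNs.

```python
import math

def layer_norm(values, gain=1.0, bias=0.0, eps=1e-5):
    """Rescale one vector to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # eps keeps the division stable when the variance is close to zero.
    return [gain * (v - mean) / math.sqrt(var + eps) + bias for v in values]

# Same pattern at two very different scales.
small = layer_norm([0.1, -0.2, 0.3, 0.0])
large = layer_norm([1000.0, -2000.0, 3000.0, 0.0])

print(small)
print(large)
```

In Keras the same operation is available as `tf.keras.layers.LayerNormalization`, which can be placed after the recurrent layer. Normalizing inside the recurrence itself needs a custom cell wrapped in `RNN(...)`, since the built-in `LSTM` layer does not expose that option.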