Step-by-step training of a cross-attention model

https://i.postimg.cc/SRfyT8sc/cross-attention.png
I recently implemented the model shown above in TensorFlow Keras, and its classification accuracy dropped significantly.
I want to fuse two sets of feature data with cross attention, but after fusion training the classification accuracy is only 49%, whereas the accuracy before fusion was 94%.
I am very confused. I would appreciate any suggestions from experienced users on what to adjust.
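
The fusion described above can be sketched roughly as follows. This is only an illustration under assumptions, not the exact model in the image: the function name `build_cross_attention_model`, the input shapes, the number of heads, the residual connections with layer normalization, and the pooling before the classifier are all hypothetical choices.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


def build_cross_attention_model(seq_a=16, seq_b=16, dim=32, num_classes=2):
    """Hypothetical two-branch model fused with bidirectional cross attention."""
    # Assumed inputs: two feature sequences that share the same width `dim`.
    feat_a = layers.Input(shape=(seq_a, dim), name="feat_a")
    feat_b = layers.Input(shape=(seq_b, dim), name="feat_b")

    # Cross attention: A queries B, and B queries A (key defaults to value).
    a_from_b = layers.MultiHeadAttention(num_heads=4, key_dim=dim // 4)(feat_a, feat_b)
    b_from_a = layers.MultiHeadAttention(num_heads=4, key_dim=dim // 4)(feat_b, feat_a)

    # Residual connections (an assumption) keep the original features
    # available to the classifier alongside the attended ones.
    a = layers.LayerNormalization()(layers.Add()([feat_a, a_from_b]))
    b = layers.LayerNormalization()(layers.Add()([feat_b, b_from_a]))

    # Pool each branch over the sequence axis and concatenate for classification.
    pooled = layers.Concatenate()([
        layers.GlobalAveragePooling1D()(a),
        layers.GlobalAveragePooling1D()(b),
    ])
    outputs = layers.Dense(num_classes, activation="softmax")(pooled)

    model = tf.keras.Model([feat_a, feat_b], outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # assumes integer class labels
        metrics=["accuracy"],
    )
    return model
```

Calling `build_cross_attention_model()` and then `model.fit([x_a, x_b], y)` with integer labels `y` would train both branches and the fusion layers together. If the single-branch encoders were trained beforehand, their layers would take the place of the raw `feat_a` and `feat_b` inputs here.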