Strangely, cache() still consumes host memory, and the job eventually gets killed with an OOM. Here is the memory usage I printed from a callback during training:
2022-03-08T22:19:40.154191003Z ...Training: end of batch 15700; got log keys: ['loss', 'copc', 'auc']
2022-03-08T22:19:40.159188560Z totalmemor: 59.958843GB
2022-03-08T22:19:40.159223737Z availablememory: 8.418320GB
2022-03-08T22:19:40.159250296Z usedmemory: 50.959393GB
2022-03-08T22:19:40.159257814Z percentof used memory: 86.000000
2022-03-08T22:19:40.159263710Z freememory:1.072124GB
2022-03-08T22:19:47.752077011Z Tue Mar 8 22:19:47 UTC 2022 job-submitter: job run error: signal: killed
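For anyone who wants to reproduce this kind of per-batch memory report, here is a minimal sketch of how it could be done. It is not the exact callback I used: it assumes a Linux host and reads `/proc/meminfo` using only the standard library. The names `parse_meminfo`, `memory_report`, and `log_memory` are made up for illustration, and "used" is taken as total minus available, which may differ slightly from the "usedmemory" figure in the log above.

```python
import os

_KB_PER_GB = 1024 ** 2  # /proc/meminfo reports sizes in kB


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: size in GB}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0]) / _KB_PER_GB
    return values


def memory_report(text):
    """Summarize total, available, free and used memory.

    Assumption: "used" is defined here as total minus available.
    """
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    used = total - available
    return {
        "total_gb": total,
        "available_gb": available,
        "free_gb": info["MemFree"],
        "used_gb": used,
        "used_percent": 100.0 * used / total,
    }


def log_memory(batch, logs=None, path="/proc/meminfo"):
    """Print a memory summary; the (batch, logs) signature matches Keras batch-end hooks."""
    if not os.path.exists(path):  # not a Linux host, nothing to report
        return
    with open(path) as f:
        report = memory_report(f.read())
    print(f"batch {batch}: " + ", ".join(f"{k}={v:.2f}" for k, v in report.items()))
```

One way to hook this into training would be `tf.keras.callbacks.LambdaCallback(on_batch_end=log_memory)` passed in the `callbacks` list of `model.fit`. Printing on every batch is noisy, so in practice it is probably worth logging only every N batches.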
I tested the same code on TF 2.3, which does not have this issue, but TF 2.5 and later run out of memory. I am not sure whether this is a bug or a configuration problem. Could anyone explain what is going on, or offer some clues?