Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Hello everyone! : )

I had a working recommender system model written with tensorflow-gpu==2.3.0. I tried to migrate it to tensorflow-gpu==2.11.0. When I train it, it runs fine for a few batches and then throws the following error:

2023-03-22 13:17:54.385699: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2023-03-22 13:17:54.385710: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2023-03-22 13:17:54.385716: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (2): Tesla T4, Compute Capability 7.5
2023-03-22 13:17:54.385722: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (3): Tesla T4, Compute Capability 7.5
2023-03-22 13:17:55.180084: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-03-22 13:17:59.593833: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2023-03-22 13:18:57.142924: E tensorflow/compiler/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-03-22 13:18:57.232398: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Aborted (core dumped)

Details of the environment where the code breaks:

AWS g4dn.12xlarge instance (4 x T4 GPUs)

TensorFlow version: 2.11.0

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
nvidia-smi
Wed Mar 22 13:21:22 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   40C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   39C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P8    15W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   39C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

What is the issue and how can I fix it?
Any help would be much appreciated! : )

Hi @n0obcoder

Welcome to the TensorFlow Forum!

After installing TensorFlow 2.11, you also need to install compatible versions of CUDA and cuDNN to enable GPU support, as listed in the tested build configurations for GPU (CUDA 11.2 and cuDNN 8.1 for TensorFlow 2.11). A mismatch between these versions can cause the above error.
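One way to check this is to ask TensorFlow which CUDA and cuDNN versions it was built against and compare them with what is installed on the machine. The snippet below is only a rough sketch: `matches_build` is a hypothetical helper written for this post, not part of TensorFlow, and the `"11.2.152"` string is simply the toolkit version that `nvcc --version` reports in the question. The exact strings returned by `tf.sysconfig.get_build_info()` (for example `"8"` versus `"8.1"` for cuDNN) may vary between builds, so treat the comparison as a hint rather than a definitive check.

```python
def matches_build(built, installed):
    """Return True when `installed` agrees with the version TensorFlow was built against.

    Hypothetical helper: only the components present in `built` are compared,
    so a build value of "8" matches an installed "8.1.0".
    """
    built_parts = str(built).split(".")
    installed_parts = str(installed).split(".")
    return installed_parts[:len(built_parts)] == built_parts


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        # Build-time information for the installed TensorFlow package
        info = tf.sysconfig.get_build_info()
        print("TensorFlow:", tf.__version__)
        print("Built against CUDA:", info.get("cuda_version"))
        print("Built against cuDNN:", info.get("cudnn_version"))
        print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

        # Assumption: "11.2.152" is the toolkit version taken from the nvcc output above
        print("CUDA matches nvcc:", matches_build(info.get("cuda_version", ""), "11.2.152"))
```

If the CUDA line prints `False`, or no GPUs are listed, the installed toolkit probably does not match the TensorFlow build and is worth fixing first.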

Please try again and let us know if the issue still persists. Thank you.