Hello everyone! : )
I had a working recommender system model written with tensorflow-gpu==2.3.0. I tried to migrate it to tensorflow-gpu==2.11.0. Training now runs fine for a few batches and then aborts with the following error:
2023-03-22 13:17:54.385699: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2023-03-22 13:17:54.385710: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2023-03-22 13:17:54.385716: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (2): Tesla T4, Compute Capability 7.5
2023-03-22 13:17:54.385722: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (3): Tesla T4, Compute Capability 7.5
2023-03-22 13:17:55.180084: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-03-22 13:17:59.593833: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-03-22 13:18:57.142924: E tensorflow/compiler/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-03-22 13:18:57.232398: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Aborted (core dumped)
Details of the environment where the code breaks:
AWS g4dn.12xlarge instance (4 x T4 GPUs)
TensorFlow version: 2.11.0
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
nvidia-smi
Wed Mar 22 13:21:22 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   40C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   39C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P8    15W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   39C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
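In case it is useful, the same kind of information can also be collected from inside Python, which shows the CUDA/cuDNN versions the installed TensorFlow wheel was built against rather than what nvcc or the driver report. This is only a minimal sketch: `collect_env_report` is a helper name made up for illustration, and it assumes `tf.sysconfig.get_build_info()` is available (TensorFlow 2.3 or later). I have not included its output here.

```python
import importlib.util
import shutil


def collect_env_report():
    """Return a dict describing the TensorFlow/CUDA setup visible to Python.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    report = {}
    if importlib.util.find_spec("tensorflow") is None:
        # TensorFlow is not installed in this interpreter.
        report["tensorflow"] = None
    else:
        import tensorflow as tf

        report["tensorflow"] = tf.__version__
        # Build info is dict-like; CUDA keys may be absent on CPU-only builds.
        build = tf.sysconfig.get_build_info()
        report["built_with_cuda"] = build.get("cuda_version")
        report["built_with_cudnn"] = build.get("cudnn_version")
        # GPUs that TensorFlow can actually see at runtime.
        report["visible_gpus"] = [
            d.name for d in tf.config.list_physical_devices("GPU")
        ]
    # Locations of the command-line tools quoted above, if on PATH.
    report["nvcc_path"] = shutil.which("nvcc")
    report["nvidia_smi_path"] = shutil.which("nvidia-smi")
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Comparing `built_with_cuda` and `built_with_cudnn` against the toolkit reported by nvcc would show whether the installed wheel and the local CUDA install line up.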
What is the issue and how can I fix it?
Any help would be much appreciated! : )