When running with TPUStrategy on larger pods, I get this cryptic error message. The model works fine on smaller datasets. How can I find out what the true error was, either by logging onto one of the TPU machines or some other way? This is the entire traceback I get, which makes it nearly impossible to figure out what is going on:
    history = model.fit(
  File "/usr/local/lib/python3.10/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py", line 1123, in _numpy
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.UnavailableError: Socket closed
Exception ignored in atexit callback: <function async_wait at 0x7f2c5b06eb00>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/context.py", line 2673, in async_wait
    context().sync_executors()
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/context.py", line 717, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.UnavailableError: 2 root error(s) found.
  (0) UNAVAILABLE: Socket closed
  (1) UNAVAILABLE: Connection reset by peer
0 successful operations.
0 derived errors ignored.