Where to see detailed logs of TPU pod errors

river_shah · February 24, 2023, 5:11pm

With a TPUStrategy running on larger pods, I am getting this cryptic error message. The model works fine on smaller datasets. How can I find out what the true error was, either by logging onto one of the TPU machines or some other way? This is the entire traceback I get, which makes it nearly impossible to figure out what is going on:

    history = model.fit(
  File "/usr/local/lib/python3.10/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py", line 1123, in _numpy
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.UnavailableError: Socket closed
Exception ignored in atexit callback: <function async_wait at 0x7f2c5b06eb00>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/context.py", line 2673, in async_wait
    context().sync_executors()
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/context.py", line 717, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.UnavailableError: 2 root error(s) found.
  (0) UNAVAILABLE: Socket closed
  (1) UNAVAILABLE: Connection reset by peer
0 successful operations.
0 derived errors ignored.
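
For reference, here is a rough sketch of how the client-side output could be made more verbose. It is not a confirmed fix. The traceback above passes through Keras's `traceback_utils` filter, so `tf.debugging.disable_traceback_filtering()` (TF 2.7+) should at least show the unfiltered stack. The environment variables `TF_CPP_MIN_LOG_LEVEL` and `TF_CPP_MAX_VLOG_LEVEL` control TensorFlow's C++ logging, but as far as I understand they only affect the process they are set in, so they will not by themselves surface logs from the remote TPU workers. The helper names below are made up for illustration.

```python
import os


def enable_verbose_tf_logging(vlog_level: int = 1) -> dict:
    """Set TensorFlow's C++ logging env vars for this process.

    Hypothetical helper. Call it BEFORE importing tensorflow, since the
    C++ runtime reads these values when it first logs.
    """
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"  # 0 = show INFO and above
    os.environ["TF_CPP_MAX_VLOG_LEVEL"] = str(vlog_level)  # VLOG verbosity
    keys = ("TF_CPP_MIN_LOG_LEVEL", "TF_CPP_MAX_VLOG_LEVEL")
    return {k: os.environ[k] for k in keys}


def show_full_tracebacks() -> None:
    """Turn off Keras/TF traceback filtering (hypothetical wrapper)."""
    import tensorflow as tf  # imported lazily, after the env vars are set

    tf.debugging.disable_traceback_filtering()


settings = enable_verbose_tf_logging(vlog_level=1)
# In the training script, before building the TPUStrategy and calling
# model.fit, one would then call:
#     show_full_tracebacks()
```

This only covers the client process. The "Socket closed" and "Connection reset by peer" messages say that a connection to a worker dropped, so the root cause is presumably recorded on the worker side, which is the part I still don't know how to inspect.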