[TPU/XLA] Unable to find the relevant tensor remote_handle

Hi, I’m using the keras_cv_attention_models repository (linked below) to train a CoAtNet model on TPUs. I’m running on Colab Pro with a high-RAM instance.

Colab Reproduction

Code:

!pip install keras_cv_attention_models
!git clone https://github.com/leondgarse/keras_cv_attention_models.git
!cd keras_cv_attention_models; TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 ./train_script.py -m coatnet.CoAtNet0 --seed 0 --batch_size 128 -s CoAtNet0_160 --TPU -d 'cifar10' --disable_float16

which should automatically download the dataset and execute the training script.
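For context, my understanding is that the --TPU flag makes the script connect to the Colab TPU through the usual resolver flow, roughly like the sketch below (this is the standard Colab pattern as I understand it, not the repo’s exact code):

import tensorflow as tf

# Standard Colab TPU setup (assumption: train_script.py does something equivalent)
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()  # picks up the Colab TPU address
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # model is built and compiled here
    ...

The "[TPU] All devices" line in the output below is printed after this initialization, so the connection itself seems fine.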

Error traceback / stdout:

>>>> ALl args: Namespace(TPU=True, additional_model_kwargs={}, basic_save_name='CoAtNet0_160', batch_size=128, bce_threshold=0.2, cutmix_alpha=1.0, data_name='cifar10', disable_antialias=False, disable_float16=False, disable_positional_related_ops=False, distill_loss_weight=1, distill_temperature=10, enable_float16=True, epochs=-1, eval_central_crop=0.95, freeze_backbone=False, freeze_norm_layers=False, initial_epoch=0, input_shape=160, label_smoothing=0, lr_base_512=0.008, lr_cooldown_steps=5, lr_decay_on_batch=False, lr_decay_steps=100, lr_m_mul=0.5, lr_min=1e-06, lr_t_mul=2, lr_warmup=0.0001, lr_warmup_steps=5, magnitude=6, mixup_alpha=0.1, model='coatnet.CoAtNet0', num_layers=2, optimizer='LAMB', pretrained=None, random_crop_min=0.08, random_erasing_prob=0, rescale_mode='torch', resize_method='bicubic', restore_path=None, seed=0, summary=False, teacher_model=None, teacher_model_input_shape=-1, teacher_model_pretrained='imagenet', tensorboard_logs='auto', token_label_file=None, token_label_loss_weight=0.5, weight_decay=0.02)
2022-05-21 23:25:23.558107: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[TPU] All devices:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU')]
>>>> Set random seed: 0
2022-05-21 23:25:40.850248: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "NOT_FOUND: Error executing an HTTP request: HTTP response code 404".

>>>> init_model kwargs: {'input_shape': (160, 160, 3)}
>>>> Built model name: coatnet0
>>>> RandAugment: magnitude = 6, translate_const = 0.450000, cutout_const = 28.800000
>>>> Both mixup_alpha and cutmix_alpha provided: mixup_alpha = 0.1, cutmix_alpha = 1.0
>>>> Loss: BinaryCrossEntropyTimm, Optimizer: LAMB
>>>> basic_save_name = CoAtNet0_160
>>>> TensorBoard log path: logs/CoAtNet0_160_20220521-232552
Traceback (most recent call last):
  File "./train_script.py", line 223, in <module>
    run_training_by_args(args)
  File "./train_script.py", line 214, in run_training_by_args
    model, epochs, train_dataset, test_dataset, args.initial_epoch, lr_scheduler, args.basic_save_name, logs=args.tensorboard_logs
  File "/content/keras_cv_attention_models/keras_cv_attention_models/imagenet/train_func.py", line 251, in train
    workers=8,
  File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1315, in graph
    "Tensor.graph is undefined when eager execution is enabled.")
AttributeError: Tensor.graph is undefined when eager execution is enabled.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 2685, in async_wait
    context().sync_executors()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 740, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.NotFoundError: Resource tpu_worker/_AnonymousVar7044/N10tensorflow22SummaryWriterInterfaceE does not exist.
	Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
2022-05-21 23:25:53.361613: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 6693, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1653175553.358294014","description":"Error received from peer ipv4:10.19.39.50:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 6693, Output num: 0","grpc_status":3}

I tried disabling eager execution, but it doesn’t help.
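In case it matters, this is roughly how I tried disabling it (calling the v1 switch before anything else; I may simply be putting it in the wrong place in train_script.py):

import tensorflow as tf

# Attempted workaround: turn off eager execution globally.
# This has to run before any other TF calls to take effect.
tf.compat.v1.disable_eager_execution()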
Can anyone help me figure out how this error arises, and how I can debug these admittedly cryptic XLA/TPU errors in the future?