TensorFlow code running on Colab TPU but not on Kaggle/GCE TPU VMs

Here are two notebooks running the same code on TPU, one on Kaggle and one on Colab.

On Kaggle, the code throws this exception, whereas it runs flawlessly on Colab.

NotFoundError: {{function_node __inference_call_113}} No registered 'TemporaryVariable' OpKernel for 'TPU' devices compatible with node {{node ArithmeticOptimizer/AddOpsRewrite_Add/tmp_var}}.
  Registered:  device='CPU'
  [[ArithmeticOptimizer/AddOpsRewrite_Add/tmp_var]]

This is also the same exception that TPUs on GCE throw.

The exception is raised when this line is executed:

hs = tf.add(hs, residual)
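For context, the add is a plain residual (skip) connection inside the model's call. A minimal sketch of the pattern (the layer and names here are hypothetical stand-ins, not my actual model):

import tensorflow as tf

class ResidualBlock(tf.keras.layers.Layer):
    # Hypothetical stand-in for the block containing the failing add.
    def __init__(self, units):
        super().__init__()
        self.dense = tf.keras.layers.Dense(units)

    def call(self, hs):
        residual = hs                # keep the input for the skip connection
        hs = self.dense(hs)
        hs = tf.add(hs, residual)    # the line where the exception is raised
        return hs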

I have also tried removing the embedding layer completely and generating the embeddings in a dummy batch generator instead. This change fixed the call function but raised the same error in the train_step method when computing gradients, which leads me to believe the error is not caused by the embedding layer, but I could be wrong.

I would love to know why this exception is being raised and why Colab can run the code but Kaggle and GCE runtimes cannot.

Any help to point me in the right direction to solve this bug would be greatly appreciated.

Thanks!

Which version of TensorFlow do you use on all of the notebooks? Is it the same? I have run into some instability in which operators and use cases are fully implemented by the various back-end tensor compilers. Also, the TPU code may not completely work on the latest TF release for a few days or a week, especially if it is a 2.x.0 release.

It really does help to hard-code the TF version at the beginning of all of your scripts.
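For example, something along these lines at the top of the notebook (2.8.0 here is only an illustration, pin whichever release you know works on your TPU runtime):

!pip install -q tensorflow==2.8.0
import tensorflow as tf
print(tf.__version__)  # confirm the runtime actually picked up the pinned release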


Hello Lance_N,

Thank you, hard-coding the TensorFlow version to 2.7.0/2.8.0 using the snippet below worked like a charm on Kaggle.

!pip install cloud-tpu-client
import tensorflow as tf
from cloud_tpu_client import Client
print(tf.__version__)
# Switch the TPU worker to the same TF version as the notebook runtime,
# restarting the TPU only if the versions differ.
Client().configure_tpu_version(tf.__version__, restart_type='ifNeeded')

Unfortunately, on GCE, TPU client version switching did not work at all. Client() kept asking me to pass the TPU name, and neither the actual TPU name nor "local" worked; it said it could not retrieve the metadata.
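Roughly what I tried (the TPU name below is a placeholder for my actual TPU's name):

from cloud_tpu_client import Client

# Both attempts failed while trying to look up the TPU / its metadata:
Client(tpu='my-tpu-name').configure_tpu_version('2.8.0', restart_type='ifNeeded')
Client(tpu='local').configure_tpu_version('2.8.0', restart_type='ifNeeded')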

When creating the TPU VM, I tried using the latest tpu-vm-tf-2.8.0 software version and a couple of others, but it kept throwing the same exception, even though tf.__version__ was 2.8.0/2.7.0.

NotFoundError: {{function_node __inference_call_113}} No registered 'TemporaryVariable' OpKernel for 'TPU' devices compatible with node {{node ArithmeticOptimizer/AddOpsRewrite_Add/tmp_var}}.
  Registered:  device='CPU'
  [[ArithmeticOptimizer/AddOpsRewrite_Add/tmp_var]]

It would mean a lot to me if you could help me out once again.

I have not worked on Kaggle, and have only run the TPU examples. I do not know enough to help you.