[Help!] Using pretrained Embeddings on TPU

I am getting this error when I use a pretrained embedding layer:

Epoch 1/10
---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
<ipython-input-7-e9085b1a50d7> in <cell line: 1>()
----> 1 history = model.fit(train_ds,batch_size=BATCH_SIZE,steps_per_epoch=train_steps,epochs=10,validation_data=valid_ds,validation_steps=valid_steps)

1 frames
/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py in error_handler(*args, **kwargs)
     68             # To get the full stack trace, call:
     69             # `tf.debugging.disable_traceback_filtering()`
---> 70             raise e.with_traceback(filtered_tb) from None
     71         finally:
     72             del filtered_tb

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py in _numpy(self)
   1107       return self._numpy_internal()
   1108     except core._NotOkStatusException as e:  # pylint: disable=protected-access
-> 1109       raise core._status_to_exception(e) from None  # pylint: disable=protected-access
   1110 
   1111   @property

InternalError: RET_CHECK failure (third_party/tensorflow/core/tpu/graph_rewrite/distributed_tpu_rewrite_pass.cc:2008) arg_shape.handle_type != DT_INVALID  input edge: [id=6646 Func/while/body/_1/input/_1330:0 -> while/cluster_while_body_146058:634]


My Full code:



!pip3 install -q -U tensorflow-text
from IPython.display import clear_output
import tensorflow as tf
import numpy as np
from google.colab import auth
auth.authenticate_user()
import os
import tensorflow_datasets as tfds
import tensorflow_hub as hub
from tensorflow import keras
import tensorflow_text as text

tpu_resolver = tf.distribute.cluster_resolver.TPUClusterResolver("grpc://"+os.environ["COLAB_TPU_ADDR"])
tf.config.experimental_connect_to_cluster(tpu_resolver)
tf.tpu.experimental.initialize_tpu_system(tpu_resolver)
strategy = tf.distribute.TPUStrategy(tpu_resolver)

(train_raw, valid_raw),ds_info = tfds.load(
    name="imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    try_gcs=True,
    with_info=True
)


BATCH_SIZE = 16 * 8
train_size = ds_info.splits['train'].num_examples # 25000
valid_size = ds_info.splits['test'].num_examples # 25000
train_steps = train_size // BATCH_SIZE
valid_steps = valid_size // BATCH_SIZE


train_ds = train_raw.shuffle(8000)
train_ds = train_ds.repeat()
train_ds = train_ds.batch(BATCH_SIZE,drop_remainder=True)
train_ds = train_ds.prefetch(-1)
valid_ds = valid_raw.batch(BATCH_SIZE,drop_remainder=True)
valid_ds = valid_ds.prefetch(-1)

with strategy.scope():
    load_locally = tf.saved_model.LoadOptions(experimental_io_device="/job:localhost")
    inp_ = keras.layers.Input(shape=[],dtype=tf.string)
    z = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2",load_options=load_locally)(inp_)
    z = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base-br/1",trainable=True,load_options=load_locally)(z)
    z = keras.layers.Lambda(lambda z: z['default'])(z)
    z = keras.layers.Flatten()(z)
    z = keras.layers.Dense(64,"relu")(z)
    out_ = keras.layers.Dense(1,"sigmoid")(z)
    model = keras.models.Model(inputs=[inp_],outputs=[out_])
    model.compile(loss="binary_crossentropy", optimizer="nadam",metrics=["accuracy"],steps_per_execution=20)

history = model.fit(train_ds,steps_per_epoch=train_steps,epochs=10,validation_data=valid_ds,validation_steps=valid_steps)


I scoured Stack Overflow for the same problem and came across one post that said the issue was solved by changing steps_per_epoch. I decreased and increased the steps, but I got the same error again and again.

Thank you in advance!

Hi Sohail,

I think the problem is related to running the preprocessing model on the TPU.
Since you are loading it inside the TPU strategy scope, it is executed on the TPU, and those preprocessing models don't work on TPUs.

I’d try to adapt your code to something similar to this one: Solve GLUE tasks using BERT on TPU  |  Text  |  TensorFlow

The main idea is that the preprocessing is done separately from the model fine-tuning: it's executed as a preprocessing step on the training data.
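
Something along these lines (a rough sketch only, adapted from your snippet and the tutorial's pattern; the BERT-style input keys and the sequence length of 128 are my assumptions about the preprocessor's defaults, so please double-check against the tutorial):

# Load the preprocessor OUTSIDE the strategy scope and run it in the
# tf.data pipeline, so only numeric tensors ever reach the TPU.
load_locally = tf.saved_model.LoadOptions(experimental_io_device="/job:localhost")
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2",
    load_options=load_locally)

def preprocess(text, label):
    return preprocessor(text), label

train_ds = train_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
valid_ds = valid_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)

with strategy.scope():
    seq_len = 128  # assumed default sequence length of the preprocessor
    encoder_inputs = {
        name: keras.layers.Input(shape=[seq_len], dtype=tf.int32, name=name)
        for name in ["input_word_ids", "input_mask", "input_type_ids"]
    }
    z = hub.KerasLayer(
        "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base-br/1",
        trainable=True, load_options=load_locally)(encoder_inputs)
    z = keras.layers.Dense(64, "relu")(z["default"])
    out_ = keras.layers.Dense(1, "sigmoid")(z)
    model = keras.models.Model(inputs=encoder_inputs, outputs=out_)
    model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"], steps_per_execution=20)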

Can you try it and let me know please?


Thanks brother Igusm for responding. Yes, I did go through the GLUE-tasks-with-BERT notebook and successfully created what I wanted.
But I still want to know why it doesn't work, because I ran into the same issue again today, and this time I didn't use any hub layer, only built-in Keras layers. Please go through the code and tell me at which part of the code the error is happening. I just want to know why the error comes up, and if you could also point out some relevant articles where I can learn more about TPU strategies, I would be grateful.

By the way, this works perfectly fine on CPU (~15:00 min) and GPU (~3:00 min).
It also works on TPU with CentralStorageStrategy (~1:50 min), but gives the RET_CHECK failure with TPUStrategy.

This is the notebook

# The model
class NLP(keras.Model):

    def __init__(self,en_vec_layer,es_vec_layer,vocab_size=1000,embed_size=128,**kwargs):

        super(NLP,self).__init__(**kwargs)
        self.en_vec_layer = en_vec_layer
        self.es_vec_layer = es_vec_layer
        self.en_embed = keras.layers.Embedding(vocab_size,embed_size)
        self.es_embed = keras.layers.Embedding(vocab_size,embed_size)
        self.en_encoder = keras.layers.LSTM(512,return_state=True)
        self.es_decoder = keras.layers.LSTM(512,return_sequences=True)
        self.out = keras.layers.Dense(vocab_size,"softmax")

    def call(self,inputs):

        en_input = inputs[0]
        es_input = inputs[1]
        en_encoded_out = self.en_vec_layer(en_input)
        es_encoded_out = self.es_vec_layer(es_input)
        en_embed_out = self.en_embed(en_encoded_out)
        es_embed_out = self.es_embed(es_encoded_out)
        encoder_out,*en_state = self.en_encoder(en_embed_out)
        decoder_out = self.es_decoder(es_embed_out,initial_state=en_state)
        dense_out = self.out(decoder_out)
        return dense_out

# implementation
with strategy.scope():
    train_size = 100_000
    valid_size = total_size-train_size
    BATCH_SIZE = 50*8
    en_vec_layer,es_vec_layer = get_layers()
    X_train,y_train,X_valid,y_valid = get_dataset(en_text,es_text,es_vec_layer,train_size=train_size)
    nlp_model = NLP(en_vec_layer,es_vec_layer)
    nlp_model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer="adam",
        metrics=["accuracy"],
        steps_per_execution=50
    )
    train_steps = train_size//BATCH_SIZE
    valid_steps = valid_size//BATCH_SIZE
    nlp_model.build(input_shape=[])

history = nlp_model.fit(X_train,y_train,epochs=10,batch_size=16*8,validation_data=(X_valid,y_valid),steps_per_epoch=train_steps,validation_steps=valid_steps)

I got the same error

Epoch 1/10
---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
<ipython-input-8-115e32c4ee1b> in <cell line: 1>()
----> 1 history = nlp_model.fit(X_train,y_train,epochs=10,batch_size=16*8,validation_data=(X_valid,y_valid),steps_per_epoch=train_steps,validation_steps=valid_steps)

1 frames
/usr/local/lib/python3.10/dist-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
     68             # To get the full stack trace, call:
     69             # `tf.debugging.disable_traceback_filtering()`
---> 70             raise e.with_traceback(filtered_tb) from None
     71         finally:
     72             del filtered_tb

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py in _numpy(self)
   1126       return self._numpy_internal()
   1127     except core._NotOkStatusException as e:  # pylint: disable=protected-access
-> 1128       raise core._status_to_exception(e) from None  # pylint: disable=protected-access
   1129 
   1130   @property

InternalError: RET_CHECK failure (third_party/tensorflow/core/tpu/graph_rewrite/distributed_tpu_rewrite_pass.cc:2008) arg_shape.handle_type != DT_INVALID  input edge: [id=2074 nlp_text_vectorization_1_string_lookup_1_none_lookup_lookuptablefindv2_table_handle:0 -> cluster_train_function:99]

Thank you for the further responses. I will reply ASAP, and I am sorry for the delayed response, as I was studying the notebook on the BERT embeddings.

Yoooo… bro, I remember you… you are the one who wrote the starter notebook for the Google sign language competition that's going on right now. Can't believe one of the big shots replied to this. :exploding_head:

let me try to understand better.

You changed your previous model that used the Universal Sentence Encoder, created the preprocessing model outside of the distribution strategy scope (like the BERT+GLUE tutorial), and it worked? Is that correct?

In terms of the error, I think your new model might be in the same situation as before, in that there are types the TPU cannot process (e.g. string). More information here: Troubleshooting TensorFlow - TPU  |  Google Cloud
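
One quick thing you can check (just a sketch, assuming X_train is the tuple of tensors you pass to fit): print the dtypes that actually reach the training step. If anything is tf.string, the string lookup ends up inside the TPU-compiled graph, which matches the lookup-table handle mentioned in your error.

for t in tf.nest.flatten((X_train, y_train)):
    print(t.dtype)  # every dtype here should be numeric, never tf.string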

Does it make sense?


I am sorry sir, I wasn't clear enough. The main purpose of opening this thread has actually been fulfilled. I originally wanted to build a sentiment model for IMDB reviews using TensorFlow Hub pretrained embeddings. The book I have been following uses the Universal Sentence Encoder (USE), but the author says it will be slow and recommends a GPU instead of a CPU; I wanted to try it on a TPU to save even more time. I tried USE but couldn't get it to work, posted this on the forum, started looking for resources, came across bert_glue.ipynb, studied it, and made it work for my purpose.
This is the notebook that uses the pretrained BERT embeddings for sentiment analysis.

Now I am trying something different but still getting the same error. I am using a normal Keras embedding layer in a simple encoder-decoder network for English-to-Spanish translation. This works with CentralStorageStrategy but not with TPUStrategy, and that's what I am curious about.
This is the notebook

If I am getting this error again, then I think the reason is not the hub layers, but I can't pinpoint exactly why it happens, so I am asking you.

Oh, so you mean what bert_glue.ipynb is trying to convey is that there should be no string-to-token vectorization layers inside the model, because the TPU doesn't support them, and that's why the preprocessing is done in the dataset itself? Is that right, sir?

No way!!! I am thankful to you, sir, and angry at myself. Even though I studied the bert_glue.ipynb notebook, this simple mistake went over my head. After I removed the string inputs and fed the model text-vectorized inputs, voilà, it works now. Now I understand that the answer all along was not to use string inputs within the scope of the strategy.

Thank you very much for responding.

By the way, I changed it to this:


# The model (removed the vectorization layers from the model)
class NLP(keras.Model):

    def __init__(self,vocab_size=1000,embed_size=128,**kwargs):

        super(NLP,self).__init__(**kwargs)
        self.en_embed = keras.layers.Embedding(vocab_size,embed_size)
        self.es_embed = keras.layers.Embedding(vocab_size,embed_size)
        self.en_encoder = keras.layers.LSTM(512,return_state=True)
        self.es_decoder = keras.layers.LSTM(512,return_sequences=True)
        self.out = keras.layers.Dense(vocab_size,"softmax")

    def call(self,inputs):

        en_input = inputs[0]
        es_input = inputs[1]
        en_embed_out = self.en_embed(en_input)
        es_embed_out = self.es_embed(es_input)
        encoder_out,*en_state = self.en_encoder(en_embed_out)
        decoder_out = self.es_decoder(es_embed_out,initial_state=en_state)
        dense_out = self.out(decoder_out)
        return dense_out

# implementation
with strategy.scope():
    train_size = 100_000
    valid_size = total_size-train_size
    BATCH_SIZE = 50*8
    en_vec_layer,es_vec_layer = get_layers()
    X_train,y_train,X_valid,y_valid = get_dataset(en_text,es_text,en_vec_layer,es_vec_layer,train_size=train_size) # Added the text vectorization in the get_dataset so all the inputs are just int32
    nlp_model = NLP()  # vectorization layers are no longer passed in; vocab_size/embed_size keep their defaults and must match the vectorization layers' vocabulary
    nlp_model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer="adam",
        metrics=["accuracy"],
        steps_per_execution=50
    )
    train_steps = train_size//BATCH_SIZE
    valid_steps = valid_size//BATCH_SIZE
    nlp_model.build(input_shape=[])

history = nlp_model.fit(X_train,y_train,epochs=10,batch_size=BATCH_SIZE,validation_data=(X_valid,y_valid),steps_per_epoch=train_steps,validation_steps=valid_steps) # batch_size now matches the BATCH_SIZE used to compute the step counts
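
For completeness, a hypothetical sketch of what moving the vectorization into get_dataset can look like (the real helper lives in the notebook and differs in the details, e.g. how the decoder targets are shifted):

def get_dataset(en_text, es_text, en_vec_layer, es_vec_layer, train_size):
    # Vectorize once, up front (outside the TPU), so the model only ever sees int tensors.
    en_ids = en_vec_layer(en_text)
    es_ids = es_vec_layer(es_text)
    # Teacher forcing: the decoder reads the target sequence shifted by one step.
    X_train = (en_ids[:train_size], es_ids[:train_size, :-1])
    y_train = es_ids[:train_size, 1:]
    X_valid = (en_ids[train_size:], es_ids[train_size:, :-1])
    y_valid = es_ids[train_size:, 1:]
    return X_train, y_train, X_valid, y_valid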