Trying to use AutoTokenizer with TensorFlow gives: `ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).`

maifeeulasad · December 22, 2022, 7:38am

Is there a way to tokenize tf.string inputs with AutoTokenizer from transformers? That would let us use transformers inside existing TensorFlow models, and it should be a lot faster.

It would also open up many possibilities, such as running multiple models in parallel and concatenating their outputs.

Let’s say I have this piece of code:

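import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.models import Model
from transformers import AutoTokenizer, TFAutoModel

def get_model():
    # Symbolic Keras input holding raw strings
    text_input = Input(shape=(), dtype=tf.string, name='text')

    MODEL = "ping/pong"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    transformer_layer = TFAutoModel.from_pretrained(MODEL)

    # This call raises the ValueError: the tokenizer receives a tensor, not str / List[str]
    preprocessed_text = tokenizer(text_input)
    outputs = transformer_layer(preprocessed_text)

    output_sequence = outputs['sequence_output']
    x = Flatten()(output_sequence)
    # NUM_CLASS is assumed to be defined elsewhere in the notebook
    x = Dense(NUM_CLASS, activation='sigmoid')(x)

    model = Model(inputs=[text_input], outputs=[x])
    return model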

But this gives me an error saying:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_27/788693747.py in <module>
      1 optimizer = Adam()
----> 2 model = get_model()
      3 model.compile(loss=CategoricalCrossentropy(from_logits=True),optimizer=optimizer,metrics=[Accuracy(), ],)
      4 model.summary()

/tmp/ipykernel_27/330097806.py in get_model()
      6 
      7     text_input = Input(shape=(), dtype=tf.string, name='text')
----> 8     preprocessed_text = tokenizer(text_input)
      9     outputs = transformer_layer(preprocessed_text)
     10 

/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2466         if not _is_valid_text_input(text):
   2467             raise ValueError(
-> 2468                 "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
   2469                 "or `List[List[str]]` (batch of pretokenized examples)."
   2470             )

ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

Marco_Filippone · January 9, 2023, 3:58am

I’m having the same problem.

I’ve been struggling with this for half a day and just found this issue: Allow tensorflow tensors as input to Tokenizer · Issue #8495 · huggingface/transformers · GitHub

Now I’m considering moving to PyTorch ¯_(ツ)_/¯

Marco_Filippone · January 9, 2023, 4:23am

If you are still interested, I managed to work around the issue by writing the tokenizer vocabulary to a file (one token per line, ordered by token id) and then initializing a BertTokenizer (tfm.nlp.layers.BertTokenizer | TensorFlow v2.11.0) from that file.
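
A rough sketch of the vocabulary-export step, in case it helps. The helper name `write_vocab_file` and the toy vocabulary are made up for illustration; with a real model the dictionary would come from `tokenizer.get_vocab()`. This assumes a BERT-style WordPiece vocabulary whose ids are contiguous and start at 0, so that the line number in the file equals the token id. The TensorFlow part is left as comments because it needs `tensorflow-models-official` installed, and the right `lower_case` value depends on the model.

```python
import os
import tempfile


def write_vocab_file(vocab, path):
    """Write one token per line, sorted by token id, and return the ordered tokens.

    `vocab` maps token -> id, like the dict returned by tokenizer.get_vocab().
    """
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    ids = [token_id for _, token_id in ordered]
    # Assumption: ids are 0..N-1 with no gaps; otherwise line numbers would not match ids.
    if ids != list(range(len(ids))):
        raise ValueError("token ids must be contiguous and start at 0")
    with open(path, "w", encoding="utf-8") as f:
        for token, _ in ordered:
            f.write(token + "\n")
    return [token for token, _ in ordered]


# Toy vocabulary standing in for AutoTokenizer.from_pretrained(MODEL).get_vocab()
toy_vocab = {"[UNK]": 1, "[PAD]": 0, "hello": 3, "[CLS]": 2}
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
tokens_in_order = write_vocab_file(toy_vocab, vocab_path)

# With the real libraries installed, the remaining steps would look roughly like:
#   import tensorflow_models as tfm
#   tf_tokenizer = tfm.nlp.layers.BertTokenizer(vocab_file=vocab_path, lower_case=True)
#   token_ids = tf_tokenizer(text_input)  # takes a tf.string tensor
```

The point of sorting by id is that the in-graph tokenizer then assigns each token the same id the pretrained model expects, which is what lets the `tf.string` input stay inside the Keras model.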

maifeeulasad · January 9, 2023, 9:43am

@Marco_Filippone can you please share your notebook? That would be really helpful. Thanks.