NLP TextVectorization tokenizer

Bondi_French · October 18, 2022, 3:38am

Hi,
In previous versions of TF, we could create tokenizer = Tokenizer() and then call tokenizer.fit_on_texts(input), where input was a list of texts (in my case, a pandas DataFrame column containing a list of texts). Unfortunately, this has been deprecated.
Is there a way to replicate this behaviour with TextVectorization?

Additionally, how can we split a string on uppercase letters, for instance ‘ListOfHorrorMovies’?
I understand I need to use the standardize argument of TextVectorization.
Thanks

Kiran_Sai_Ramineni · January 18, 2023, 4:38pm

Hi @Bondi_French, you can use the tf.keras.layers.TextVectorization layer to replicate the same behavior. For more details, please go through the code example below.

import re
 
# Initialise the list of strings
sentences = [
    'ILoveMyDog',
    'ILoveMyCat',
    'YouLoveMyDog!',
    'DoYouThinkMyDogIsAmazing?'
] 
 
# Split each string at every uppercase letter using re
res_list = [re.findall('[A-Z][^A-Z]*', sentence) for sentence in sentences]

# Join the pieces with spaces and print the result
processed_sentences = [" ".join(parts) for parts in res_list]
print(processed_sentences)
# Output:
# ['I Love My Dog', 'I Love My Cat', 'You Love My Dog!', 'Do You Think My Dog Is Amazing?']
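
Note that the pattern '[A-Z][^A-Z]*' used above only starts matching at an uppercase letter, so a leading lowercase run (as in 'listOfHorrorMovies') would be dropped. As a minimal sketch, assuming you want to keep such prefixes, a zero-width re.sub can insert a space before every uppercase letter instead. The helper name split_on_uppercase is hypothetical and not part of any library:

```python
import re


def split_on_uppercase(text):
    """Insert a space before every uppercase letter, except one at position 0."""
    # (?<!^) skips the very start of the string;
    # (?=[A-Z]) matches the empty position just before an uppercase letter.
    return re.sub(r'(?<!^)(?=[A-Z])', ' ', text)


print(split_on_uppercase('ListOfHorrorMovies'))  # List Of Horror Movies
print(split_on_uppercase('listOfHorrorMovies'))  # list Of Horror Movies
```

Like the findall version, this splits acronyms letter by letter ('HTTPServer' becomes 'H T T P Server'), so it may need adjusting if your data contains them.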

import tensorflow as tf

text_dataset = tf.data.Dataset.from_tensor_slices(processed_sentences)
max_features = 5000  # Maximum vocab size.
max_len = 10
# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len)

# Now that the vocab layer has been created, call `adapt` on the
# text-only dataset to create the vocabulary.
vectorize_layer.adapt(text_dataset)

# Create the model that uses the vectorize text layer
model = tf.keras.models.Sequential()

# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (so that each sample holds exactly one string), and the dtype
# needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))

# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing
# vocab indices.
model.add(vectorize_layer)

# Now, the model can map strings to integers, and you can add an
# embedding layer to map these integers to learned embeddings.
input_data=[
    'i really love my dog',
    'my dog loves my manatee'
]
model.predict(input_data)

# Output (0 is padding, 1 is the out-of-vocabulary token):
# 1/1 [==============================] - 0s 297ms/step
# array([[6, 1, 3, 2, 4, 0, 0, 0, 0, 0],
#        [2, 4, 1, 2, 1, 0, 0, 0, 0, 0]])
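
If you would rather do the uppercase split inside the layer, as the question suggests, the standardize argument of TextVectorization also accepts a callable. The following is only a sketch under a few assumptions: the helper name split_and_standardize is hypothetical, the punctuation stripping keeps ASCII letters, digits and spaces only (which is narrower than the layer's default standardization), and the batch size of 2 is arbitrary. A pandas column could be turned into a plain list first, for example with .tolist().

```python
import tensorflow as tf


def split_and_standardize(text):
    # Hypothetical helper: put a space before each uppercase letter so that
    # 'ILoveMyDog' becomes ' I Love My Dog' before tokenization.
    spaced = tf.strings.regex_replace(text, r"([A-Z])", r" \1")
    lowered = tf.strings.lower(spaced)
    # Assumption: drop everything except ASCII letters, digits and spaces.
    return tf.strings.regex_replace(lowered, r"[^a-z0-9 ]", "")


raw_sentences = [
    "ILoveMyDog",
    "ILoveMyCat",
    "YouLoveMyDog!",
    "DoYouThinkMyDogIsAmazing?",
]

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=5000,
    standardize=split_and_standardize,
    output_mode="int",
    output_sequence_length=10,
)

# adapt() plays the role that fit_on_texts() had: it builds the vocabulary.
vectorizer.adapt(tf.data.Dataset.from_tensor_slices(raw_sentences).batch(2))

vocabulary = vectorizer.get_vocabulary()
token_ids = vectorizer(tf.constant(["ILoveMyManatee"]))
print(vocabulary)
print(token_ids.numpy())
```

With this approach the raw strings can be fed to the layer directly, so the separate re preprocessing step is not needed. One trade-off to be aware of is that a layer with a custom callable may need extra care when saving and reloading the model.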

Thank You.