Getting word ids back to strings

fepac · September 13, 2022, 1:45am

Hi there!

My idea is to see which words my network pays the most attention to. The problem is the following (check the code below):

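import tensorflow_hub as hub
import tensorflow_text  # registers the custom ops the preprocessing model needs

# Get the preprocessor from TF Hub
tfhub_handle_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)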

# Tokenize the text
text_test = ['Where are you going?']
text_preprocessed = bert_preprocess_model(text_test)

The text_preprocessed variable has three keys: input_word_ids, input_mask and input_type_ids. The values in input_word_ids are integers (which is fine), but I can't find anything in the documentation for this preprocessor that maps those integers back to their “token representation”.

Just for clarity: if the code outputs something like this

print(text_preprocessed["input_word_ids"][0, :12])
>> [ 101 2073 2024 2017 2183 1029  102    0    0    0    0    0]

Then, I should get something like this:

['w', '##hee', '##re', 'are', 'you', 'going', '?']

The only thing I’ve managed to get so far is the special tokens, using this code:

preprocessor = hub.load(tfhub_handle_preprocess)
preprocessor.tokenize.get_special_tokens_dict()
>> {'start_of_sequence_id': <tf.Tensor: shape=(), dtype=int32, numpy=101>,
 'mask_id': <tf.Tensor: shape=(), dtype=int32, numpy=103>,
 'end_of_segment_id': <tf.Tensor: shape=(), dtype=int32, numpy=102>,
 'padding_id': <tf.Tensor: shape=(), dtype=int32, numpy=0>,
 'vocab_size': <tf.Tensor: shape=(), dtype=int32, numpy=30522>}

Thank you, people.

Kiran_Sai_Ramineni · January 23, 2023, 11:15am

Hi @fepac, I don’t think you will get exactly the output you want; it depends on the vocab file of the model. For more details, please refer to this gist. Thank you.
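
To illustrate what "depends on the vocab file" means, here is a minimal sketch of the lookup, not the gist's code. It assumes you have the plain-text vocabulary that matches the uncased BERT model (one token per line, where the zero-based line number is the token id). The helper names load_vocab and ids_to_tokens, and the file name vocab.txt mentioned below, are hypothetical.

```python
from typing import Iterable, List, Sequence


def load_vocab(lines: Iterable[str]) -> List[str]:
    """Build an id -> token table; the position of each line is its id."""
    return [line.rstrip("\n") for line in lines]


def ids_to_tokens(
    ids: Sequence[int],
    vocab: Sequence[str],
    skip_ids: Sequence[int] = (0,),  # 0 is padding_id in the dict above
) -> List[str]:
    """Look up each id in the vocab, dropping any id listed in skip_ids."""
    skip = set(skip_ids)
    return [vocab[i] for i in ids if i not in skip]
```

With a real vocabulary file this would be used roughly as follows: open the file (assumed here to be vocab.txt) with encoding="utf-8", pass the file object to load_vocab, and call ids_to_tokens(text_preprocessed["input_word_ids"][0].numpy().tolist(), vocab). To also drop the start and end markers, pass skip_ids=(0, 101, 102), using the ids reported by get_special_tokens_dict(). The tokens you get back are whatever word pieces the vocabulary defines, so they may not split the way you expect.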