How to use warmstart_embedding_matrix with TFT vocabularies

I have a general question about how to do warm starting with models that use text features, in the context of TFX. I’ve tried to simplify the real problem into something smaller and more idealized. If I have:

  • A Keras Model
  • That uses TFT, including feature(s) that use compute_and_apply_vocabulary to transform text into a one-hot encoded representation

How can I do warm start with TFX, assuming that the vocabulary will slightly change over time? I know the following:

  • base_model is an input to the trainer component that you can use to load a previous model to do warm starting
  • there exists a warmstart_embedding_matrix, which seems built for this purpose

But what isn’t clear is how to correctly connect the TFT vocabularies from the old and new models to this warmstart embedding matrix. The existing examples seem to assume a simplified scenario, and they appear specific to TextVectorization, which works a bit differently from the TFT vocabulary solutions.
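For concreteness, here’s a minimal sketch of the kind of setup I mean (the feature name, vocabulary size, and OOV bucket count are all hypothetical):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """TFT preprocessing: map a raw text feature to vocabulary indices."""
    token_ids = tft.compute_and_apply_vocabulary(
        inputs['text_feature'],               # hypothetical feature name
        top_k=10000,                          # cap on vocabulary size
        num_oov_buckets=10,                   # hash buckets for unseen tokens
        vocab_filename='text_feature_vocab',  # named so it can be looked up later
    )
    return {'token_ids': token_ids}
```

The resulting integer indices then feed a one-hot encoding or an embedding layer in the Keras model.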

TFT doesn’t have an out-of-the-box solution for vocabulary warm starting (only implicitly, by operating on a rolling range of data), but Trainer may offer a solution for this. Checking now.

Trainer allows you to provide a warm-start model (base_model), from which you could extract the old TFT graph and, ostensibly, the old vocabularies. warmstart_embedding_matrix seems designed to solve exactly this changing-vocabulary problem, which makes it look like the right fit for TFT, but it’s not clear how to wire it up correctly, especially since some TFT vocabularies use hash buckets for OOV tokens.
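As an untested sketch of how the old and new vocabularies might be recovered (it assumes the old Transform output that produced base_model is resolvable, and that the vocabulary was created with vocab_filename='text_feature_vocab'):

```python
import tensorflow as tf
import tensorflow_transform as tft

def load_vocab(transform_output_dir, vocab_name='text_feature_vocab'):
    """Read a TFT vocabulary file (one token per line) from a Transform output."""
    tft_output = tft.TFTransformOutput(transform_output_dir)
    vocab_path = tft_output.vocabulary_file_by_name(vocab_name)
    with tf.io.gfile.GFile(vocab_path, 'r') as f:
        return [line.rstrip('\n') for line in f]

# Both directories are placeholders: OLD_TRANSFORM_DIR is however you resolve
# the Transform output that produced base_model; NEW_TRANSFORM_DIR is the
# Transform output feeding the current Trainer run.
old_vocab = load_vocab(OLD_TRANSFORM_DIR)
new_vocab = load_vocab(NEW_TRANSFORM_DIR)
```

One caveat: warmstart_embedding_matrix only remaps rows for tokens that appear in the explicit vocabularies, so any rows corresponding to OOV hash buckets would have to be handled separately (copied positionally or re-initialized).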

@rclough_spotify
The tf.keras.utils.warmstart_embedding_matrix function provided by TensorFlow expects an old vocabulary and a new vocabulary. These vocabularies can be in the form of an array/tensor or a text file. The new vocabulary may have different tokens, order, or size compared to the old vocabulary. The order of the tokens is used as the lookup index for the embedding matrix. If you have changing vocabularies during training, you can specify the old and new vocabularies as base_vocabulary and new_vocabulary, respectively.
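A minimal, self-contained example of the call, with toy vocabularies and dimensions:

```python
import tensorflow as tf

base_vocabulary = ['the', 'cat', 'sat']        # vocabulary from the old run
new_vocabulary = ['the', 'dog', 'cat', 'ran']  # grown/reordered vocabulary

# The trained matrix from the old model: one row per old token.
base_embeddings = tf.random.normal([len(base_vocabulary), 8])

remapped = tf.keras.utils.warmstart_embedding_matrix(
    base_vocabulary=base_vocabulary,
    new_vocabulary=new_vocabulary,
    base_embeddings=base_embeddings,
    new_embeddings_initializer='uniform',  # used for tokens not in the old vocab
)
print(remapped.shape)  # (4, 8): rows for 'the' and 'cat' carry over their values
```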

The base_embeddings parameter represents the currently trained embedding matrix, and new_embeddings_initializer represents the initialization values for the new embeddings corresponding to the new vocabulary. The utility function will return a remapped embedding matrix, which you need to assign to the embedding matrix of your model’s layer, as demonstrated in the guide. Then, you can resume training your model.
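Continuing the sketch above, and assuming the new model was built so its embedding layer already has shape (len(new_vocabulary), embedding_dim) — the layer name here is hypothetical:

```python
# Overwrite the embedding layer's weights with the remapped matrix,
# then resume training.
embedding_layer = model.get_layer('token_embedding')  # hypothetical layer name
embedding_layer.embeddings.assign(remapped)

model.fit(train_dataset)  # resume training with the warm-started embeddings
```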