How to Embed Categorical Features Which Are Strings in TensorFlow?

barmanroys · May 5, 2022, 8:54am

I am using the Embedding layer to process categorical features, but the documentation says it works only on positive integer inputs of a fixed range.

So is there any out-of-the-box layer for embedding categorical features which are represented as strings, let’s say when the values in the city column are ['London', 'Berlin', 'Paris'] etc.?

Two ways I can think of are:

1. One-hot encode them into three-dimensional binary vectors. But this defeats the purpose of the embedding layer.
2. Encode each city as an integer, then use the embedding layer.

Moreover, either of them seems to make the code a bit clunky, as I have to repeat the integer encoding for both training and serving, with some possibility of introducing bugs.
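To make the second option concrete, here is a rough sketch of the kind of manual encoding I mean. The names `CITY_TO_INDEX`, `OOV_INDEX` and `encode_cities` are made up for illustration, and sending unseen cities to one extra index is just one possible choice.

```python
# Hypothetical names for illustration; the vocabulary is the example from the question.
CITY_VOCABULARY = ["London", "Berlin", "Paris"]

# Known cities get the zero-based ids 0..2.
CITY_TO_INDEX = {city: index for index, city in enumerate(CITY_VOCABULARY)}

# Assumption: unseen cities share one extra id placed after the known ones.
OOV_INDEX = len(CITY_VOCABULARY)


def encode_cities(cities):
    """Map city strings to integer ids, sending unknown cities to OOV_INDEX."""
    return [CITY_TO_INDEX.get(city, OOV_INDEX) for city in cities]


encoded = encode_cities(["Paris", "London", "Madrid"])
print(encoded)
```

The same mapping has to be shipped with the model and applied identically at serving time, which is the part that feels error-prone.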

Is there any other layer that will allow me to embed string categorical features?

Also, on a related note, is it possible to embed multiple categorical feature columns in one vector space, taking the co-occurrence into account? I am talking about a scenario where a sample of {'city':'London', 'brand':'Apple'} will get embedded in the same vector space as another {'city':'Paris', 'brand':'Samsung'}.

Note that I am aware of word2vec embeddings in natural language processing, which can provide similar functionality if I pass the strings through something like a GloVe model. But here I am seeking a way to featurise the categories of different strings, and do not want to model the meanings of the words, as GloVe will probably do.

rcauvin · May 5, 2022, 6:44pm

Keras provides tf.keras.layers.StringLookup to handle encoding and decoding for you. You provide a “vocabulary” of the possible string values. E.g.:

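import tensorflow as tf

# With mask_token=None and the default num_oov_indices=1, index 0 is the OOV slot.
city_lookup = tf.keras.layers.StringLookup(vocabulary=city_vocabulary, mask_token=None)
city_embedding = tf.keras.Sequential([
    city_lookup,
    # + 1 makes room for the out-of-vocabulary index.
    tf.keras.layers.Embedding(len(city_vocabulary) + 1, embedding_dimension),
], name="city_embedding")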

The extra slot in the embedding (the + 1 above) is there to deal with values that aren’t in the predefined vocabulary, which StringLookup maps to a reserved out-of-vocabulary index by default.

Later, when you “invoke” the embedding (during training or inference), you’ll do something like:

city = features["city"]
city_embedding_output = city_embedding(city)

And you’ll typically tf.concat the output of that and other embeddings before invoking one or more dense layers.
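As a minimal sketch of that last step, assuming a second "brand" feature like the one in the question: the vocabularies, the embedding size of 4 and the Dense width of 8 are illustrative values only, and the layers are called directly rather than through a Sequential to keep the example short.

```python
import tensorflow as tf

# Illustrative values only; none of these come from a real dataset.
city_vocabulary = ["London", "Berlin", "Paris"]
brand_vocabulary = ["Apple", "Samsung"]
embedding_dimension = 4

# One lookup and one embedding table per feature; + 1 leaves room for the OOV index.
city_lookup = tf.keras.layers.StringLookup(vocabulary=city_vocabulary, mask_token=None)
city_table = tf.keras.layers.Embedding(len(city_vocabulary) + 1, embedding_dimension)
brand_lookup = tf.keras.layers.StringLookup(vocabulary=brand_vocabulary, mask_token=None)
brand_table = tf.keras.layers.Embedding(len(brand_vocabulary) + 1, embedding_dimension)

# A dense layer applied to the concatenated embeddings.
dense = tf.keras.layers.Dense(8, activation="relu")

features = {
    "city": tf.constant(["London", "Paris"]),
    "brand": tf.constant(["Apple", "Samsung"]),
}

city_vectors = city_table(city_lookup(features["city"]))
brand_vectors = brand_table(brand_lookup(features["brand"]))

# Concatenate along the feature axis: (batch, 4) + (batch, 4) -> (batch, 8).
combined = tf.concat([city_vectors, brand_vectors], axis=-1)
joint = dense(combined)
print(combined.shape, joint.shape)
```

The output of the dense layer is a single vector per sample that depends on both features, so when it is trained end to end it is one way to get a joint representation of a (city, brand) pair.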

barmanroys · July 28, 2022, 7:21am

Thanks for the response, the StringLookup almost serves my purpose, but with one quirk.

It seems to map the strings to one-indexed integers, i.e. London, Paris, Berlin will get mapped to 1-3, as opposed to the 0-2 expected by the following embedding layer. I can subtract 1 in my code before passing it to the embedding layer, or else I have to set the vocabulary size of the embedding layer to one more than the actual vocab size. Again, I am not sure either of them is a clean way to do it.

Can I ask if there is a way to make the StringLookup output zero-indexed? Also, it seems a bit weird for the StringLookup output to be one-indexed, defying the usual conventions in Python. Is there a specific reason for it?

rcauvin · August 1, 2022, 10:17pm

Are you setting the mask_token to something other than None when you create the StringLookup?

From the StringLookup documentation of the mask_token parameter:

A token that represents masked inputs. When output_mode is “int”, the token is included in vocabulary and mapped to index 0. In other output modes, the token will not appear in the vocabulary and instances of the mask token in the input will be dropped. If set to None, no mask term will be added. Defaults to None.
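For illustration, here is a small sketch of how the index assignment changes with the layer's settings, using the three cities from the question. Even with mask_token=None, the default num_oov_indices=1 reserves index 0 for out-of-vocabulary strings, which would explain a one-indexed output; passing num_oov_indices=0 removes that slot.

```python
import tensorflow as tf

city_vocabulary = ["London", "Berlin", "Paris"]
cities = tf.constant(["London", "Berlin", "Paris"])

# Defaults: no mask token, one OOV index, so index 0 is reserved for unknown strings.
default_lookup = tf.keras.layers.StringLookup(
    vocabulary=city_vocabulary, mask_token=None
)

# No mask token and no OOV index, so the vocabulary starts at index 0.
zero_based_lookup = tf.keras.layers.StringLookup(
    vocabulary=city_vocabulary, mask_token=None, num_oov_indices=0
)

default_ids = default_lookup(cities).numpy().tolist()
zero_based_ids = zero_based_lookup(cities).numpy().tolist()
print(default_ids)     # [1, 2, 3]
print(zero_based_ids)  # [0, 1, 2]
```

The trade-off is that with num_oov_indices=0 a string outside the vocabulary causes an error when the layer is called, so this only suits cases where unseen values cannot occur at serving time.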

It might help to post the code.