I want to use the Embedding layer to process categorical features, but the documentation says it works only on positive integer inputs of a fixed range. So is there any out-of-the-box layer for embedding categorical features that are represented as strings, say when the values in a `city` column are `['London', 'Berlin', 'Paris']`, etc.?
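For concreteness, here is the constraint I am running into (a minimal sketch, assuming TensorFlow 2.x with `tf.keras`):

```python
import tensorflow as tf

# Embedding maps integer indices in [0, input_dim) to dense vectors
emb = tf.keras.layers.Embedding(input_dim=3, output_dim=4)
print(emb(tf.constant([0, 1, 2])).shape)  # (3, 4)

# Passing strings directly raises an error:
# emb(tf.constant(['London', 'Berlin', 'Paris']))
```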
Two ways I can think of are:
- One-hot encoding them into three-dimensional binary vectors. But this defeats the purpose of the embedding layer.
- Encoding each city as an integer, then using the embedding layer (sketched below).
Moreover, either of them makes the code a bit clunkier, since I have to repeat the integer encoding for training and serving, with some possibility of introducing bugs.
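Concretely, the second option looks something like this for me (a minimal sketch; `city_to_index` is a hypothetical mapping that I would have to persist and reapply at serving time):

```python
import tensorflow as tf

cities = ['London', 'Berlin', 'Paris']
# This mapping must be rebuilt (or saved and restored) identically at serving time
city_to_index = {c: i for i, c in enumerate(cities)}

emb = tf.keras.layers.Embedding(input_dim=len(cities), output_dim=8)
ids = tf.constant([city_to_index[c] for c in ['London', 'Paris']])
vectors = emb(ids)  # shape (2, 8)
```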
Is there any other layer that will allow me to embed string categorical features?
Also, relatedly: is it possible to embed multiple categorical feature columns into one vector space, taking their co-occurrence into account? I am talking about a scenario where a sample `{'city': 'London', 'brand': 'Apple'}` gets embedded in the same vector space as another sample `{'city': 'Paris', 'brand': 'Samsung'}`.
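To illustrate what I mean, here is a rough sketch of what I imagine, using a single vocabulary shared across both columns (the `vocab` dict and the mean-pooling are just hypothetical choices on my part):

```python
import tensorflow as tf

# Hypothetical shared vocabulary spanning both feature columns
vocab = {'London': 0, 'Paris': 1, 'Berlin': 2, 'Apple': 3, 'Samsung': 4}
shared_emb = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=8)

samples = [{'city': 'London', 'brand': 'Apple'},
           {'city': 'Paris', 'brand': 'Samsung'}]
ids = tf.constant([[vocab[s['city']], vocab[s['brand']]] for s in samples])  # (2, 2)
vecs = shared_emb(ids)                   # (2, 2, 8): both columns live in one space
combined = tf.reduce_mean(vecs, axis=1)  # (2, 8): one vector per sample
```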
Note that I am aware of word2vec embeddings in natural language processing, which could provide similar functionality if I passed the strings through something like a GloVe model. But here I am looking for a way to featurise the categories of arbitrary strings; I do not want to model the meanings of the words, as GloVe would.