TextVectorization significantly slower than sklearn's CountVectorizer

We are trying to replace sklearn's CountVectorizer with TextVectorization. We experimented with a text dataset of 400K sentences, using unigrams and bigrams, and used adapt to build the vocabulary.
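For reference, the sklearn baseline being replaced looks roughly like this (a minimal sketch; the sentence list stands in for the 400K-sentence dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the 400K-sentence dataset
sentences = ["the quick brown fox", "jumped over the lazy dog"]

# ngram_range=(1, 2) produces unigrams and bigrams, matching the experiment
cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(sentences)               # analogous to TextVectorization's adapt()
counts = cv.transform(sentences)  # returns a scipy sparse matrix

print(counts.shape)  # (2, vocabulary size)
```

Note that transform returns a sparse matrix by default, which is part of why it stays fast with large n-gram vocabularies.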

We wrote a small Keras model as shown in the documentation with a vectorization layer and compiled the model.
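The setup was roughly the following (a sketch of the documented pattern; layer arguments and the tiny dataset are illustrative):

```python
import tensorflow as tf

# Stand-in for the 400K-sentence dataset
sentences = ["the quick brown fox", "jumped over the lazy dog"]

# Default (dense) output, counting unigrams and bigrams
vectorizer = tf.keras.layers.TextVectorization(
    output_mode="count",
    ngrams=2,  # unigrams and bigrams
)
vectorizer.adapt(sentences)

# Wrap the layer in a small Keras model, as in the docs
inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
outputs = vectorizer(inputs)
model = tf.keras.Model(inputs, outputs)
model.compile()

counts = model.predict([[s] for s in sentences])
print(counts.shape)  # (batch, vocabulary size)
```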

predict is significantly slower (roughly 500 times, on a desktop with no GPUs) than sklearn's transform. Any suggestions?

Thank you!

I would guess this is an issue with output representation. By default, TextVectorization produces a dense Tensor output. This is simpler for small examples and guarantees the layer's output can be used with any other Keras layer. But with output_mode='count' (or 'multi_hot' or 'tf_idf') and a large vocabulary (the vocabulary is probably quite large when bigrams are included), this dense output gets very inefficient very quickly.

We recently added an option for sparse output from the TextVectorization layer. It is currently available only in tf-nightly, and will be in the stable TensorFlow 2.7 release.

tf.keras.layers.TextVectorization(output_mode='count', sparse=True)

The sparse output from a layer constructed like this can then be fed into a tf.keras.layers.Dense, and scales up much more effectively.
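Put together, that looks roughly like this (a sketch; requires tf-nightly or TF 2.7+, and the Dense layer's width is arbitrary here):

```python
import tensorflow as tf

sentences = ["the quick brown fox", "jumped over the lazy dog"]

# Sparse count output avoids materializing a huge dense count matrix
vectorizer = tf.keras.layers.TextVectorization(
    output_mode="count",
    ngrams=2,
    sparse=True,  # requires tf-nightly / TF 2.7+
)
vectorizer.adapt(sentences)

inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = vectorizer(inputs)                  # emits a tf.SparseTensor
outputs = tf.keras.layers.Dense(4)(x)   # Dense accepts the sparse input
model = tf.keras.Model(inputs, outputs)

preds = model.predict([[s] for s in sentences])
print(preds.shape)  # (2, 4)
```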


Thanks, Mathew. Setting sparse=True brings it on par with sklearn.
