Is it normal for a Transformer to take 20-70 ms to predict on one sentence?

Martin · August 28, 2021, 1:19am

I have a TF 1.14 text classification problem where each text is short, since the texts come from social media. The model uses the standard Transformer architecture from the BERT GitHub repository:

At prediction time, some texts take up to 70 ms, and most take around 20 ms. I run the predictions one example at a time.

Is this normal for BERT prediction on a typical text classification task?

Martin · August 28, 2021, 3:20am

Also, when I run the same program on the same data on the same machine, the average time spent on predictions differs quite a lot between runs.

lgusm · August 29, 2021, 1:26am

Hi Martin,

This is a very hard question to answer. There are many BERT models with very different sizes and complexity, and this affects how fast the model can run inference. The machine you run inference on also affects the time. Do you have a GPU to help? Which one?

To optimize the process, I’d try the newest versions of TF and the BERT models.
You can try a more modern BERT model following this guide: Making BERT Easier with Preprocessing Models From TensorFlow Hub — The TensorFlow Blog

If the timing is too high, you can try a smaller BERT and still keep high accuracy.

Martin · August 29, 2021, 4:52pm

lgusm, thanks for the info. I do have GPUs on the servers, but in development I prefer running models on my Mac, CPU only, on a small set of data. Do you have suggestions on how to use CPU/GPU during development?

Another thing I noticed, related to my question above, is that when I run the models on the same machine with the same data, the time can differ significantly. On one run each prediction may take 30 or 40 ms on average, and on another run it may take 20 ms. Also, for texts of similar length, one prediction’s time can differ quite a lot from another’s. The prediction times seem unstable. Is this normal?

lgusm · August 30, 2021, 10:44am

It shouldn’t be unstable, but it’s a complicated metric. There are a lot of moving pieces when running the model on CPU.
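
To see how much of the variation is just measurement noise, it can help to time the calls in a controlled way: skip a few warm-up predictions, repeat the measurement, and look at the median and the tail rather than a single average. Below is a minimal sketch using only the Python standard library. The names `benchmark`, `summarize`, and `fake_predict` are made up for illustration, and `fake_predict` is only a placeholder for whatever function wraps your own model call; none of this is part of TensorFlow or BERT.

```python
import statistics
import time


def benchmark(predict_fn, examples, warmup=5, repeats=3):
    """Time predict_fn one example at a time; return latencies in ms."""
    # Warm-up calls are not timed. The first few predictions often pay
    # one-time costs (setup, memory allocation, cold caches).
    for text in examples[:warmup]:
        predict_fn(text)

    latencies_ms = []
    for _ in range(repeats):
        for text in examples:
            start = time.perf_counter()
            predict_fn(text)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Summarize with median and tail values, which resist outliers better than the mean."""
    ordered = sorted(latencies_ms)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "n": len(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


def fake_predict(text):
    # Hypothetical stand-in for the real model call; replace with your own.
    return len(text) % 2


if __name__ == "__main__":
    texts = ["short post", "another slightly longer social media post"] * 10
    print(summarize(benchmark(fake_predict, texts)))
```

If the median is stable across runs and only the maximum jumps around, the slow predictions are more likely caused by other activity on the machine than by the model itself.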

One suggestion is to use Google Colab. It gives you a free GPU, which helps a lot.