This is a very hard question to answer. There are many BERT models of very different sizes and complexity, and this affects how fast a model can run inference. The machine you run inference on also affects the time. Do you have a GPU to help? Which one?
So it’s not easy to answer that.
Igusm, thanks for the info. I do have GPUs on the servers, but in development I prefer to run models on my Mac, CPU only, on a small set of data. Do you have suggestions on how to use CPU/GPU in the development stage?
Another thing I noticed, related to my question above, is that when I run the models on the same machine with the same data, the time can differ significantly. On one run each prediction may take 30 or 40 ms on average, and on another run it may take 20 ms. Also, for texts of similar length, one prediction's time can differ quite a lot from another's. The prediction times seem unstable. Is this normal?
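To make run-to-run comparisons like the one above less noisy, here is a minimal timing sketch, not a definitive benchmark. It assumes the model is wrapped in a callable that takes one text. The names `benchmark`, `summarize`, and `fake_predict` are hypothetical, and `fake_predict` is only a stand-in for a real BERT call. It does a few untimed warm-up calls first, then reports the median and a rough 95th percentile instead of the mean, since a handful of slow calls can skew an average.

```python
import statistics
import time


def benchmark(predict, inputs, warmup=5, repeats=20):
    """Return per-call latencies in milliseconds for predict(x)."""
    # Untimed warm-up calls, so one-time costs (lazy initialization,
    # caches) are not counted in the measurements.
    for i in range(warmup):
        predict(inputs[i % len(inputs)])

    latencies_ms = []
    for _ in range(repeats):
        for x in inputs:
            start = time.perf_counter()
            predict(x)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Summarize latencies with statistics that are robust to outliers."""
    ordered = sorted(latencies_ms)
    # Simple index-based approximation of the 95th percentile.
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "min_ms": ordered[0],
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


def fake_predict(text):
    # Hypothetical stand-in for a real model call; replace with your own.
    return sum(ord(c) for c in text)


if __name__ == "__main__":
    texts = ["a short sentence", "another sentence of similar length"]
    print(summarize(benchmark(fake_predict, texts)))
```

If the median is stable across runs while the maximum jumps around, the variation is more likely coming from the machine (other processes, thermal throttling, scheduling) than from the model itself.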