Quantization: slower inference on Android phone

Hello,
I have quantized a MobileNet V2 model using 'dynamic range' and 'full integer' quantization, converted the models to TFLite (using the same code as the TensorFlow tutorial; a sketch is below), and benchmarked inference time (with benchmark_model) on the CPU and GPU of an Android phone.
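For context, a minimal sketch of the two conversions, following the standard tf.lite.TFLiteConverter post-training quantization flow. The saved-model path and the representative_dataset() generator are placeholders, not the exact code from the original run:

```python
import numpy as np
import tensorflow as tf

SAVED_MODEL_DIR = "mobilenet_v2_saved_model"  # placeholder path

def representative_dataset():
    # Placeholder calibration data (random 224x224x3 images); real
    # calibration should use a sample of the actual input data.
    for _ in range(100):
        yield [np.random.random_sample((1, 224, 224, 3)).astype(np.float32)]

# Dynamic range quantization: weights are stored as int8; activations are
# quantized dynamically at runtime where supported, otherwise kept in float.
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("mobilenet_v2_dynamic.tflite", "wb") as f:
    f.write(converter.convert())

# Full integer quantization: weights and activations are int8, calibrated
# with the representative dataset; int8 inputs/outputs are requested too.
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
with open("mobilenet_v2_full_int8.tflite", "wb") as f:
    f.write(converter.convert())
```

The resulting .tflite files can then be pushed to the device and run with the benchmark_model tool (which accepts flags such as --graph, --num_threads, and --use_gpu) to reproduce the CPU and GPU measurements below.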
My results are the following:

On GPU:

  • Dynamic range quantization is slightly faster: about 0.50 ms
  • Full integer quantization is slightly slower: about 0.20 ms

On CPU (4 threads):

  • Dynamic range quantization is much slower: about 8 ms
  • Full integer quantization is slightly slower: about 0.30 ms

Can anyone please explain why quantization is not accelerating the model in this case? Has anyone encountered the same problem with this or other models?

Thank you

#help_request

Both approaches work on the Android device and yield the expected results. Yet, to my great surprise, inference with the TensorFlow Lite interpreter takes at least twice as long as inference with the TensorFlowInterface (on the same device, of course). I checked this on several devices, and the results are similar in all cases.
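For reference, a minimal latency-measurement sketch with the TFLite Interpreter, shown here with the Python tf.lite.Interpreter for brevity; the Android Java/Kotlin Interpreter API follows the same load, allocate, set input, invoke pattern. The model filename is a placeholder:

```python
import time

import numpy as np
import tensorflow as tf

# "mobilenet_v2_dynamic.tflite" is a placeholder filename.
interpreter = tf.lite.Interpreter(model_path="mobilenet_v2_dynamic.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()

# Random input matching the model's expected shape and dtype.
dummy = np.random.random_sample(tuple(input_details[0]["shape"])).astype(
    input_details[0]["dtype"])

# Warm-up invocation, then average the latency over several runs.
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

runs = 50
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(input_details[0]["index"], dummy)
    interpreter.invoke()
print(f"Average latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")
```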


And did you ever get an answer about why the quantized models take this much time with the TF Lite Interpreter, or is there still no explanation?
Thank you :smile: