Quantization: slower inference on Android phone

I have quantized a MobileNetV2 model using the 'dynamic range' and 'full integer' quantization techniques, converted the models to TFLite (using the same code as the TensorFlow tutorial), and benchmarked the inference time (with benchmark_model) on the CPU and GPU of an Android phone.
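The exact conversion code isn't shown in the post; below is a minimal sketch of the two conversion paths described, following the TFLite post-training quantization guide. A tiny stand-in Keras model is used here so the snippet is self-contained; in the original setup this would be `tf.keras.applications.MobileNetV2`, and the representative dataset would be real images rather than random arrays.

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model; the original post used MobileNetV2.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

# --- Dynamic range quantization: int8 weights, float activations ---
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_tflite = converter.convert()

# --- Full integer quantization: needs a representative dataset
#     to calibrate activation ranges (random data used here only
#     to keep the sketch runnable) ---
def representative_dataset():
    for _ in range(10):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
full_int_tflite = converter.convert()
```

Both `convert()` calls return the serialized flatbuffer as bytes, which you would write to a `.tflite` file and push to the device for benchmarking.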
My results are as follows:

On GPU:

  • Dynamic range quantization is slightly faster (by 0.50 ms)
  • Full integer quantization is slightly slower (by 0.20 ms)

On CPU (4 threads):

  • Dynamic range quantization is much slower (by 8 ms)
  • Full integer quantization is slightly slower (by 0.30 ms)

Can anyone please tell me why quantization is not accelerating the model in this case? Has anyone encountered the same problem with this or other models?

Thank you


Both approaches work on the Android device and yield the expected results. Yet, to my great surprise, inference with the TensorFlow Lite interpreter takes at least twice as long as inference with the TensorFlowInterface (on the same device, of course). I checked this on various devices, and the results are similar in all cases.
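For reference, this is roughly how the TFLite interpreter side of such a comparison is timed from Python (on-device benchmarking would use the same interpreter through the Java/C++ API). A tiny model is built and converted inline purely so the snippet is runnable; substitute your own `.tflite` file in practice. A warm-up invocation is done first, since the first `invoke()` includes one-time setup cost.

```python
import time

import numpy as np
import tensorflow as tf

# Build and convert a tiny model inline so the snippet is self-contained;
# in practice, load your .tflite file with model_path= instead.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_model, num_threads=4)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
x = np.random.rand(1, 224, 224, 3).astype(np.float32)

# Warm-up run, then time repeated invocations.
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
start = time.perf_counter()
runs = 20
for _ in range(runs):
    interpreter.invoke()
avg_ms = (time.perf_counter() - start) / runs * 1000
print(f"avg inference: {avg_ms:.2f} ms")
```

Averaging over many invocations (as benchmark_model also does) matters here, since single-run timings on a phone are noisy enough to hide sub-millisecond differences like the ones reported above.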


Did you ever get an answer about why the quantized models take this much time with the TF Lite interpreter, or is it still unexplained?
Thank you :smile: