Recommended way to run a quantized model on a server?

Hi all,

I’m trying to optimize my model with 8-bit integer quantization for performance.
From what I learned in the Post-training quantization guide (TensorFlow Model Optimization),
the only way for TF to run an integer-quantized model is through the TFLite runtime.
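For context, the 8-bit integer quantization in question maps float values to int8 with a scale and zero point (the affine scheme TFLite's post-training quantization uses). A minimal sketch of that math, with function names that are mine for illustration and not part of any TFLite API:

```python
# Affine (asymmetric) int8 quantization sketch: map [xmin, xmax] onto
# the int8 range [-128, 127] via a scale and zero point, then round.

def choose_qparams(xmin, xmax, qmin=-128, qmax=127):
    """Pick scale and zero point so [xmin, xmax] maps onto [qmin, qmax]."""
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)  # range must include 0.0
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = int(round(qmin - xmin / scale))
    return scale, max(qmin, min(qmax, zero_point))  # clamp zero point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Float -> int8, clamped to the representable range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """int8 -> float; round-trip error is at most scale / 2."""
    return (q - zero_point) * scale
```

The round-trip error for any in-range value is bounded by half the scale, which is why a well-chosen calibration range matters for accuracy.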
I’m trying to deploy the service in the cloud on a server with a powerful CPU and several hardware accelerators.
Right now we are running with the native TF runtime and TF Serving, and it’s working well.
It sounds like TFLite is not designed for this scenario; some articles also say the TFLite CPU kernel implementations are not the best fit for servers.
Please let me know the recommended way to run a quantized model in the cloud.

Thank you very much.
