Legitimate method to run quantized model on server?

hi guys,

I’m trying to optimize my model with 8-bit integer quantization for performance.
From what I learned from Post-training quantization  |  TensorFlow Model Optimization,
the only way for TF to run an integer-quantized model is through the TFLite runtime.
I’m trying to deploy the service on the cloud with a powerful CPU server and a number of hardware accelerators.
Right now we are running with the native TF runtime and TF Serving, and it’s working well.
It sounds like TFLite is not designed for this scenario, and some articles say the TFLite CPU kernel implementations are not the best fit for servers.
Please let me know what the legitimate method is to run a quantized model on the cloud.

Thank you very much.

Kevin


@Hengwen Welcome to the TensorFlow Forum!

You’re correct that TFLite might not be the best option for running your quantized model on a powerful cloud CPU server with hardware accelerators. Here are some legitimate methods for your scenario:

  • Native TensorFlow with quantization-aware training (QAT) - Train your model with QAT (TensorFlow provides APIs for this through the Model Optimization Toolkit), which optimizes the model for quantization during training. This can lead to better performance and accuracy than post-training quantization, and you can still serve the result with TensorFlow Serving, leveraging the CPU optimizations TensorFlow itself offers. A minimal sketch is shown after this list.

  • XLA (Accelerated Linear Algebra) - TensorFlow’s just-in-time (JIT) compiler can significantly improve performance on various hardware, including CPUs and GPUs. XLA can also work with quantized models and might provide better performance than TFLite on your target hardware; the snippet after this list shows one way to enable it.
    Consider exploring XLA libraries and tools to optimize your model for the specific accelerators available on your cloud platform.

  • Hybrid approach - Use different frameworks for different parts of your inference pipeline. For example, you could keep TFLite for mobile deployments but switch to another runtime, such as XLA-compiled TensorFlow or OpenVINO, for cloud CPU inference with hardware accelerators.
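Here is a minimal QAT sketch, assuming a Keras workflow; the model architecture, the dataset names (train_ds, val_ds), and the export path are placeholders rather than anything from your setup:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder baseline model -- swap in your own architecture.
base_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Wrap the model so fake-quantization ops are inserted; the model then
# learns int8-friendly weight/activation ranges during fine-tuning.
qat_model = tfmot.quantization.keras.quantize_model(base_model)

qat_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Fine-tune as usual (train_ds / val_ds are placeholders).
# qat_model.fit(train_ds, epochs=3, validation_data=val_ds)

# Export a SavedModel that TensorFlow Serving can load directly.
# qat_model.save("/models/my_model/1")
```

And a rough sketch of opting into XLA from Python; the `predict` function and the input shape are purely illustrative:

```python
import tensorflow as tf

@tf.function(jit_compile=True)  # ask XLA to JIT-compile this function
def predict(x):
    return qat_model(x, training=False)

# The first call triggers compilation; later calls reuse the compiled kernels.
# outputs = predict(tf.random.uniform([32, 784]))
```

For TensorFlow Serving itself, XLA auto-clustering can typically be turned on with the TF_XLA_FLAGS=--tf_xla_auto_jit=2 environment variable, but double-check the XLA docs for the TF version you deploy.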

Let us know if any of the above approaches solves the issue.