I am having an issue when applying quantization aware training (QAT) to an FP32 model. I trained my model, converted it to a TFLite model, and evaluated it to get my metrics. Then I applied QAT, converted the result to TFLite, and evaluated it the same way. Comparing the two TFLite models, I see that the FP32 one is doing better. As far as I know, this should not happen. Can someone explain this strange behavior?

Hi @MLEnthusiastic, you mentioned that you trained the model and converted it to TFLite.

Did you apply quantization during training or during conversion? If it was applied during training, it is QAT; if it was applied during conversion to TFLite, it is post-training quantization (PTQ).

Generally, float32 models give better results because they have higher numerical precision, whereas quantization reduces the precision of the weights to a smaller number of bits. With reduced precision, the model has a limited ability to represent fine details, so the fp32 model usually gives better results than the quantized model. Thank you.

I applied QAT to the trained model, so I do not think I did PTQ in my case.

As far as I know, QAT makes the model quantization-aware after it has been trained in fp32: the QAT model remembers the metrics reached by the first trained model (fp32) and fine-tunes from there to get better performance. Am I right?

So what I did is: train the model, evaluate it, save it as TFLite, and re-evaluate it. Since I save my model in checkpoints, I loaded it from there, applied QAT to it, converted it to TFLite, and evaluated that TFLite model. I then compared it to the TFLite version of the model trained without quantization (or, more precisely, without QAT). Does this seem right?
If so, will fp32 always be better than the fine-tuned model?

I know that 8-bit quantization loses some precision compared to fp32 (32 bits), but shouldn't QAT be aware of the fp32 model and increase performance? Is this right?

Hi @MLEnthusiastic, when quantization is applied to the Keras model, the weights (which are in float32) are converted to the nearest 8-bit fixed-point numbers. This results in a smaller model and faster inference, which is valuable for low-power devices such as microcontrollers. But the performance will not increase beyond that of the base model. Converting the weights and activations from float32 to int removes precision, so there is typically a small performance drop relative to the base model.

Quantization works by first finding the minimum and maximum values of the weights in each layer of the model. These minimum and maximum values are then used to create a scale factor, which is used to convert the weights to integers.
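To make the min/max description concrete, here is a minimal NumPy sketch of one common scheme (affine, unsigned 8-bit). The function names `quantize_minmax` and `dequantize` are made up for this illustration, and this is not the exact routine the TFLite converter runs; it only shows how a scale factor derived from the min and max maps floats to integers and back.

```python
import numpy as np

def quantize_minmax(weights, num_bits=8):
    """Map a float array to unsigned integers using its min/max range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = float(weights.min()), float(weights.max())
    # Widen the range to include zero so that 0.0 maps to an exact integer.
    w_min, w_max = min(w_min, 0.0), max(w_max, 0.0)
    scale = (w_max - w_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any positive scale works
    zero_point = int(round(qmin - w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the integers."""
    return (q.astype(np.float32) - zero_point) * scale

# Stand-in for one layer's weights (random values, not from a real model).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=(4, 4)).astype(np.float32)

q, scale, zero_point = quantize_minmax(weights)
recovered = dequantize(q, scale, zero_point)
max_error = float(np.abs(weights - recovered).max())

print("scale:", scale, "zero_point:", zero_point)
print("largest round-trip error:", max_error)
```

The round-trip error is on the order of one quantization step (`scale`), which is the precision that is lost relative to float32.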

The aim of quantization is to convert weights and activations from fp32 to fp16, int16, or int8. QAT does not carry extra knowledge over from the fp32 model: during fine-tuning it simulates the reduced precision so the weights can adapt to it, and the converted model makes predictions using the converted weights and activations (fp16, int16, int8). QAT is not expected to increase performance beyond the base model; the QAT model's performance will be similar to or somewhat lower than the base model's. The size of the drop depends on the type of quantization applied. Thank you.
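As a rough illustration of why the quantized model tracks the base model instead of beating it, the sketch below runs one linear layer twice: once with float32 weights and once with the same weights rounded to a symmetric int8 grid (often called "fake quantization", which is how QAT typically simulates quantization in the forward pass). The helper `fake_quant_int8` and the random data are assumptions made for this example, not part of any library.

```python
import numpy as np

def fake_quant_int8(x):
    """Round x to a symmetric int8 grid and return it as float32 again."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127)
    return (q * scale).astype(np.float32)

rng = np.random.default_rng(1)
# Float32 weights and inputs for a single linear layer (random stand-ins).
master_weights = rng.normal(0.0, 0.3, size=(8, 3)).astype(np.float32)
inputs = rng.normal(0.0, 1.0, size=(5, 8)).astype(np.float32)

quantized_weights = fake_quant_int8(master_weights)

float_out = inputs @ master_weights     # base fp32 forward pass
qat_out = inputs @ quantized_weights    # forward pass with rounded weights

num_levels = int(np.unique(quantized_weights).size)
output_gap = float(np.abs(float_out - qat_out).max())

print("distinct weight values after rounding:", num_levels)
print("largest output difference:", output_gap)
```

The rounded weights can take at most 255 distinct values, so the second output is an approximation of the first. Fine-tuning with this rounding in the loop lets the weights compensate for it, which is why QAT usually narrows the gap to the base model.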