Add quantization output configuration for QAT

Hello, I have the following questions related to Quantization Aware Training (QAT):

In the documentation for QAT configs, tfmot.quantization.keras.QuantizeConfig | TensorFlow Model Optimization, it says:

"In most cases, a layer outputs only a single tensor so it should only have one quantizer."

In the case where I have multiple outputs in one layer, should I return a list where every element is the respective quantizer for each output, in order?
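To make the question concrete, here is a minimal sketch of what I mean; the class name and quantizer parameters are just illustrative, not taken from the docs:

```python
import tensorflow_model_optimization as tfmot

MovingAverageQuantizer = tfmot.quantization.keras.quantizers.MovingAverageQuantizer

class OutputOnlyQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    """Illustrative config that only quantizes layer outputs."""

    def get_weights_and_quantizers(self, layer):
        return []  # no weight quantization in this sketch

    def get_activations_and_quantizers(self, layer):
        return []

    def set_quantize_weights(self, layer, quantize_weights):
        pass

    def set_quantize_activations(self, layer, quantize_activations):
        pass

    def get_output_quantizers(self, layer):
        # Single-output case as described in the docs: a list with one quantizer.
        # For a multi-output layer, would I simply return one quantizer per
        # output, in the same order as the outputs?
        return [MovingAverageQuantizer(
            num_bits=8, per_axis=False, symmetric=False, narrow_range=False)]

    def get_config(self):
        return {}
```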

  1. How can I simulate exactly the way TFLite performs quantization during training? There are different options here (LastValue, AllValues and MovingAverage): Module: tfmot.quantization.keras.quantizers | TensorFlow Model Optimization

In the case of convolution, dense, batch normalization, or simple max layers, which of these techniques does TFLite use to quantize the respective layer?

How do you configure this properly? Why would I need MovingAverage for the activations and LastValue for the weights? Is there any documentation I can read?
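For reference, this is how I currently understand the per-layer setup for a Dense layer: a minimal sketch with LastValueQuantizer on the kernel and MovingAverageQuantizer on the activation. The parameter values are my assumption, not confirmed TFLite behaviour:

```python
import tensorflow_model_optimization as tfmot

LastValueQuantizer = tfmot.quantization.keras.quantizers.LastValueQuantizer
MovingAverageQuantizer = tfmot.quantization.keras.quantizers.MovingAverageQuantizer

class MyDenseQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    """Sketch: LastValue for the kernel, MovingAverage for the activation."""

    def get_weights_and_quantizers(self, layer):
        # Weights: last-value range, symmetric (assumed values).
        return [(layer.kernel, LastValueQuantizer(
            num_bits=8, per_axis=False, symmetric=True, narrow_range=True))]

    def get_activations_and_quantizers(self, layer):
        # Activations: moving-average range, asymmetric (assumed values).
        return [(layer.activation, MovingAverageQuantizer(
            num_bits=8, per_axis=False, symmetric=False, narrow_range=False))]

    def set_quantize_weights(self, layer, quantize_weights):
        layer.kernel = quantize_weights[0]

    def set_quantize_activations(self, layer, quantize_activations):
        layer.activation = quantize_activations[0]

    def get_output_quantizers(self, layer):
        return []  # output covered by the quantized activation

    def get_config(self):
        return {}
```

I would attach it with quantize_annotate_layer(tf.keras.layers.Dense(...), quantize_config=MyDenseQuantizeConfig()), but I don't know whether this choice of quantizers actually matches what TFLite does per layer type.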

  2. Does it make sense to annotate certain operations?

For example, suppose I have element-wise max activations instead of ReLU. Does it make sense to quantize the output of these? The output of the previous layer is probably already int8, so if there is no posterior transformation of the values, only a slice, selection, or max, do I really need to quantize these ops? What about max pooling or paddings?
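To illustrate what I mean by annotating only certain operations, here is a sketch along the lines of the comprehensive guide: annotate only the layers that transform values and leave max pooling / padding alone. Whether that layer choice is right is exactly what I am unsure about:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding='same', input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),   # pure selection, no value transform
    tf.keras.layers.ZeroPadding2D(),  # pure padding, no value transform
    tf.keras.layers.Conv2D(32, 3),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])

def annotate(layer):
    # Annotate only layers with weights; skip pooling/padding (the question).
    if isinstance(layer, (tf.keras.layers.Conv2D, tf.keras.layers.Dense)):
        return tfmot.quantization.keras.quantize_annotate_layer(layer)
    return layer

annotated = tf.keras.models.clone_model(base_model, clone_function=annotate)
qat_model = tfmot.quantization.keras.quantize_apply(annotated)
```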

@Jaehong_Kim Could you answer this question?

  1. Currently, the QAT API doesn't support multiple output quantizers on QuantizeConfig. (We have to support it at some point.)

  2. Our default QAT API scheme follows the practice of the paper referenced at the bottom of this doc: Quantization aware training | TensorFlow Model Optimization
    But the QAT API supports custom quantization schemes, so you can change the quantizer by creating a new scheme or QuantizeConfig if you want to.

The QAT default scheme is only for TFLite int8 quantization. (But the API supports other deployment logic by implementing a scheme.) This default scheme sometimes changes when we add optimized kernels to TFLite (e.g. if we added an FC-ReLU fused layer to TFLite, the scheme would reflect that and remove the fake-quant between FC and ReLU). Documentation for this underlying logic is not ready yet.
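For the int8 deployment path, the conversion looks roughly like this (a sketch, assuming qat_model is a trained model returned by quantize_apply):

```python
import tensorflow as tf

# qat_model: trained model returned by tfmot.quantization.keras.quantize_apply.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_int8_model = converter.convert()

with open('model_int8.tflite', 'wb') as f:
    f.write(tflite_int8_model)
```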

  3. It also depends on the TFLite converter and kernels.
    I know it's not easy to find where you should add quantizers.
    Here are some tips for working out where a quantizer should go:

A. Run PTQ and inspect the TFLite structure. In most common cases the QAT result is very similar apart from the weights; you may have to add quantizers between the PTQ ops (a sketch of this follows below).
B. Add quantizers and then run QAT:

  • B-1. If the resulting TFLite model contains a useless quantize op, remove the quantizer at that location.
  • B-2. If some op is not quantized, add more quantizers around that op.
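A rough sketch of tip A: run PTQ with a representative dataset, then list which tensors ended up quantized so you can compare against the QAT graph. The model and calibration data here are placeholders:

```python
import numpy as np
import tensorflow as tf

# Placeholder float model; replace with your own.
float_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])

def representative_data():
    # Placeholder calibration data; replace with real samples.
    for _ in range(100):
        yield [np.random.rand(1, 32, 32, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(float_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
ptq_model = converter.convert()

# Print each tensor's dtype and quantization params to see where the
# quantize/dequantize boundaries ended up after PTQ.
interpreter = tf.lite.Interpreter(model_content=ptq_model)
interpreter.allocate_tensors()
for t in interpreter.get_tensor_details():
    print(t['name'], t['dtype'], t['quantization'])
```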