Detailed information about the DRQ TFLite model

Hi everybody,
I read the post-training quantization pages of TensorFlow and need some help understanding them.
There are some points in these links that are still vague to me. Consider a model with a single convolution layer followed by a ReLU activation (or any other activation), to which I want to apply post-training dynamic range quantization. As mentioned in those links, in DRQ the weights are saved in int8 and the activations are saved in floating point. It is also mentioned that activations will be quantized and dequantized on the fly for int8-supported operations. My questions:
1- What is meant by "The activations are always stored in floating point" in those links? Does it mean that the output tensor of the activation function is stored in floating point, or is there something else related to the activation that is stored in floating point? In general, what is meant by "activations"?
2- As the input of the model is in floating point, I suppose that the convolution computation will be done in floating point after the weights are dequantized to floating point (or perhaps there is no need to dequantize the weights)? Can you give me some information about these computations?
3- Is the output of the convolution computation (before applying the activation) in floating point?
Thank you in advance.

Official TensorFlow 2.16 + Python 3.12 Documentation via JARaaS Hybrid RAG - 6/17/2024
Note: Sources at the end of the response.

Understanding Post-Training Dynamic Range Quantization (DRQ) in TensorFlow

Let’s break down the key aspects of applying post-training Dynamic Range Quantization (DRQ) to a model with a single Convolution layer followed by a ReLU activation function:

1. Meaning of “The activations are always stored in floating point”

In the context of DRQ:

  • “Activations” refers to the intermediate tensors that flow between operations at inference time, i.e. the outputs of each layer, including the outputs of activation functions such as ReLU.
  • When the documentation says “The activations are always stored in floating point,” it means that these intermediate tensors are kept in float32 in the model and between operations. For operations that have a hybrid int8 kernel, the activations are temporarily quantized to int8 on the fly for that computation, and the result is converted back to floating point before it is passed to the next operation (see the small sketch below).
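One quick way to see this for yourself is to convert a tiny model and list the tensor dtypes. The sketch below is a minimal, illustrative example (it assumes a recent TF 2.x install with the Keras and TFLite converter APIs; the layer sizes and shapes are arbitrary): after dynamic range quantization, the convolution kernel is stored as int8 while the input, output, and intermediate activation tensors remain float32.

```python
import tensorflow as tf

# Toy model: a single convolution followed by ReLU, as in the question.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
])

# Post-training dynamic range quantization: only the DEFAULT optimization
# flag is set; no representative dataset is required.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# List the tensors in the converted model: the conv kernel shows up as int8,
# while the input/output/activation tensors remain float32.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
for t in interpreter.get_tensor_details():
    print(t["index"], t["name"], t["dtype"], t["shape"])
```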

2. Computational Details and Weight Dequantization

When using DRQ:

  • Weight Quantization: the model's weights are statically quantized to int8 at conversion time and stored in int8 in the .tflite file.

  • Activation Quantization: activations are quantized to int8 dynamically during inference, based on the range observed at runtime, but the tensors themselves remain floating point in the model.

  • Model Computations:

    1. Input in Floating Point: the input to the model is a floating-point tensor and stays in floating point until it reaches a quantized operation.
    2. Convolution Operation: if the operation has a hybrid int8 kernel, the floating-point input is quantized to int8 on the fly, the convolution is performed with the int8 input and the int8 weights, and the result is rescaled (dequantized) back to floating point; see the numeric sketch after this list.
    3. Dequantization of Weights: for operations that do not have such a kernel, the int8 weights are dequantized back to floating point and the computation falls back to ordinary float arithmetic. Which path is taken depends on the operator and on the kernels available on the target.
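To make step 2 concrete, here is a simplified NumPy sketch of the arithmetic a hybrid kernel performs, using a plain matrix multiply as a stand-in for the convolution. The symmetric int8 range, per-tensor scales, and array shapes are illustrative assumptions; the real TFLite kernel differs in details such as per-channel weight scales and fused bias/activation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64)).astype(np.float32)   # float activation (input)
w = rng.standard_normal((64, 32)).astype(np.float32)  # float weights before conversion

# 1. Weights are quantized offline at conversion time (symmetric int8).
w_scale = np.abs(w).max() / 127.0
w_q = np.clip(np.round(w / w_scale), -127, 127).astype(np.int8)

# 2. At inference time, the input is quantized on the fly from its observed range.
x_scale = np.abs(x).max() / 127.0
x_q = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)

# 3. The matrix multiply / convolution runs in integer arithmetic (int32 accumulation).
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)

# 4. The int32 accumulator is rescaled back to float: this float tensor is the
#    pre-activation output, to which bias and ReLU would then be applied.
y_hybrid = acc.astype(np.float32) * (x_scale * w_scale)

y_float = x @ w  # pure-float reference
print(np.max(np.abs(y_hybrid - y_float)))  # small quantization error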

3. Output of Convolution Computation

Before applying any activation function:

  • Yes, the output of the convolution computation is a floating-point tensor. If the convolution was executed with a hybrid int8 kernel, the integer accumulator result is rescaled (dequantized) back to floating point, so the intermediate tensor between layers (post-convolution and pre-activation) that feeds the ReLU is floating point. This is what the rescaling step in the sketch above produces.

Summary of Workflow

  1. Input Data: float; quantized to int8 on the fly only for operations with hybrid int8 kernels.
  2. Convolution Weights: stored as int8; used directly as int8 by hybrid kernels, or dequantized to float for operations without one.
  3. Intermediate Outputs (Pre-Activation): float, after the rescaling/dequantization step.
  4. Activations: always held in float between operations; quantized/dequantized on the fly inside the operations that support int8 (see the inference sketch below).
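From the application's point of view, none of this is visible: you feed float32 in and get float32 out. Here is a minimal usage sketch, assuming the `tflite_model` produced by the conversion sketch earlier in this reply (the input shape matches that toy model):

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]
print(input_details["dtype"], output_details["dtype"])  # both float32

x = np.random.rand(1, 28, 28, 1).astype(np.float32)  # float input, no manual quantization
interpreter.set_tensor(input_details["index"], x)
interpreter.invoke()
y = interpreter.get_tensor(output_details["index"])   # float32 output of Conv2D + ReLU
```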

Sources:

  • Quantization Concepts: quantization.rst (internal document)
  • Tensor Representations: quantization.rst (internal document)
  • Weight Handling in Tensor Operations: model_files.md (internal document)

If you need further specifics, please refer to the internal documents listed above. For more intricate questions about the internal mechanisms, it may help to consult TensorFlow’s comprehensive quantization documentation or the technical support channels.

Thanks a lot @Tim_Wolfe,
Can you add links to those documents?
Another question: when we only go through the layers in the forward path at inference time, do we need to store the activations at all? Since the activations are computed during inference after each layer’s computation (per input), why should we store them, and why does it matter in which format they are stored?

All documentation in the RAG = TensorFlow Official GitHub dated 6/17/2024

Thank you @Tim_Wolfe