I am using post-training quantization to create 8-bit quantized TensorFlow models. The documentation is simple to follow, and I am happy with the results I get.
I want to learn more about what post-training quantization actually does under the hood. I have read https://arxiv.org/pdf/1712.05877.pdf, which I found as a reference for the quantization scheme. My impression is that it mostly describes quantization-aware training, but I could be wrong.
Do you have any recommendations for where I can learn more about post-training quantization? I assume the source code is complex to read. I need to be able to understand the algorithm as part of my master's thesis.
My current, limited understanding is that parameters such as weights are fixed, so they can be quantized directly from their min and max values. Activations and the model inputs are dynamic, so their min/max values need to be estimated by running inference on a representative dataset.
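To make that understanding concrete, here is a minimal pure-Python sketch of how I picture it. It assumes the affine scheme r = S(q - Z) from the paper linked above and an unsigned 8-bit range of 0 to 255. The function names (`calibrate`, `quant_params`, `quantize`, `dequantize`) are my own for illustration and are not TensorFlow APIs; the actual converter may well differ, for example in how it chooses ranges or whether it uses per-channel scales.

```python
from typing import Iterable, List, Sequence, Tuple


def calibrate(batches: Iterable[Sequence[float]]) -> Tuple[float, float]:
    """Track the running min/max of values seen across representative batches."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi


def quant_params(rmin: float, rmax: float,
                 qmin: int = 0, qmax: int = 255) -> Tuple[float, int]:
    """Derive (scale, zero_point) for the mapping r = scale * (q - zero_point)."""
    # Widen the range to contain 0.0 so that real zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    if rmax == rmin:
        # Degenerate all-zero range: any positive scale works.
        return 1.0, qmin
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values: Sequence[float], scale: float, zero_point: int,
             qmin: int = 0, qmax: int = 255) -> List[int]:
    """Map real values to integers, clamping to the representable range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point))
            for v in values]


def dequantize(q: Sequence[int], scale: float, zero_point: int) -> List[float]:
    """Map integers back to approximate real values."""
    return [scale * (x - zero_point) for x in q]


if __name__ == "__main__":
    # Weights are fixed, so their range can be read off directly.
    weights = [-1.0, 0.0, 0.5, 3.0]
    w_scale, w_zp = quant_params(min(weights), max(weights))
    print(quantize(weights, w_scale, w_zp))

    # Activations vary per input, so their range is estimated from
    # (hypothetical) activation values recorded on representative data.
    recorded_activations = [[0.1, 2.0], [-0.5, 1.0]]
    a_min, a_max = calibrate(recorded_activations)
    print(quant_params(a_min, a_max))
```

In this sketch the only difference between weights and activations is where the min/max come from: weights are inspected once, while activation ranges come from the calibration pass. Is that roughly what the converter does?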