My question is about quantizing activations to e.g. int8 using a representative dataset. Assume a sigmoid activation function in the following.
What confuses me: which parameters are scaled and shifted, i.e. quantized?
I understand that we monitor the typical output activations to determine their min and max range. What do we do with these values? My intuition says that we need to scale the entire activation function, because in the case of a sigmoid, a quantized input (e.g. some high int8 value caused by quantized weights) would always lead to saturated outputs (near 1 or 0).
After going through a lot of sources, I still do not understand what exactly is quantized in the activations. Assume a sigmoid activation function. Using a representative dataset and the unquantized float32 weights, I can observe a set of activation outputs and determine their min/max range. But what do I do with the scale factor derived from that range afterwards? My intuition says I need to quantize the entire activation function somehow (or dequantize the activation input), because a regular sigmoid can't deal with quantized (e.g. int8) input values directly and would always end up in its saturated region. In other words, I don't really get how the quantization of activations works. Please clear up my confusion.
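To make my current understanding concrete, here is a minimal sketch of what I think happens. It assumes the usual affine scheme `q = round(x / scale) + zero_point`; the helper names (`affine_params`, `quantize`, `dequantize`) are my own and not taken from any library, and the evenly spaced calibration inputs are only a stand-in for a real representative dataset. I am not claiming this is how any particular framework implements it.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine_params(lo, hi, qmin=-128, qmax=127):
    """Scale and zero point mapping the float range [lo, hi] onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 inside the representable range
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the int8 range

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

# "Calibration": float32-style pre-activations and the sigmoid outputs they produce.
# (Hypothetical stand-in for activations observed on a representative dataset.)
pre_acts = [-6.0 + 12.0 * i / 999 for i in range(1000)]
outputs = [sigmoid(x) for x in pre_acts]

# One (scale, zero_point) pair per tensor: sigmoid input and sigmoid output.
in_scale, in_zp = affine_params(min(pre_acts), max(pre_acts))
out_scale, out_zp = affine_params(min(outputs), max(outputs))

x = 1.5
q_in = quantize(x, in_scale, in_zp)

# My worry: feeding the raw int8 code into a float sigmoid saturates.
naive = sigmoid(float(q_in))

# What I suspect is needed: interpret the code through its scale first,
# then store the result using the output tensor's scale and zero point.
y = sigmoid(dequantize(q_in, in_scale, in_zp))
q_out = quantize(y, out_scale, out_zp)
y_hat = dequantize(q_out, out_scale, out_zp)

print("float sigmoid :", sigmoid(x))
print("naive on int8 :", naive)
print("dequant path  :", y_hat)
```

In this sketch the sigmoid itself is untouched; only the tensors going in and coming out carry a scale and zero point. Is that the right picture, or is the function itself replaced by an integer approximation (e.g. a lookup table over the 256 possible input codes)?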