Calculation of quantized dense layers

Hello,

I am trying to figure out how exactly quantized dense layers are calculated. I found this block detailing the different computations needed, and I wanted to reimplement it with the parameters extracted from a TFLite model. However, I am not getting the expected results. Here is my code:

import tensorflow as tf
from tensorflow import keras
import numpy as np

i = 16
j = 48
k = 32

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(i, j), batch_size=1))
model.add(tf.keras.layers.Dense(k, activation=None, use_bias=False))

def representative_data_gen():
    dataset = [
        np.array(
            np.random.randint(-10, 10, size=(i, j)),
            dtype=np.float32,
        )
        for s in range(10)
    ]
    for input_value in dataset:
        # Model has only one input, so each data point has one element.
        yield [input_value]

converter = tf.lite.TFLiteConverter.from_keras_model(model)

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS, tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.target_spec.supported_types = [tf.int8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model_quant = converter.convert()

# Run the model with TensorFlow Lite
interpreter = tf.lite.Interpreter(model_content=tflite_model_quant)
interpreter.allocate_tensors()

tensor_details = interpreter.get_tensor_details()
out_details = interpreter.get_output_details()
input_tensor = np.squeeze(interpreter.get_tensor(input_idx)) #shape: [i, j]
weight_tensor = np.squeeze(interpreter.get_tensor(weight_idx)) #shape: [k, j]
out_tensor = np.squeeze(interpreter.get_tensor(out_idx))

reduction_dim_size = weight_tensor.shape[1]

zp_input = tensor_details[input_idx]["quantization_parameters"]["zero_points"][0]
q_input = tensor_details[input_idx]["quantization_parameters"]["scales"][0]

zp_weight = tensor_details[weight_idx]["quantization_parameters"]["zero_points"][0]
q_weight = tensor_details[weight_idx]["quantization_parameters"]["scales"][0]

zp_out = tensor_details[out_idx]["quantization_parameters"]["zero_points"][0]
q_out = tensor_details[out_idx]["quantization_parameters"]["scales"][0]

result = (zp_out +
    ((q_input*q_weight)/q_out) +
    np.matmul(input_tensor, weight_tensor.T)
    - zp_weight * np.sum(input_tensor.astype(np.int32), axis=1, keepdims=True)
    - zp_input * np.sum(weight_tensor.T.astype(np.int32), axis=0, keepdims=True)
    + reduction_dim_size*zp_input*zp_weight)
result = np.round(result, decimals=0)
result = np.clip(result, a_min=-128, a_max=127)
result = result.astype(np.int8)

When I run this, the result matrix differs from out_tensor: all values saturate. What am I doing wrong here? Is it maybe related to matmul using the transpose of the weight matrix, and should I handle that differently?
The largest values come from the zp_input * np.sum(weight_tensor.T.astype(np.int32), axis=0, keepdims=True) term, so I suspect that this term is the culprit, but I don't see what the problem would be. Does anyone have any ideas about what I am doing wrong here?
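For reference, this is the closed-form computation I am trying to expand term by term above (just my understanding of the int8 quantization scheme, so it may itself be off; the variable names match my script):

acc = np.matmul(
    input_tensor.astype(np.int32) - zp_input,
    (weight_tensor.astype(np.int32) - zp_weight).T,
)  # int32 accumulator, shape (i, k)
expected = zp_out + (q_input * q_weight / q_out) * acc
expected = np.clip(np.round(expected), a_min=-128, a_max=127).astype(np.int8)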

Hi @Necrotos,
While trying to reproduce the issue, I ran into a few undefined variables: input_idx, weight_idx, and out_idx.
Could you please provide the details of how those variables are set?
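In the meantime, here is roughly how I would look those indices up myself (just a guess at what you did; picking the weight tensor by its (k, j) int8 shape is a heuristic that only works for a toy model like this one):

input_idx = interpreter.get_input_details()[0]["index"]
out_idx = interpreter.get_output_details()[0]["index"]
weight_idx = None
for t in interpreter.get_tensor_details():
    # FULLY_CONNECTED stores its weights as an int8 tensor of shape (units, input_dim)
    if t["dtype"] == np.int8 and tuple(t["shape"]) == (k, j):
        weight_idx = t["index"]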

Thank You