TFLite dequantization memory problem

Hello everyone.

I used the following converter settings

converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
converter.target_spec.supported_types = [tf.float16]

to convert my model to a .tflite file with fp16 weights.

As far as I know, when an fp16 tflite file is executed on a smartphone, the weights are dequantized to fp32 and the computations run in fp32.

However, when I analyzed the operation order of this tflite file, I found that all of the DEQUANTIZE ops run up front, before any of the FULLY_CONNECTED ops, which increases the peak memory usage.

The operation order looks like this:
(T#1, T#2, T#3, T#4 are the weight tensors of the dense layers)

Subgraph#0 main(T#0) → [T#12]
Op#0 DEQUANTIZE(T#1) → [T#5]
Op#1 DEQUANTIZE(T#2) → [T#6]
Op#2 DEQUANTIZE(T#3) → [T#7]
Op#3 DEQUANTIZE(T#4) → [T#8]
Op#4 FULLY_CONNECTED(T#0, T#5, T#-1) → [T#9]
Op#5 FULLY_CONNECTED(T#9, T#6, T#-1) → [T#10]
Op#6 FULLY_CONNECTED(T#10, T#7, T#-1) → [T#11]
Op#7 FULLY_CONNECTED(T#11, T#8, T#-1) → [T#12]
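
For reference, an operator listing in this format can be obtained with TFLite's model analyzer; a minimal sketch, where the file name is just a placeholder:

import tensorflow as tf

# Print the subgraphs, ops, and tensors of a converted model.
tf.lite.experimental.Analyzer.analyze(model_path="model_fp16.tflite")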

Is there a way to dequantize each layer's weights just before that layer runs, so that the peak memory is reduced?

Thank you.

Hi @the_q_u ,

You’re absolutely right. Early dequantization in your TFLite model can lead to increased peak memory usage, even though the model itself is stored in a lower precision format (fp16).

While there’s no direct way to force dequantization right before each layer in TFLite currently, there are a few approaches you can explore:

Quantization-Aware Training (QAT)

This technique simulates quantization during training, so the model learns weights that tolerate lower precision. It typically targets int8, which lets the converted model keep its weights quantized at inference time instead of dequantizing them to fp32. Consider the quantization-aware training API in the TensorFlow Model Optimization Toolkit (the tensorflow_model_optimization package) to see whether it improves peak memory usage; a sketch follows.
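
A minimal QAT sketch, assuming the tensorflow_model_optimization package is installed; the toy model and random data are placeholders for your own model and training set:

import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(10),
])

# Wrap the model so quantization is simulated (fake-quantized) during training.
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
q_aware_model.fit(np.random.rand(256, 64).astype("float32"),
                  np.random.randint(0, 10, 256), epochs=1)

# Convert the QAT model; weights end up in int8 rather than fp16.
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()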

Layer-by-Layer Quantization (if supported by your framework)

Some frameworks offer selective, per-layer quantization. In TensorFlow this means annotating individual layers for quantization while leaving the rest in float. Note that this controls which layers get quantized rather than when the runtime dequantizes them, so it may not change where the DEQUANTIZE ops sit in the execution order, but it is worth experimenting with (see the sketch below).
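
A sketch of selective quantization with the Model Optimization Toolkit, assuming a Keras model; the layer name used to pick what gets quantized is purely illustrative:

import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,),
                          name="dense_to_quantize"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])

def annotate(layer):
    # Mark only the chosen layer for quantization; everything else stays float.
    if layer.name == "dense_to_quantize":
        return tfmot.quantization.keras.quantize_annotate_layer(layer)
    return layer

annotated = tf.keras.models.clone_model(base_model, clone_function=annotate)
selective_model = tfmot.quantization.keras.quantize_apply(annotated)
selective_model.summary()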

Model Pruning

This technique removes low-importance connections from the model, reducing the number of parameters and computations and potentially lowering memory usage. The TensorFlow Model Optimization Toolkit provides pruning APIs (see the sketch below).
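
A minimal pruning sketch with the Model Optimization Toolkit; the constant 50% sparsity and the toy model and data are only examples:

import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(10),
])

# Prune 50% of the weights in every layer.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(0.5, begin_step=0),
)
pruned_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# UpdatePruningStep keeps the pruning masks in sync during training.
pruned_model.fit(np.random.rand(256, 64).astype("float32"),
                 np.random.randint(0, 10, 256),
                 epochs=1,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before converting to TFLite.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)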

Experiment with different quantization strategies

TFLite offers several quantization strategies besides fp16, including dynamic-range quantization, full-integer (int8) post-training quantization, and quantization-aware training. Full-integer quantization in particular keeps the weights in int8 at inference time, which should avoid the fp32 dequantized copies of the weights you are seeing; a sketch follows.
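
A sketch of full-integer post-training quantization, assuming a Keras model; the toy model, the random representative dataset, and the output file name are placeholders:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(10),
])

def representative_dataset():
    # Yield a few input batches so activation ranges can be calibrated.
    for _ in range(100):
        yield [np.random.rand(1, 64).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)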

The effectiveness of these approaches depends on your specific model and hardware. Evaluating and comparing different options will help you find the best solution for your scenario.

These are some methods I found; they might work for you.

Thank you!