How to use batch normalization in MobileNet for a quantized TensorFlow Lite model?

Hi everyone! I would like to run a MobileNet on an embedded system (Coral USB Accelerator) to analyze audio recordings that I convert into spectrograms. I have implemented the model according to the MobileNet paper in TensorFlow.

import numpy as np
import tensorflow as tf

def SeparableConv(x, num_filters, strides, alpha=1.0):
    # Depthwise 3x3 convolution; the bias is redundant before BatchNormalization
    x = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    # Pointwise 1x1 convolution; the filter count must be an integer after applying the width multiplier alpha
    x = tf.keras.layers.Conv2D(int(num_filters * alpha), kernel_size=(1, 1), strides=strides, use_bias=False, padding='same')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    return x

def Conv(x, num_filters, kernel_size, strides=1, alpha=1.0):
    # Standard convolution followed by BatchNormalization and a single ReLU
    x = tf.keras.layers.Conv2D(int(num_filters * alpha), kernel_size=kernel_size, strides=strides, use_bias=False, padding='same')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    return x

inputs = tf.keras.layers.Input(shape=(64, 64, 1))

x = Conv(inputs, num_filters=16, kernel_size=3 , strides=2)
x = SeparableConv(x, num_filters=32, strides=1)
x = SeparableConv(x, num_filters=64, strides=2)
x = SeparableConv(x, num_filters=64, strides=1)
x = SeparableConv(x, num_filters=128, strides=2)
x = SeparableConv(x, num_filters=128, strides=1)
x = SeparableConv(x, num_filters=256, strides=1)
x = SeparableConv(x, num_filters=256, strides=2)
x = SeparableConv(x, num_filters=256, strides=1)
x = SeparableConv(x, num_filters=256, strides=1)
x = SeparableConv(x, num_filters=256, strides=1)
x = SeparableConv(x, num_filters=256, strides=1)
x = SeparableConv(x, num_filters=512, strides=2)
x = SeparableConv(x, num_filters=512, strides=1)

x = tf.keras.layers.GlobalAveragePooling2D()(x)  # output is already flat: (batch, channels)
x = tf.keras.layers.Dense(512)(x)
x = tf.keras.layers.Dropout(0.001)(x)
x = tf.keras.layers.Dense(2)(x)

outputs = tf.keras.layers.ReLU()(x)
model = tf.keras.models.Model(inputs, outputs)

To run it on the Coral USB Accelerator I need to quantize it and convert it to a TensorFlow Lite model (and then compile it with the Edge TPU compiler). But the predictions are massively worse after quantizing from 32-bit float to 8-bit int. The problem seems to be the batch normalization, tf.keras.layers.BatchNormalization(), which does not appear to be an op that TensorFlow Lite allows. But since MobileNet was designed specifically for embedded systems, and batch normalization is a fundamental part of it, I can't imagine that this is impossible on the Edge TPU. So I wanted to ask if anyone knows a workaround to get MobileNet, and specifically batch normalization, working on the Coral USB Accelerator?
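
For reference, the conversion step I'm running looks roughly like this (a sketch, assuming a representative_data_gen generator that yields float32 input batches):

# Post-training full integer quantization targeting the Edge TPU
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Calibration data for the integer quantization
converter.representative_dataset = representative_data_gen
# Require int8 ops only, so the Edge TPU compiler can map the whole graph
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

with open('model_quant.tflite', 'wb') as f:
    f.write(tflite_model)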

Many thanks for your help and advice in advance!

Hi @Olivier_Staehli

Why do you say that the problem is with batch normalization? Have you tried without it? Have you tried the model without integer quantization to check it?

Hi @George_Soloupis, thanks for your questions. I have tried without it, but the results did not make sense, since batch normalization seems to be a crucial part of MobileNet. However, I built a network with only a few layers and checked each part of MobileNet separately. There, I figured out that batch normalization is the issue (GlobalAveragePooling2D() also makes the result less accurate, but the effect is not very significant and it can be replaced by MaxPooling). The results of the non-quantized TensorFlow Lite model are the same as those of the TensorFlow model. Therefore, I concluded that it must be BatchNormalization in the integer quantization step.

Nice!!

It seems that you have tried a lot! I will look into this particular issue myself, where BatchNormalization is used during full integer quantization. In the meantime I will also tag @Sayak_Paul and hope that he has done full integer quantization on models that contain BatchNormalization and can shed some light. I think, though, that for this particular issue you have to provide a notebook so we can check the structure and the representative dataset that is used.

I would try with a larger representative dataset.

Also, when you are using the integer quantized model, it’s important to account for the scale and offset as shown here:
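
A minimal sketch of what that handling looks like, assuming a fully integer-quantized model saved as model_quant.tflite with a single int8 input and output:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='model_quant.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Quantize the float input with the model's input scale and zero point
scale, zero_point = input_details['quantization']
sample = np.random.rand(1, 64, 64, 1).astype(np.float32)  # stand-in for one spectrogram
quantized = np.round(sample / scale + zero_point).astype(np.int8)

interpreter.set_tensor(input_details['index'], quantized)
interpreter.invoke()

# Dequantize the int8 output back to float with the output parameters
out_scale, out_zero_point = output_details['quantization']
raw = interpreter.get_tensor(output_details['index']).astype(np.float32)
outputs = (raw - out_zero_point) * out_scale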

BatchNormalization should not be an issue here.

Batch normalization is actually well-supported: during conversion it is folded into the weights of the preceding convolution, so no standalone BatchNorm op ends up in the TFLite graph. Refer here:

Also, please try with the most recent stable versions of the libraries that you are using.

Thanks a lot for your support! Here is a link to the Jupyter Notebook
(This is my first ML project, so I may well have made a beginner's mistake somewhere.)

This is the training and validation data set

And this is the independent test set

Many thanks @Sayak_Paul! The representative dataset seems to be the issue!

I just use this method:

def representative_data_gen():
    # Yield one preprocessed sample at a time for calibration
    for input_value in tf.data.Dataset.from_tensor_slices(test_audio).batch(1).take(250):
        yield [input_value]

When I change the value, e.g. from 100 to 250, the predictions change drastically (for the worse). Is there a method or formula to figure out how large the representative dataset should be? I also tried quantization-aware training and could not get a better result. Therefore I thought it could not be something in the actual quantization process, but I was wrong about that assumption.

If quantization-aware training is giving you worse results, then I suspect something is definitely wrong in your data input pipeline. I would investigate the data preprocessing functions really carefully and would also look at their descriptive statistics. Is test_audio preprocessed? The TFLiteConverter expects it to be, i.e. the representative dataset should go through exactly the same preprocessing as your training data. It's a different story if your model already contains the preprocessing layers inside it.
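
For example, a quick check along these lines (a sketch; train_audio stands in for whatever array holds your preprocessed training data):

import numpy as np

# Large mismatches in range, mean, or spread between the training data and
# the representative/test data usually point to a preprocessing inconsistency.
for name, data in [('train', train_audio), ('test', test_audio)]:
    data = np.asarray(data, dtype=np.float32)
    print(f'{name}: min={data.min():.4f} max={data.max():.4f} '
          f'mean={data.mean():.4f} std={data.std():.4f}')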

It really depends on the dataset and the problem you are working on, and also on how much quality you can afford to lose. In my experience, a sample size of 100-1000 usually works. Therefore, I would recommend plotting a graph of representative dataset size vs. model performance and picking a size from there.
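
Something along these lines (a sketch; evaluate_tflite is a hypothetical helper that runs a converted model over your test set and returns its accuracy):

import tensorflow as tf

def make_representative_gen(n):
    # Build a calibration generator that yields n samples, one at a time
    def gen():
        for input_value in tf.data.Dataset.from_tensor_slices(test_audio).batch(1).take(n):
            yield [input_value]
    return gen

results = {}
for n in [50, 100, 250, 500, 1000]:
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = make_representative_gen(n)
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    results[n] = evaluate_tflite(converter.convert())  # hypothetical evaluation helper
    print(n, results[n])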

Lastly, I would come back to this suggestion:
