TensorFlow Model Optimization

Hello, everyone!

I am currently trying to use the TensorFlow model optimization library to reduce the memory footprint of the models my workplace is running in our production environment. Specifically, I am testing the tfmot.sparsity.keras.prune_low_magnitude API to prune model weights and get the model to a certain percentage of sparsity.

I am very new to this topic and have had mixed results so far, so I wanted to reach out to people who know more about it and have perhaps used this tool before.

So far I am trying to prune a very basic autoencoder-like model. I am not sure how to paste the code here, but here is a link to the same question I posted on SO

The goal of this project is to reduce the file size of the model.pb protobuf file so that, when the model is loaded in a TensorFlow Serving container, the container needs less memory to load and run it. As a result, the cost of running that container in the cloud should also be lower (my logic is less RAM needed → smaller bill at the end).

One thing I want to confirm is whether I am correct in expecting that a pruned model with, say, 95% sparsity should have a smaller file size when saved in SavedModel format as a protobuf file.
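To make that expectation concrete, here is a minimal sketch in plain Python (no TensorFlow) of what magnitude pruning does to a weight buffer. It assumes the weights are serialized as a dense float32 array, which is my understanding of how variables are normally stored; the array size, the seed, and the helper names are made up for illustration. Under that assumption, zeroed weights still occupy four bytes each, so the raw size does not change, and the saving only shows up once the bytes are compressed.

```python
import gzip
import random
import struct


def prune_by_magnitude(weights, sparsity):
    """Return a copy with the smallest-magnitude fraction of weights set to 0.0."""
    n_zero = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_zero]:
        pruned[i] = 0.0
    return pruned


def to_float32_bytes(weights):
    """Serialize the weights as a dense little-endian float32 buffer."""
    return struct.pack(f"<{len(weights)}f", *weights)


# Hypothetical layer: 20,000 random weights, pruned to 95% sparsity.
rng = random.Random(0)
dense = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
pruned = prune_by_magnitude(dense, sparsity=0.95)

dense_bytes = to_float32_bytes(dense)
pruned_bytes = to_float32_bytes(pruned)

# Dense storage: every value takes 4 bytes, zero or not.
raw_sizes = (len(dense_bytes), len(pruned_bytes))
# Compressed storage: long runs of zeros compress well.
gzip_sizes = (len(gzip.compress(dense_bytes)), len(gzip.compress(pruned_bytes)))

print("raw bytes  (dense, pruned):", raw_sizes)
print("gzip bytes (dense, pruned):", gzip_sizes)
```

If this assumption holds for SavedModel, comparing the gzipped export directories rather than the raw files would be the fairer test of whether pruning helped.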

If that is indeed correct, then I do not understand the following. When I run the same code (as can be seen in the link) but change the code dimension of the autoencoder, the protobuf file for the pruned model is indeed smaller for low values of the code dimension. For larger values, however, the file for the pruned model stays the same size or in some cases even takes up more space. I can sort of understand it staying the same, since the number of weights remains the same, but taking up more space makes no sense to me. It seems to me that something must be wrong, but I fully reset the environment each time the script is run.
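One possible explanation for the larger file, offered as a guess rather than a diagnosis: as far as I know, the wrapper added by prune_low_magnitude keeps a mask variable with the same shape as each pruned kernel until it is removed with tfmot.sparsity.keras.strip_pruning. The sketch below only counts stored values for a hypothetical one-layer encoder and one-layer decoder; the input dimension of 784 and the code dimensions are invented, and biases and the small per-layer bookkeeping variables are ignored.

```python
def dense_kernel_params(input_dim, code_dim):
    """Kernel weights of a one-layer encoder plus a one-layer decoder (biases ignored)."""
    return input_dim * code_dim + code_dim * input_dim


def stored_values(input_dim, code_dim, keep_masks):
    """Values written to disk, assuming one mask entry per kernel weight when masks are kept."""
    kernels = dense_kernel_params(input_dim, code_dim)
    masks = kernels if keep_masks else 0
    return kernels + masks


INPUT_DIM = 784  # hypothetical input size, not taken from the original script

for code_dim in (8, 64, 512):
    stripped = stored_values(INPUT_DIM, code_dim, keep_masks=False)
    wrapped = stored_values(INPUT_DIM, code_dim, keep_masks=True)
    print(f"code_dim={code_dim:4d}  stripped={stripped:8d}  with masks={wrapped:8d}")
```

Under these assumptions the unstripped model stores about twice as many values as the original, and the absolute overhead grows with the code dimension, which would be consistent with the file getting larger rather than smaller.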

However, it is very possible that I am just misunderstanding what weight pruning is meant to achieve, or that I am missing some important implementation detail that causes this behavior. Unfortunately, I have not been able to find much online about optimizing production models meant for TF Serving, apart from quantization and storing models as tflite. Since, as I understand it, TF Serving does not work with tflite models, that is not very useful to me.

I have been banging my head on the table with this problem for a while now, so any help would be very much appreciated.