Knowledge distillation with "Function Matching"

Hi folks,

Today I am pleased to open-source code that implements the recipes from the paper "Knowledge distillation: A good teacher is patient and consistent" (function matching) and reproduces its results on three benchmark datasets: Pet37, Flowers102, and Food101.

Importance: Knowledge distillation matters because it is practically useful. With the recipes from "function matching", we can now perform knowledge distillation in a principled way and obtain student models that can actually match the performance of their teacher models. This essentially allows us to compress bigger models into (much) smaller ones, reducing storage costs and improving inference speed.
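For readers new to the idea, here is a minimal sketch of the core objective as I understand it: the teacher and the student see the same (mixup-augmented) input, and the student is trained to match the teacher's predictive distribution via a KL divergence. This is not the repository's code. The toy "models", the temperature value, and all function names below are assumptions made purely for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mixup(x1, x2, lam):
    """Blend two inputs; no labels are mixed because the teacher supplies the targets."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]


def distillation_loss(teacher_fn, student_fn, x, temperature=2.0):
    """KL between teacher and student predictions on the same input view."""
    p_teacher = softmax(teacher_fn(x), temperature)
    p_student = softmax(student_fn(x), temperature)
    return kl_divergence(p_teacher, p_student)


# Hypothetical stand-ins for real networks: each maps a 2-d input to 3 logits.
def toy_teacher(x):
    return [2.0 * x[0] - x[1], x[1], -x[0]]


def toy_student(x):
    return [1.5 * x[0] - x[1], 0.5 * x[1], -x[0]]


random.seed(0)
x_a, x_b = [1.0, 2.0], [0.5, -1.0]
lam = random.random()  # mixing coefficient (assumed uniform here)
x_mixed = mixup(x_a, x_b, lam)

# Both models receive the identical mixed input, which is the "consistent" part.
loss = distillation_loss(toy_teacher, toy_student, x_mixed)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop this loss would be minimized with respect to the student's weights only, over a long schedule, with the teacher kept frozen.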

Some features of the repository I wanted to highlight:

  • The code is provided as Kaggle Kernel Notebooks so that you can use the free TPU v3-8 hardware. This is important because the training schedules are comparatively long.
  • There’s a notebook on distributed hyperparameter tuning, something that is often left out of the public release of an implementation.
  • For reproducibility and convenience, I have provided pre-trained models and TFRecords for all the datasets I used.

Here’s a link to the repository

I’d like to sincerely thank Lucas Beyer (first author of the paper) for providing crucial feedback on the earlier implementations, the ML-GDE program for the GCP support, and TRC for providing TPU access. For any questions, either create an issue in the repository directly or email me.

Thank you for reading!

Very exciting Sayak, thanks!

Thanks. There is also work from Google suggesting that knowledge distillation is something we need to handle with care:

Yes, I am aware of this paper; it came out around the same time.

Here’s another from MSFT Research in case you want to dive even further:

There are also works that show how knowledge distillation may ignore the skewed portions of the dataset.

But I think business needs and objectives determine how careful one has to be. Knowledge distillation has the promise to account for both without adding too much technical debt or complexity to the overall pipeline. Among the three popular compression schemes (quantization, pruning, and distillation), distillation is my favourite.

I think distillation and pruning are more similar to each other, as they both rely on overparameterization, and in distillation you control the prior of the student model architecture.

But if pruning could become efficiently structured, I think that at some point the two approaches could converge into something else. See some early DeepMind findings:

Nice. Very rich exchange of information.

Thanks 🙂
