Distilling ViTs through attention

Hi folks,

I hope you’re doing well.

Researchers have explored many recipes for training Vision Transformers (ViTs) well. Two of the simplest have clearly stood out: training longer with stronger regularization, and distilling from a well-trained CNN.
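To make the distillation recipe concrete, here is a minimal sketch of DeiT-style "hard" distillation: the student's class head is trained on the ground-truth label, while a separate distillation head is trained on the teacher CNN's predicted (argmax) label, and the two losses are averaged. This is an illustrative NumPy sketch, not the project's actual implementation; the function names are mine.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    # DeiT "hard" distillation: the distillation head mimics the
    # teacher's argmax label, the class head fits the true label,
    # and the two terms are weighted equally.
    teacher_label = int(np.argmax(teacher_logits))
    return 0.5 * cross_entropy(cls_logits, label) + \
           0.5 * cross_entropy(dist_logits, teacher_label)
```

In the full DeiT setup, `cls_logits` and `dist_logits` come from two separate tokens (a class token and a distillation token) appended to the patch sequence.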

In my latest project, I implement the DeiT family of ViT models, port the pre-trained parameters into the implementation, and provide code for off-the-shelf inference, fine-tuning, visualizing attention rollout plots, and distilling ViT models through attention. Here are the important links:
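For those curious about the attention rollout plots mentioned above, the underlying computation is small: average each layer's attention map over heads, mix in the residual (identity) connection, re-normalize the rows, and multiply the layer maps together. A minimal NumPy sketch, assuming per-layer attention maps of shape `(num_heads, num_tokens, num_tokens)` (not the project's exact code):

```python
import numpy as np

def attention_rollout(attentions):
    # attentions: list of per-layer attention maps, each of shape
    # (num_heads, num_tokens, num_tokens).
    num_tokens = attentions[0].shape[-1]
    result = np.eye(num_tokens)
    for attn in attentions:
        attn = attn.mean(axis=0)                            # fuse heads
        attn = 0.5 * attn + 0.5 * np.eye(num_tokens)        # add residual
        attn = attn / attn.sum(axis=-1, keepdims=True)      # row-normalize
        result = attn @ result                              # accumulate layers
    return result
```

The row of the result corresponding to the class token can then be reshaped into a 2D map and overlaid on the input image.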

Fun fact: with DeiT, I crossed a century of models contributed to TF-Hub in about two years' time (101 models):


Thanks to @fchollet for reviewing the tutorial, and to @ariG23498, who implemented some portions of the ViTClassifier class shown in the tutorial.

Don’t hesitate to reach out if you have any questions. Have a great day!