Distilling ViTs through attention

Hi folks,

I hope you’re doing well.

Researchers have explored many recipes for training Vision Transformers (ViTs) well. Two of the simplest have clearly stood out: training longer with stronger regularization, and distilling from a well-trained CNN.
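To make the distillation recipe concrete, here is a minimal sketch of DeiT-style "hard" distillation: the student's class head is trained on the ground-truth label, while a separate distillation head is trained on the teacher CNN's predicted (argmax) label, and the two losses are averaged. This is an illustrative NumPy sketch, not the project's actual implementation; the function names are mine.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    # DeiT "hard" distillation: the distillation head mimics the
    # teacher's argmax label, the class head fits the true label,
    # and the two terms are weighted equally.
    teacher_label = int(np.argmax(teacher_logits))
    return 0.5 * cross_entropy(cls_logits, label) + \
           0.5 * cross_entropy(dist_logits, teacher_label)
```

In the full DeiT setup, `cls_logits` and `dist_logits` come from two separate tokens (a class token and a distillation token) appended to the patch sequence.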

In my latest project, I implement the DeiT family of ViT models, port the pre-trained parameters into the implementation, and provide code for off-the-shelf inference, fine-tuning, visualizing attention rollout plots, and distilling ViT models through attention. Here are the important links:
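For those curious about the attention rollout plots mentioned above, the underlying computation is small: average each layer's attention map over heads, mix in the residual (identity) connection, re-normalize the rows, and multiply the layer maps together. A minimal NumPy sketch, assuming per-layer attention maps of shape `(num_heads, num_tokens, num_tokens)` (not the project's exact code):

```python
import numpy as np

def attention_rollout(attentions):
    # attentions: list of per-layer attention maps, each of shape
    # (num_heads, num_tokens, num_tokens).
    num_tokens = attentions[0].shape[-1]
    result = np.eye(num_tokens)
    for attn in attentions:
        attn = attn.mean(axis=0)                            # fuse heads
        attn = 0.5 * attn + 0.5 * np.eye(num_tokens)        # add residual
        attn = attn / attn.sum(axis=-1, keepdims=True)      # row-normalize
        result = attn @ result                              # accumulate layers
    return result
```

The row of the result corresponding to the class token can then be reshaped into a 2D map and overlaid on the input image.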

Fun fact: with DeiT, I crossed a century of models contributed to TF-Hub in about two years' time (101 models):


Thanks to @fchollet for reviewing the tutorial, and to @ariG23498, who implemented some portions of the ViTClassifier class shown in the tutorial.

Don’t hesitate to reach out if you have any questions. Have a great day!