I hope you’re doing well.
Researchers have explored several ways to train Vision Transformers (ViTs) well. Two recipes stand out as among the simplest ways to get a good ViT model: training for longer with more regularization, and distilling from a well-trained CNN.
In my latest project, I implement the DeiT family of ViT models, port the pre-trained parameters into the implementation, and provide code for off-the-shelf inference, fine-tuning, visualizing attention rollout plots, and distilling ViT models through attention. Here are the important links:
Code for all the implementations: sayakpaul/deit-tf on GitHub (includes PyTorch -> Keras model porting code for DeiT models, with fine-tuning and inference notebooks)
Pre-trained DeiT models in TensorFlow / Keras: TensorFlow Hub
Tutorial: Distilling Vision Transformers
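For anyone who wants the gist of the distillation recipe before opening the tutorial, here is a minimal, framework-free sketch of a DeiT-style hard-label distillation loss for a single example. It assumes the student has two heads (one on the class token, one on the distillation token) and that the teacher's top prediction serves as the target for the distillation head. The function names and the equal 0.5/0.5 weighting are my illustrative choices here, not code taken from the repository.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one example, given raw logits and an integer label."""
    # Subtract the max logit for numerical stability (log-sum-exp trick).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, true_label):
    """Hypothetical helper: average of two cross-entropy terms.

    - the class-token head is trained against the ground-truth label
    - the distillation-token head is trained against the teacher's argmax
    """
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = softmax_cross_entropy(cls_logits, true_label)
    loss_dist = softmax_cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_cls + 0.5 * loss_dist


# Example: three classes, ground truth is class 0, teacher prefers class 1.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.5, -1.0],
    dist_logits=[0.1, 1.5, 0.3],
    teacher_logits=[1.0, 5.0, 2.0],
    true_label=0,
)
print(f"hard distillation loss: {loss:.4f}")
```

In a real training loop the same idea would be applied per batch with the framework's own loss ops; the tutorial linked above covers the actual Keras implementation.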
Fun fact: with DeiT, I hit a century in the number of models I’ve contributed to TF-Hub (101 models) in about two years.
Don’t hesitate to reach out if you have any questions. Have a great day!