Implementation of CaiT family of models

Why does the performance of deeper ViTs saturate on relatively small datasets? Architectures such as ResNets don’t suffer from this issue nearly as much. Can we separate class attention from the self-attention among the patches, thereby inducing a form of cross-attention?

CaiT (Going Deeper with Image Transformers) answers all these questions and provides solutions.
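To make the last question concrete, here is a minimal NumPy sketch of class attention as I understand it: only the class token issues a query, while keys and values come from the class token together with the patch tokens. It is a simplified, single-head illustration without biases, not the code from my project; the function name `class_attention` and the weight names `w_q`, `w_k`, `w_v` are hypothetical.

```python
import numpy as np


def class_attention(cls_token, patches, w_q, w_k, w_v):
    """Single-head class attention (illustrative sketch, no biases).

    cls_token: (1, dim) class embedding
    patches:   (num_patches, dim) patch embeddings
    w_q, w_k, w_v: (dim, dim) hypothetical projection matrices
    """
    # Keys and values see the class token and every patch token.
    tokens = np.concatenate([cls_token, patches], axis=0)  # (1 + N, dim)

    # Only the class token produces a query, so the patches are never updated.
    q = cls_token @ w_q  # (1, dim)
    k = tokens @ w_k     # (1 + N, dim)
    v = tokens @ w_v     # (1 + N, dim)

    # Scaled dot-product scores of the class token against all tokens.
    logits = q @ k.T / np.sqrt(q.shape[-1])  # (1, 1 + N)

    # Numerically stable softmax over the token axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Updated class token plus the attention map over [class, patches].
    return weights @ v, weights
```

The returned `weights` row is what one would reshape into a grid to look at how the class token attends to individual patches.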

In my latest project, I implement the CaiT family of models in TensorFlow, populated with the pre-trained parameters from the official CaiT codebase. The models have been evaluated on the ImageNet-1k validation set for correctness. The highest top-1 accuracy is 86.066% (trained only on ImageNet-1k).

Code, models, interactive demos, and notebooks for fine-tuning and off-the-shelf inference are here

Additionally, CaiT uses Talking-Heads attention, so this project could also serve as a reference TF implementation of it.
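For readers unfamiliar with it, the idea can be sketched in a few lines of NumPy: the attention logits are mixed across heads by a learned matrix before the softmax, and the attention weights are mixed again after it. This is a simplified sketch rather than the code from my project; it omits biases and dropout, and the names `talking_heads_attention`, `pre_softmax_mix`, and `post_softmax_mix` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def talking_heads_attention(q, k, v, pre_softmax_mix, post_softmax_mix):
    """Talking-heads attention (illustrative sketch, no biases or dropout).

    q, k, v: (heads, tokens, head_dim)
    pre_softmax_mix, post_softmax_mix: (heads, heads) mixing matrices
    """
    head_dim = q.shape[-1]

    # Standard scaled dot-product logits, computed per head.
    logits = np.einsum("hqd,hkd->hqk", q, k) / np.sqrt(head_dim)

    # Let the heads "talk": linearly mix the logits across the head axis.
    logits = np.einsum("hqk,hg->gqk", logits, pre_softmax_mix)

    weights = softmax(logits, axis=-1)

    # Mix once more across heads after the softmax.
    weights = np.einsum("hqk,hg->gqk", weights, post_softmax_mix)

    # Weighted sum of values, per head.
    return np.einsum("hqk,hkd->hqd", weights, v)
```

With identity matrices for both mixes, this reduces to ordinary multi-head attention, which is a handy sanity check when porting weights.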


Spatial-class relationships

Check out the demos on Hugging Face Spaces:


I’m fortunate to be able to dedicate significant time and money of my own supporting this and other open source projects. However, as the projects increase in scope, outside support is needed to continue with the current trajectory of cloud services, hardware, and electricity costs.

I have exactly the same issue.