Implementing "ViViT: A Video Vision Transformer"

Videos are sequences of images. Modelling video clips therefore typically requires an image representation model (a CNN) paired with a sequence model (an RNN, LSTM, etc.). While this approach is intuitive, what if a single model could handle both the spatial and temporal dimensions?

In our latest example (with @ayush_thakur), we minimally implement ViViT: A Video Vision Transformer by Arnab et al., a pure Transformer-based model for video classification. The authors propose a novel spatio-temporal embedding scheme and several Transformer variants for modelling video clips.
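The core of the embedding scheme is to tokenize a clip with spatio-temporal "tubelets" rather than per-frame patches. As a rough sketch (not the paper's implementation, which uses a 3D convolution, and with hypothetical clip and tubelet sizes chosen for illustration), extracting non-overlapping tubelets can be done with plain array reshaping:

```python
import numpy as np

# Hypothetical example clip: 8 frames of 32x32 RGB images.
video = np.random.rand(8, 32, 32, 3)
t, h, w = 2, 8, 8  # assumed tubelet (spatio-temporal patch) size

T, H, W, C = video.shape
# Split the clip into non-overlapping (t, h, w) blocks,
# then flatten each block into one token vector.
tubelets = (
    video.reshape(T // t, t, H // h, h, W // w, w, C)
         .transpose(0, 2, 4, 1, 3, 5, 6)   # group block indices first
         .reshape(-1, t * h * w * C)       # (num_tokens, token_dim)
)
print(tubelets.shape)  # (64, 384): 4*4*4 tokens, each 2*8*8*3 values
```

In ViViT itself each tubelet is then linearly projected to the Transformer's embedding dimension, which in practice is done efficiently with a single 3D convolution whose kernel and stride equal the tubelet size.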

Tutorial: Video Vision Transformer