Videos are sequences of images. Modeling video clips has typically required an image representation model (a CNN) and a sequence model (an RNN, LSTM, etc.) working together. While this approach is intuitive, what if a single model could handle both the spatial and temporal dimensions?
In our latest keras.io example (with @ayush_thakur), we minimally implement ViViT: A Video Vision Transformer by Arnab et al., a pure Transformer-based model for video classification. The authors propose a novel spatio-temporal embedding scheme and several Transformer variants for modeling video clips.
arXiv: https://arxiv.org/abs/2103.15691
Tutorial: Video Vision Transformer
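The embedding scheme can be sketched minimally as a "tubelet" embedding: a 3D convolution that extracts non-overlapping spatio-temporal patches from the clip and projects each one to a token. The shapes below (8-frame 32x32 clips, 8x8x8 tubelets, 64-dim tokens) are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from tensorflow.keras import layers


class TubeletEmbedding(layers.Layer):
    """Sketch of a tubelet embedding: Conv3D with stride == kernel size
    carves the clip into non-overlapping spatio-temporal patches and
    linearly projects each one to an embedding vector (a token)."""

    def __init__(self, embed_dim=64, patch_size=(8, 8, 8), **kwargs):
        super().__init__(**kwargs)
        self.projection = layers.Conv3D(
            filters=embed_dim,
            kernel_size=patch_size,
            strides=patch_size,  # non-overlapping tubelets
            padding="valid",
        )
        # Flatten the (time, height, width) grid of tubelets into a
        # single token sequence for the Transformer encoder.
        self.flatten = layers.Reshape(target_shape=(-1, embed_dim))

    def call(self, videos):
        # videos: (batch, frames, height, width, channels)
        return self.flatten(self.projection(videos))


# Two dummy clips: 8 frames of 32x32 RGB images.
videos = np.random.rand(2, 8, 32, 32, 3).astype("float32")
tokens = TubeletEmbedding()(videos)
print(tokens.shape)  # (2, 16, 64): 1 temporal x 4 x 4 spatial tubelets
```

Because each tubelet spans several frames, temporal information is fused at tokenization time rather than left entirely to the Transformer, which is what lets a single model take over from the CNN + RNN pairing.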