Transformers for image prediction


I was wondering if Transformers can be used for image prediction?
For example: you’ve got a sequence of images and you want to predict what the next image in the sequence will be.

I’ve found plenty of examples for image classification (ViT), but none for prediction. (If you have any examples, I’d appreciate the help.)


Hi, welcome to the TF Forum!

Great question.

There’s a fairly recent paper by Google Research from 2021 called VATT (Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text) that was published with TensorFlow 2 code.
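
In case it helps to make the task concrete: next-image prediction is usually framed as sequence modelling, where the model sees a window of past frames and is trained to output the frame that follows. Below is a minimal sketch of just that data framing in plain Python, with no model attached. The helper name `make_next_frame_pairs` and its `context_len` parameter are made up for illustration; they are not taken from VATT or from any library.

```python
def make_next_frame_pairs(frames, context_len):
    """Split a frame sequence into (context, target) training pairs.

    Hypothetical helper for illustration only: `frames` is any indexable
    sequence, and `context_len` is how many past frames the model sees.
    """
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    pairs = []
    # Slide a window over the sequence; the frame right after
    # each window is the prediction target for that window.
    for start in range(len(frames) - context_len):
        context = list(frames[start:start + context_len])
        target = frames[start + context_len]
        pairs.append((context, target))
    return pairs


# Stand-in "frames": in practice each item would be an image array.
frames = ["f0", "f1", "f2", "f3", "f4"]
for context, target in make_next_frame_pairs(frames, context_len=3):
    print(context, "->", target)
```

A Transformer would then take each context as its input tokens (for example, patch embeddings of the frames) and be trained to reconstruct the target frame. That modelling part is left out here on purpose.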

Other research papers that may be related to this task: