How to input videos in Video Vision Transformer?


I would like to run the ViViT example in Keras using my own dataset. I have two folders with videos: the first folder corresponds to class X and the second to another class. I do not understand how to input videos into the Video Vision Transformer. Could you please help me?


Hi @MSRIC, as far as I know, the simplest approach to classifying videos is to apply an image classification model to individual frames and then use a sequence model to learn from the sequences of image features.
For extracting frames from a video you can use OpenCV. For example, to read a video file with cv2:
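The frame-features-plus-sequence-model idea above can be sketched in Keras. This is a minimal sketch with made-up shapes and layer sizes, not the ViViT architecture itself: a small CNN is applied to each frame via TimeDistributed, and an LSTM learns from the resulting feature sequence.

```python
# Minimal sketch: per-frame CNN features + LSTM over the frame sequence.
# Shapes and layer sizes are illustrative assumptions, not from the ViViT example.
from tensorflow import keras
from tensorflow.keras import layers

num_frames, height, width, channels = 20, 64, 64, 3

# Small CNN that turns one frame into a feature vector
frame_model = keras.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(height, width, channels)),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
])

# Apply the CNN to every frame, then model the sequence with an LSTM
model = keras.Sequential([
    layers.TimeDistributed(frame_model,
                           input_shape=(num_frames, height, width, channels)),
    layers.LSTM(32),
    layers.Dense(2, activation="softmax"),  # two classes, as in your dataset
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The model then expects input batches of shape (batch, frames, height, width, channels), which is why the frame-extraction steps below matter.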

import cv2

# Open the video file for reading
cap = cv2.VideoCapture('/content/DeepfakeTIMIT/DeepfakeTIMIT/higher_quality/fadg0/sa1-video-fram1.avi')

To get the number of frames present in a video file:

fps = cap.get(cv2.CAP_PROP_FPS)
print("Frames per second:", fps)

totalNoFrames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
print("Total number of frames:", totalNoFrames)

durationInSeconds = totalNoFrames / fps
print("Video duration in seconds:", durationInSeconds, "s")

For the example video above this prints:

Frames per second: 25.0
Total number of frames: 119.0
Video duration in seconds: 4.76 s

Based on your requirement (the number of frames you want), you can collect specific frames from the video by seeking to them and reading:

cap.set(cv2.CAP_PROP_POS_FRAMES, frame_count)
ret, frame = cap.read()
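Putting that together, here is a minimal sketch (the helper names are my own) that samples a fixed number of evenly spaced frames from a video using exactly that seek-and-read pattern:

```python
# Minimal sketch: pick evenly spaced frame indices, then seek to each one
# with cap.set and decode it with cap.read. Helper names are illustrative.

def evenly_spaced_indices(total_frames, num_samples):
    """Spread num_samples frame indices evenly across the whole video."""
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def sample_frames(path, num_samples=10):
    import cv2  # imported here so the index helper above has no dependencies
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in evenly_spaced_indices(total, num_samples):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the frame
        ok, frame = cap.read()                 # then decode it
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```

For the 119-frame example above, `evenly_spaced_indices(119, 10)` would pick frames spread across the full 4.76 seconds rather than only the beginning of the clip.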

Save the frames into directories named after their labels. You can then train an image classification model on the frames collected from the videos. To make predictions, convert the input video to frames in the same way and run the model on those frames.
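For example, a minimal sketch of that export step (the folder layout and helper names are my own assumptions): it walks one sub-folder per class and writes every frame under an output directory named after the label, which is the layout Keras-style image loaders expect.

```python
# Minimal sketch: export frames into out_root/<label>/<video>_f<idx>.jpg.
# Folder layout and helper names are illustrative assumptions.
import os

def frame_output_path(out_root, label, video_name, frame_idx):
    """Build e.g. frames/classX/video1_f0003.jpg, creating the label folder."""
    os.makedirs(os.path.join(out_root, label), exist_ok=True)
    stem = os.path.splitext(video_name)[0]
    return os.path.join(out_root, label, f"{stem}_f{frame_idx:04d}.jpg")

def export_frames(video_dir, out_root):
    import cv2  # imported here so the path helper above has no dependencies
    for label in os.listdir(video_dir):          # one sub-folder per class
        for video_name in os.listdir(os.path.join(video_dir, label)):
            cap = cv2.VideoCapture(os.path.join(video_dir, label, video_name))
            idx = 0
            while True:
                ok, frame = cap.read()
                if not ok:                       # end of video
                    break
                cv2.imwrite(frame_output_path(out_root, label,
                                              video_name, idx), frame)
                idx += 1
            cap.release()
```

With the frames laid out like this, the directory name itself serves as the class label for both training and prediction.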

Thank You.