Video Swin Transformer in Keras

We have reimplemented the Video Swin Transformer model in Keras, with support for multiple backend frameworks planned for the future. The pretrained weights are available in both SavedModel and H5 formats.

Video Swin Transformer is a pure transformer-based video modeling architecture that attains top accuracy on the major video recognition benchmarks.

Inference highlights:

>>> from videoswin import VideoSwinT

>>> model = VideoSwinT(num_classes=400)
>>> model.load_weights(
   'TFVideoSwinT_K400_IN1K_P244_W877_32x224.h5'
)
>>> container = read_video('sample.mp4')
>>> frames = frame_sampling(container, num_frames=32)
>>> y = model(frames)
>>> y.shape
TensorShape([1, 400])

>>> probabilities = tf.nn.softmax(y)
>>> probabilities = probabilities.numpy().squeeze(0)
>>> confidences = {
    label_map_inv[i]: float(probabilities[i])
    for i in np.argsort(probabilities)[::-1]
}
>>> confidences

Classification results on a sample from Kinetics-400:


{
    'playing_cello': 0.9941741824150085,
    'playing_violin': 0.0016851733671501279,
    'playing_recorder': 0.0011555481469258666,
    'playing_clarinet': 0.0009695519111119211,
    'playing_harp': 0.0007713600643910468
}
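The snippet above assumes `read_video`, `frame_sampling`, and `label_map_inv` utilities from the surrounding code. As an illustration, here is a minimal sketch of the frame sampler; uniform sampling and [0, 1] scaling are assumptions, and the repo's actual preprocessing may differ:

```python
import numpy as np

def frame_sampling(frames, num_frames=32):
    """Uniformly sample `num_frames` frames from a decoded clip.

    `frames` is an array of shape (T, H, W, 3), as returned by a video
    reader such as OpenCV or decord. A batch axis is added so the output
    matches the (1, num_frames, H, W, 3) input shape the model expects.
    """
    total = frames.shape[0]
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    clip = frames[indices].astype("float32") / 255.0  # scale to [0, 1]
    return clip[None, ...]  # add the batch dimension
```

Feeding a 100-frame clip through this helper yields a tensor of shape `(1, 32, 224, 224, 3)`, ready for `model(frames)`.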

A starter for fine-tuning the Video Swin Transformer on a custom dataset.
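The fine-tuning workflow can be sketched as below. A tiny stand-in backbone keeps the example self-contained; in practice you would build `VideoSwinT` with your own `num_classes` (here a hypothetical 10) and load the pretrained weights before compiling:

```python
import tensorflow as tf

def build_model(num_classes):
    # Stand-in for VideoSwinT: same (batch, 32, 224, 224, 3) input
    # contract, with a logits head sized for the custom dataset.
    inputs = tf.keras.Input(shape=(32, 224, 224, 3))
    x = tf.keras.layers.GlobalAveragePooling3D()(inputs)
    outputs = tf.keras.layers.Dense(num_classes)(x)  # logits
    return tf.keras.Model(inputs, outputs)

model = build_model(num_classes=10)
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),  # small LR for fine-tuning
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# train_ds / val_ds: tf.data pipelines yielding (clip, label) pairs
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```

A small learning rate is the usual choice when fine-tuning from Kinetics-400 weights, so the pretrained features are not destroyed early in training.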