Swipe gesture recognition: where to start?


I want to develop a system (or implement an existing one) for hand-based swipe gesture recognition.
I am relatively new to the topic but have used Python and TensorFlow a bit before.
Do you have ideas on where to start, or could you point out some resources or general methods for approaching it?
Our system has a z cam, so we could go either fully image-based or 3D-point-based by tracking hand positions in 3D.

Thank you in advance,

Take a look at:

I wanted to give a short update so that everyone who is interested can follow this as documentation as well. I would also be happy about advice on my current issue. Also, thanks for pointing this out, @Bhack.

From this and a couple of other references, I learned that a good way to train for swiping gestures is to use the 20BN Jester dataset. It is no longer available at full scale, but you can get a subset of it on Kaggle.

I managed to extract my relevant gestures [Swiping Right, Swiping Left, Swiping Up, Swiping Down, No gesture]; in total that came to approximately 8000 videos for training and 1700 videos for classification.
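
For anyone repeating the extraction step, here is a minimal sketch of filtering a label file down to these five classes. It assumes a semicolon-separated `video_id;label` file as in the original Jester release; the Kaggle subset may be laid out differently, and the name `filter_labels` as well as the sample rows are made up for illustration.

```python
import csv
import io

# The five classes kept for swipe recognition (names as used in this post).
WANTED = {"Swiping Right", "Swiping Left", "Swiping Up", "Swiping Down", "No gesture"}

def filter_labels(lines, wanted=WANTED):
    """Return (video_id, label) pairs whose label is in `wanted`.

    `lines` is any iterable of text lines in the assumed `video_id;label` format.
    """
    reader = csv.reader(lines, delimiter=";")
    return [(row[0], row[1]) for row in reader if len(row) == 2 and row[1] in wanted]

# Tiny made-up sample standing in for the real label file.
sample = io.StringIO(
    "1;Swiping Left\n"
    "2;Drumming Fingers\n"
    "3;No gesture\n"
    "4;Swiping Up\n"
)
kept = filter_labels(sample)
print(kept)
```

With a real label file, `filter_labels(open(path))` would be the equivalent call, and the returned ids can then be used to copy only the matching video folders.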

I tried a couple of repos, which I forked and updated. I also added my Google Colab code to them, so if anyone wants to work with it, you are welcome to!

The first one I tried uses a 3D CNN (GitHub - lunanane/3D-CNN-Gesture-recognition: Gesture recognition using tensorflow from a large video database).
It was very outdated and has a very slow image resizing step inside, which I tried to counter by making a separate Colab that resizes before training.
I got this working after fixing deprecated code, but the training didn’t seem to run correctly: it reported a loss of NaN and an accuracy of 1.0 during training. The resulting model always predicts “Swiping Left”. Still, this was a great start for learning, and the code probably just has to be updated a bit for the newer TensorFlow version. (There was no info about which TensorFlow version was used initially, so I couldn’t build a matching environment and had to update.)

Then I tried a repo that uses MediaPipe to extract the hand pose from the videos. The model can be found in my Git as well, under the training-media-pipe-model repo.
It would then use a k-nearest neighbor approach to predict which gesture was performed.
At least in theory. In practice I could not make it work because the version info for the dependencies was missing. It is also broken with the newer MediaPipe: the code for extracting the hand pose plot from video just doesn’t work, and I couldn’t wrap my head around updating it to the newest MediaPipe in the time I had. So I couldn’t even move on to the k-nearest neighbor training.
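
To make the idea concrete, here is a minimal sketch of k-nearest-neighbor classification on hand trajectories. It is not the code from that repo: the feature (net x/y displacement of one tracked hand point), the helper names, and the toy training data are all assumptions for illustration, and so is the sign convention (y grows downward, as in image coordinates).

```python
import math
from collections import Counter

def displacement(track):
    """Net (dx, dy) between the first and last point of a track of (x, y) positions."""
    (x0, y0), (x1, y1) = track[0], track[-1]
    return (x1 - x0, y1 - y0)

def knn_predict(train, query, k=3):
    """Majority vote among the k training features closest to `query`.

    `train` is a list of (feature, label) pairs; features are (dx, dy) tuples.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set in normalized image coordinates (made up for illustration).
train = [
    ((0.5, 0.0), "Swiping Right"), ((0.4, 0.05), "Swiping Right"),
    ((-0.5, 0.0), "Swiping Left"), ((-0.45, -0.05), "Swiping Left"),
    ((0.0, -0.5), "Swiping Up"), ((0.05, -0.4), "Swiping Up"),
    ((0.0, 0.5), "Swiping Down"), ((-0.05, 0.45), "Swiping Down"),
    ((0.0, 0.0), "No gesture"), ((0.02, -0.01), "No gesture"),
]

# A hand point moving from left to right across the frame.
track = [(0.2, 0.5), (0.4, 0.52), (0.65, 0.5)]
print(knn_predict(train, displacement(track)))
```

A real version would take the track from the hand landmarks of each video and would likely need richer features than a single displacement vector.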

Then I finally found this repo: GitHub - lunanane/Gesture-Recognition: To recognize 5 types of hand movements like left swipe, right swipe, up swipe, down swipe and hold still.
It is relatively up to date and uses a Keras model. I got everything working; the Colab files are in the repo as well.
It is now training well, and the metrics show the accuracy increasing slowly during training.
So I think I probably have a correct model. My current problem is that this repo didn’t come with a camera app, so I am trying to write one right now. But I can’t seem to match the shape of the webcam input to the model’s input shape.

The model that was used looks like this:

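# write your model here
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# define parameters
n_output = 5  # number of classes in case of classification, 1 in case of regression
output_activation = 'softmax'  # "softmax" or "sigmoid" in case of classification, "linear" in case of regression

# create architecture
model1 = Sequential()
# ... (the convolution, normalization, pooling and dropout layers are missing from this listing; see the summary below)
model1.add(Dense(512, activation='relu'))
# ...
model1.add(Dense(256, activation='relu'))
# ...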

After compiling, this is the output:

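from tensorflow import keras
# from keras.optimizer_v2.adam import Adam as Adam

# `learning_rate` replaces the deprecated `lr` argument
optimiser = keras.optimizers.Adam(learning_rate=0.01)
model1.compile(optimizer=optimiser, loss='categorical_crossentropy', metrics=['categorical_accuracy'])

# summary() prints the table itself and returns None, so no print() is needed
model1.summary()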

Model: "sequential"
 Layer (type)                Output Shape              Param #   
 conv3d (Conv3D)             (None, 15, 100, 100, 32)  800       
 batch_normalization (BatchN  (None, 15, 100, 100, 32)  128      
 max_pooling3d (MaxPooling3D  (None, 7, 50, 50, 32)    0         
 dropout (Dropout)           (None, 7, 50, 50, 32)     0         
 conv3d_1 (Conv3D)           (None, 7, 50, 50, 32)     8224      
 batch_normalization_1 (Batc  (None, 7, 50, 50, 32)    128       
 max_pooling3d_1 (MaxPooling  (None, 3, 25, 25, 32)    0         
 dropout_1 (Dropout)         (None, 3, 25, 25, 32)     0         
 conv3d_2 (Conv3D)           (None, 3, 25, 25, 64)     2112      
 batch_normalization_2 (Batc  (None, 3, 25, 25, 64)    256       
 max_pooling3d_2 (MaxPooling  (None, 3, 12, 25, 64)    0         
 dropout_2 (Dropout)         (None, 3, 12, 25, 64)     0         
 conv3d_3 (Conv3D)           (None, 3, 12, 25, 128)    221312    
 batch_normalization_3 (Batc  (None, 3, 12, 25, 128)   512       
 max_pooling3d_3 (MaxPooling  (None, 1, 6, 25, 128)    0         
 dropout_3 (Dropout)         (None, 1, 6, 25, 128)     0         
 global_average_pooling3d (G  (None, 128)              0         
 dropout_4 (Dropout)         (None, 128)               0         
 dense (Dense)               (None, 512)               66048     
 batch_normalization_4 (Batc  (None, 512)              2048      
 dense_1 (Dense)             (None, 256)               131328    
 batch_normalization_5 (Batc  (None, 256)              1024      
 dense_2 (Dense)             (None, 5)                 1285      
Total params: 435,205
Trainable params: 433,157
Non-trainable params: 2,048

In the camera app I am trying to match this input shape, which as far as I understand is (None, 15, 100, 100, 3); please correct me if I am wrong.
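
One way to sanity-check the channel count is to work backwards from the parameter counts in the summary above. A Conv3D layer with bias has kd * kh * kw * C_in * C_out + C_out parameters. The sketch below assumes cubic kernels, which the summary does not state; under that assumption the first layer's 800 parameters fit a 2x2x2 kernel on 3 input channels, which is consistent with (None, 15, 100, 100, 3). The helper names are made up for illustration.

```python
def conv3d_params(kernel, c_in, c_out):
    """Parameter count of a Conv3D layer with bias: weights plus one bias per filter."""
    kd, kh, kw = kernel
    return kd * kh * kw * c_in * c_out + c_out

def cubic_kernels_matching(params, c_in, c_out, max_size=7):
    """Cubic kernel sizes k (k x k x k) that reproduce `params`; assumes a cubic kernel."""
    return [k for k in range(1, max_size + 1)
            if conv3d_params((k, k, k), c_in, c_out) == params]

# (param count, input channels, filters) read off the summary above.
layers = [(800, 3, 32), (8224, 32, 32), (2112, 32, 64), (221312, 64, 128)]
for params, c_in, c_out in layers:
    print(params, cubic_kernels_matching(params, c_in, c_out))
```

No cubic kernel reproduces 800 parameters with a single input channel, so a grayscale input would only fit with a non-cubic kernel.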

I then try to run the camera app like this:

import cv2
import numpy as np
from PIL import Image
from keras import models

# Load the saved model
model = models.load_model('model-00001.h5')
video = cv2.VideoCapture(0)

while True:
    ok, frame = video.read()
    if not ok:
        break

    # OpenCV delivers BGR frames, so convert to RGB before building the PIL image
    im = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    # Resize to 100x100 because the model was trained with this image size
    im = im.resize((100, 100))
    img_array = np.array(im)

    # The model takes a 5D tensor (batch x frames x height x width x channels).
    # The two expand_dims calls turn 100x100x3 into 1x1x100x100x3, i.e. a single frame.
    img_array = np.expand_dims(img_array, axis=0)
    img_array = np.expand_dims(img_array, axis=0)

    # Call predict on the model and take the first value of the first output row
    prediction = int(model.predict(img_array)[0][0])

    # If prediction is 0, show the frame in gray
    if prediction == 0:
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    cv2.imshow("Capturing", frame)

    # Quit when 'q' is pressed
    key = cv2.waitKey(1)
    if key == ord('q'):
        break

video.release()
cv2.destroyAllWindows()

I get an error on execution:

Model was constructed with shape (None, 15, 100, 100, 3) for input KerasTensor(type_spec=TensorSpec(shape=(None, 15, 100, 100, 3), dtype=tf.float32, name='conv3d_input'), name= …
… but it was called on an input with incompatible shape (None, 1, 100, 100, 3)

And here I am a bit stuck. Does anyone have an idea how to write the camera app so that it can run this particular model and return the indices of the different gestures?

… and sorry for the long post. I just imagine that someone else who finds my documentation and the problems I faced will not have to work all this out from scratch again.

Isn’t it just that the model expects 15 frames?
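
If so, the camera loop would need to collect 15 consecutive frames before each prediction instead of passing a single one. Here is a minimal sketch of that buffering, using random arrays in place of webcam frames so it runs without a camera or the trained model. `FrameBuffer`, `decode` and `GESTURES` are hypothetical names, the class order is an assumption that has to match the label encoding used in training, and the division by 255 assumes the training generator scaled pixels the same way.

```python
from collections import deque
import numpy as np

N_FRAMES = 15  # temporal depth from the model summary: (None, 15, 100, 100, 3)

# Assumed class order; it must match the label encoding used during training.
GESTURES = ["Swiping Left", "Swiping Right", "Swiping Up", "Swiping Down", "No gesture"]

class FrameBuffer:
    """Keeps the most recent N_FRAMES frames and turns them into one model input."""

    def __init__(self, n_frames=N_FRAMES):
        self.frames = deque(maxlen=n_frames)

    def push(self, frame):
        """Add one resized RGB frame of shape (100, 100, 3)."""
        self.frames.append(frame)

    def ready(self):
        """True once enough frames have been collected for one prediction."""
        return len(self.frames) == self.frames.maxlen

    def batch(self):
        """Stack the frames into shape (1, n_frames, 100, 100, 3), scaled to [0, 1]."""
        clip = np.stack(list(self.frames), axis=0).astype(np.float32) / 255.0
        return np.expand_dims(clip, axis=0)

def decode(probabilities):
    """Map one row of softmax outputs to (class index, gesture name)."""
    index = int(np.argmax(probabilities))
    return index, GESTURES[index]

# Stand-in for the webcam: 20 random 100x100 RGB frames.
rng = np.random.default_rng(0)
buffer = FrameBuffer()
for _ in range(20):
    buffer.push(rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8))

clip = buffer.batch()
print(clip.shape)  # (1, 15, 100, 100, 3)
print(decode(np.array([0.1, 0.7, 0.1, 0.05, 0.05])))  # (1, 'Swiping Right')
```

In the real loop, each resized webcam frame would go through `buffer.push(...)`, and once `buffer.ready()` is true, `model.predict(buffer.batch())[0]` would be passed to `decode` to get the gesture index.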