I wanted to give a short update, so anyone who is interested can also follow this as documentation. I would also be happy about advice on my current issue. Thanks for pointing this out, @Bhack.
From this and a couple of other references, I learned that a good way to train for swiping gestures is the 20bn Jester dataset. It is no longer available in full, but a subset can be found on Kaggle.
I managed to extract the gestures relevant to me [Swiping Right, Swiping Left, Swiping Up, Swiping Down, No gesture]. In total that came to approximately 8000 videos for training and 1700 videos for classification.
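For anyone repeating the extraction step, it can be sketched roughly as below. This assumes the labels come as a semicolon-separated CSV with one `video_id;label` row per video and no header, which may not match the Kaggle subset exactly. The function name `filter_labels` and the sample rows are made up for illustration.

```python
import csv
import io

# The five classes kept for training (names as listed above).
WANTED = {"Swiping Right", "Swiping Left", "Swiping Up", "Swiping Down", "No gesture"}


def filter_labels(csv_text, wanted=WANTED, delimiter=";"):
    """Return (video_id, label) pairs whose label is in `wanted`.

    Assumes each row is `video_id<delimiter>label` with no header row.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    return [(row[0], row[1]) for row in reader if len(row) >= 2 and row[1] in wanted]


# Made-up sample rows, only to show the assumed layout.
sample = "1;Swiping Left\n2;Thumb Up\n3;No gesture\n4;Swiping Down\n"
kept = filter_labels(sample)
print(kept)
```

The returned video ids can then be used to copy only the matching video folders into the training and validation directories.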
I tried a couple of repos, which I forked and updated. I also added my Google Colab code to them, so anyone who wants to work with it is welcome to!
The first one I tried uses a 3D CNN ( GitHub - lunanane/3D-CNN-Gesture-recognition: Gesture recognition using tensorflow from a large video database)
It was very outdated and has a very slow image resizing step inside, which I tried to work around by making a separate Colab that resizes the images before training.
I got this working after fixing deprecated code, but the training didn’t seem to run correctly: it reported NaN as the loss and an accuracy of 1.0 during training. The resulting model always predicts “Swiping Left” - but this was a great start for learning, and the code probably just has to be updated a bit for the newer TensorFlow version. (There was no info about which TensorFlow version was used initially, so I couldn’t build a matching environment and had to update the code instead.)
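I have not confirmed what causes the NaN loss. As a rough sketch, these are checks that could be run on one batch before training to rule out common suspects: non-finite inputs, unscaled pixel values, and labels that are not one-hot. The function name `check_batch` and the array shapes are hypothetical and not taken from the repo.

```python
import numpy as np


def check_batch(x, y, n_classes):
    """Return a list of human-readable problems found in one training batch.

    x: float array of inputs; y: one-hot label array of shape (batch, n_classes).
    """
    problems = []
    if not np.all(np.isfinite(x)):
        problems.append("inputs contain NaN or inf")
    if np.nanmax(np.abs(x)) > 1.0:
        problems.append("inputs are not scaled to [-1, 1] (e.g. divide pixels by 255)")
    if y.ndim != 2 or y.shape[1] != n_classes:
        problems.append(f"labels have shape {y.shape}, expected (batch, {n_classes})")
    elif not np.allclose(y.sum(axis=1), 1.0):
        problems.append("label rows do not sum to 1 (not one-hot)")
    return problems


# Hypothetical batch: 2 clips of 15 small frames with raw 0..255 pixel values.
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(2, 15, 10, 10, 3)).astype("float32")
y = np.eye(5, dtype="float32")[[1, 3]]  # one-hot labels for classes 1 and 3
print(check_batch(x, y, n_classes=5))          # reports the scaling problem
print(check_batch(x / 255.0, y, n_classes=5))  # no problems found
```

An empty list does not prove the training is healthy. It only rules out these particular data problems, so a too-high learning rate or a mismatched loss would still need to be checked separately.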
Then I tried a repo which uses MediaPipe to extract the hand pose from the videos. The model can be found in my GitHub as well, under the training-media-pipe-model repo.
It would then use a k-nearest neighbor approach to predict which gesture was performed.
At least in theory. In practice I could not make it work, because the dependencies came without version information. It is also broken with newer MediaPipe, because the code for extracting the hand pose plot from video just doesn’t work, and I couldn’t wrap my head around updating it to the newest MediaPipe in the time I had. So I couldn’t even move on to the k-nearest neighbor training.
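Since I never reached the k-nearest neighbor step, here is only a minimal sketch of the idea and not the repo's code. It assumes each video has already been reduced to a small feature vector. The 2D features below (a made-up net hand movement per clip) and the name `knn_predict` are purely illustrative.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training vectors."""
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical features: net (dx, dy) hand movement over a clip.
train_features = [(0.9, 0.0), (0.8, 0.1), (-0.9, 0.0), (-0.7, -0.1), (0.0, 0.8), (0.1, 0.9)]
train_labels = ["Swiping Right", "Swiping Right", "Swiping Left",
                "Swiping Left", "Swiping Up", "Swiping Up"]

print(knn_predict(train_features, train_labels, (0.85, 0.05)))  # Swiping Right
```

With real hand landmarks the feature vectors would be much longer, but the voting logic would stay the same.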
Then I finally found this repo: GitHub - lunanane/Gesture-Recognition: To recognize 5 types of hand movements like left swipe, right swipe, up swipe, down swipe and hold still.
It is relatively up to date and uses a Keras model. I got everything working, and the Colab files are in the repo as well.
It is now training well, and the metrics show the accuracy increasing slowly during training.
So I think I probably have a correct model. My current problem is that this repo didn’t come with a camera app, so I am trying to write one right now. But I can’t seem to match the shape of the webcam input to the model’s input shape.
The model that was used looks like this:
#write your model here
# create architecture
# define parameters
n_output = 5 # number of classes in case of classification, 1 in case of regression
output_activation = 'softmax' # “softmax” or “sigmoid” in case of classification, “linear” in case of regression
model1 = Sequential()
model1.add(Dense(512, activation = 'relu'))
model1.add(Dense(256, activation = 'relu'))
When it is compiled, there is this output:
from tensorflow import keras
#from keras.optimizer_v2.adam import Adam as Adam
optimiser = keras.optimizers.Adam(learning_rate=0.01)  # `lr` is a deprecated alias for `learning_rate`
model1.compile(optimizer=optimiser, loss='categorical_crossentropy', metrics=['categorical_accuracy'])
Layer (type)                    Output Shape                Param #
conv3d (Conv3D)                 (None, 15, 100, 100, 32)    800
batch_normalization (BatchN     (None, 15, 100, 100, 32)    128
max_pooling3d (MaxPooling3D     (None, 7, 50, 50, 32)       0
dropout (Dropout)               (None, 7, 50, 50, 32)       0
conv3d_1 (Conv3D)               (None, 7, 50, 50, 32)       8224
batch_normalization_1 (Batc     (None, 7, 50, 50, 32)       128
max_pooling3d_1 (MaxPooling     (None, 3, 25, 25, 32)       0
dropout_1 (Dropout)             (None, 3, 25, 25, 32)       0
conv3d_2 (Conv3D)               (None, 3, 25, 25, 64)       2112
batch_normalization_2 (Batc     (None, 3, 25, 25, 64)       256
max_pooling3d_2 (MaxPooling     (None, 3, 12, 25, 64)       0
dropout_2 (Dropout)             (None, 3, 12, 25, 64)       0
conv3d_3 (Conv3D)               (None, 3, 12, 25, 128)      221312
batch_normalization_3 (Batc     (None, 3, 12, 25, 128)      512
max_pooling3d_3 (MaxPooling     (None, 1, 6, 25, 128)       0
dropout_3 (Dropout)             (None, 1, 6, 25, 128)       0
global_average_pooling3d (G     (None, 128)                 0
dropout_4 (Dropout)             (None, 128)                 0
dense (Dense)                   (None, 512)                 66048
batch_normalization_4 (Batc     (None, 512)                 2048
dense_1 (Dense)                 (None, 256)                 131328
batch_normalization_5 (Batc     (None, 256)                 1024
dense_2 (Dense)                 (None, 5)                   1285
Total params: 435,205
Trainable params: 433,157
Non-trainable params: 2,048
In the camera app, I am trying to match the model’s input shape, which as far as I understand is (None, 15, 100, 100, 3). Please correct me if I am wrong.
I then try to run the camera app like this:
import cv2
import numpy as np
from PIL import Image
from keras import models
#Load the saved model
model = models.load_model('model-00001.h5')
video = cv2.VideoCapture(0)
while True:
    _, frame = video.read()
    #Wrap the captured frame in a PIL image (OpenCV delivers BGR, so no colour conversion happens here)
    im = Image.fromarray(frame, 'RGB')
    #Resizing into 100x100 because the model was trained with this image size.
    im = im.resize((100,100))
    img_array = np.array(im)
    #Add two leading axes, changing dimension 100x100x3 into 1x1x100x100x3
    img_array = np.expand_dims(img_array, axis=0)
    img_array = np.expand_dims(img_array, axis=0)
    #Calling the predict method on the model to predict the gesture
    prediction = int(model.predict(img_array))
    #If prediction is 0, show the frame in gray color.
    if prediction == 0:
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    cv2.imshow("Capturing", frame)
    key = cv2.waitKey(1)
    if key == ord('q'):
        break
video.release()
cv2.destroyAllWindows()
I am getting an error on execution:
Model was constructed with shape (None, 15, 100, 100, 3) for input KerasTensor(type_spec=TensorSpec(shape=(None, 15, 100, 100, 3), dtype=tf.float32, name='conv3d_input'), name= …
… but it was called on an input with incompatible shape (None, 1, 100, 100, 3)
And here I am a bit stuck. Does anyone have an idea how to write the camera app so that it can run this particular model and return the indices of the different gestures?
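One direction that might work, as a sketch only: the error says the model wants 15 frames per input but got 1, so the app could keep the last 15 frames in a buffer and predict on the whole clip. The code below makes several assumptions that are not confirmed by the repo: that training scaled pixels by 1/255, that frames were RGB, and that the class order matches the `GESTURES` list. The names `clip_from_frames`, `run_camera` and `GESTURES` are hypothetical.

```python
from collections import deque

import numpy as np

N_FRAMES, HEIGHT, WIDTH = 15, 100, 100  # taken from the model summary
# Assumed class order; it must match the label order used during training.
GESTURES = ["Swiping Right", "Swiping Left", "Swiping Up", "Swiping Down", "No gesture"]


def clip_from_frames(frames):
    """Stack frames of shape (H, W, 3) into a (1, n_frames, H, W, 3) float batch."""
    clip = np.stack(list(frames)).astype("float32") / 255.0  # assumes pixels / 255 in training
    return np.expand_dims(clip, axis=0)  # add the batch axis only


def run_camera(model_path="model-00001.h5"):
    import cv2  # imported here so clip_from_frames works without OpenCV installed
    from tensorflow import keras

    model = keras.models.load_model(model_path)
    buffer = deque(maxlen=N_FRAMES)  # always holds the most recent frames
    video = cv2.VideoCapture(0)
    try:
        while True:
            ok, frame = video.read()
            if not ok:
                break
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV delivers BGR
            buffer.append(cv2.resize(rgb, (WIDTH, HEIGHT)))
            if len(buffer) == N_FRAMES:
                probs = model.predict(clip_from_frames(buffer), verbose=0)[0]
                index = int(np.argmax(probs))  # index of the most likely class
                cv2.putText(frame, GESTURES[index], (10, 30),
                            cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
            cv2.imshow("Capturing", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        video.release()
        cv2.destroyAllWindows()
```

Calling `run_camera()` starts the loop, and pressing `q` stops it. The model returns five softmax probabilities per clip, so `np.argmax` gives the gesture index, whereas `int(model.predict(...))` would fail on a 5-element output. Predicting on every frame may be slow, so predicting only every few frames could be a reasonable trade-off.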
… and sorry for the long post. I just hope that anyone who finds my documentation and the problems I faced will not have to work all of this out from scratch again.