Good news: someone has actually done this already for sign language in TFJS. It worked pretty well, and the approach generalizes to any time-based gesture detection, not just sign language.
I believe they used a custom-trained PoseNet (so that it returned slightly repositioned keypoints, e.g. the center of the palm instead of the wrist, which is more accurate for hand gestures), combined with the handpose model (run twice, since handpose is currently a single-hand detector), and then facemesh on top of all of that. So there are actually multiple models running concurrently in the browser to get a solid idea of what the human body is doing at any given time, as facial expression and movement are important for sign language too.
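If it helps, here's a rough sketch of what running those models concurrently per frame looks like in TFJS. It uses the stock `@tensorflow-models` packages (their project used a custom-trained PoseNet, which I obviously can't reproduce here), so treat it as an outline rather than their actual code:

```ts
import * as posenet from '@tensorflow-models/posenet';
import * as handpose from '@tensorflow-models/handpose';
import * as facemesh from '@tensorflow-models/facemesh';

// Load all three detectors once, up front.
async function loadModels() {
  const [pose, hands, face] = await Promise.all([
    posenet.load(),
    handpose.load(),
    facemesh.load(),
  ]);
  return { pose, hands, face };
}

type Models = Awaited<ReturnType<typeof loadModels>>;

// Run all three models on the same video frame concurrently.
async function estimateFrame(models: Models, video: HTMLVideoElement) {
  const [pose, hands, face] = await Promise.all([
    models.pose.estimateSinglePose(video),
    models.hands.estimateHands(video), // single-hand detector; a second pass is needed for the other hand
    models.face.estimateFaces(video),
  ]);
  return { pose, hands, face };
}
```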
Several frames of each model's output are recorded over time, and that whole window of keypoints is then fed into a higher-level Graph Convolutional Network, which predicts the gesture it saw from the sequence of prior model outputs.
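The temporal part might look something like this. I'm substituting a plain LSTM for their Graph Convolutional Network just to show the window-of-frames → prediction plumbing, and `WINDOW`, `FEATURES`, and `NUM_GESTURES` are made-up values:

```ts
import * as tf from '@tensorflow/tfjs';

const WINDOW = 32;       // frames per prediction window (assumed)
const FEATURES = 128;    // length of one frame's concatenated keypoint vector (assumed)
const NUM_GESTURES = 10; // placeholder class count

// Stand-in temporal classifier. The original work used a Graph Convolutional
// Network over the keypoint graph; an LSTM is used here purely for illustration.
function buildClassifier(): tf.LayersModel {
  const model = tf.sequential();
  model.add(tf.layers.lstm({ units: 64, inputShape: [WINDOW, FEATURES] }));
  model.add(tf.layers.dense({ units: NUM_GESTURES, activation: 'softmax' }));
  model.compile({ optimizer: 'adam', loss: 'categoricalCrossentropy' });
  return model;
}

const buffer: number[][] = [];

// Push one frame's concatenated model outputs; predict once the window is full.
function pushFrame(frameFeatures: number[], model: tf.LayersModel): tf.Tensor | null {
  buffer.push(frameFeatures);
  if (buffer.length > WINDOW) buffer.shift(); // keep a sliding window
  if (buffer.length < WINDOW) return null;
  return tf.tidy(() => {
    const input = tf.tensor3d([buffer]); // shape [1, WINDOW, FEATURES]
    return model.predict(input) as tf.Tensor; // gesture probabilities
  });
}
```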
I found the general "temporal self-similarity matrix" (TSM) concept from RepNet gave me lots of ideas about this kind of temporal problem, and I've been using it, along with some alignment ideas from CTC (Connectionist Temporal Classification), in my own experiments.
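The TSM itself is cheap to compute once you have per-frame embeddings. A minimal sketch, following the RepNet paper's description (negative squared Euclidean distance between frame embeddings, then a row-wise softmax); the temperature default here is a placeholder, and where the embeddings come from is up to you:

```ts
import * as tf from '@tensorflow/tfjs';

// Temporal self-similarity matrix: for per-frame embeddings E of shape [T, D],
// S[i][j] is the negative squared distance between frames i and j, softmaxed per row.
function selfSimilarity(embeddings: tf.Tensor2D, temperature = 1): tf.Tensor2D {
  return tf.tidy(() => {
    const sqNorms = embeddings.square().sum(1, true);        // [T, 1]
    const dots = embeddings.matMul(embeddings, false, true); // E * E^T -> [T, T]
    // ||ei - ej||^2 = ||ei||^2 + ||ej||^2 - 2 * ei.ej (broadcast across rows/cols)
    const dist2 = sqNorms.add(sqNorms.transpose()).sub(dots.mul(2));
    return tf.softmax(dist2.neg().div(temperature)) as tf.Tensor2D;
  });
}
```

Repetitive motion shows up as a periodic banding pattern in this matrix, which is what makes it such a nice input representation for temporal models.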