Good news: someone has actually done this already for sign language in TFJS. It worked pretty well, and the approach generalizes to any time-based gesture detection, not just sign language.
I believe they used a custom-trained PoseNet (so that it returned slightly repositioned keypoints, e.g. the center of the palm instead of the wrist, which is more accurate for hand gestures), combined with the handpose model (run twice, since handpose is currently a single-hand detector), and then facemesh on top of all of that. So there are actually multiple models running concurrently in the browser to get a solid idea of what the human body is doing at any given time, as facial expression and movement are important for sign language too.
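If it helps, here's a rough sketch of what running those models concurrently per frame looks like in TFJS. It uses the stock `@tensorflow-models` packages (their project used a custom-trained PoseNet, which I obviously can't reproduce here), so treat it as an outline rather than their actual code:

```ts
import * as posenet from '@tensorflow-models/posenet';
import * as handpose from '@tensorflow-models/handpose';
import * as facemesh from '@tensorflow-models/facemesh';

// Load all three detectors once, up front.
async function loadModels() {
  const [pose, hands, face] = await Promise.all([
    posenet.load(),
    handpose.load(),
    facemesh.load(),
  ]);
  return { pose, hands, face };
}

type Models = Awaited<ReturnType<typeof loadModels>>;

// Run all three models on the same video frame concurrently.
async function estimateFrame(models: Models, video: HTMLVideoElement) {
  const [pose, hands, face] = await Promise.all([
    models.pose.estimateSinglePose(video),
    models.hands.estimateHands(video), // single-hand detector; a second pass is needed for the other hand
    models.face.estimateFaces(video),
  ]);
  return { pose, hands, face };
}
```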
Several frames of each model's output are recorded over time, and that whole window of keypoints is then fed into a higher-level Graph Convolutional Network, which predicts the gesture it saw from the sequence of prior model outputs.
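The temporal part might look something like this. I'm substituting a plain LSTM for their Graph Convolutional Network just to show the window-of-frames → prediction plumbing, and `WINDOW`, `FEATURES`, and `NUM_GESTURES` are made-up values:

```ts
import * as tf from '@tensorflow/tfjs';

const WINDOW = 32;       // frames per prediction window (assumed)
const FEATURES = 128;    // length of one frame's concatenated keypoint vector (assumed)
const NUM_GESTURES = 10; // placeholder class count

// Stand-in temporal classifier. The original work used a Graph Convolutional
// Network over the keypoint graph; an LSTM is used here purely for illustration.
function buildClassifier(): tf.LayersModel {
  const model = tf.sequential();
  model.add(tf.layers.lstm({ units: 64, inputShape: [WINDOW, FEATURES] }));
  model.add(tf.layers.dense({ units: NUM_GESTURES, activation: 'softmax' }));
  model.compile({ optimizer: 'adam', loss: 'categoricalCrossentropy' });
  return model;
}

const buffer: number[][] = [];

// Push one frame's concatenated model outputs; predict once the window is full.
function pushFrame(frameFeatures: number[], model: tf.LayersModel): tf.Tensor | null {
  buffer.push(frameFeatures);
  if (buffer.length > WINDOW) buffer.shift(); // keep a sliding window
  if (buffer.length < WINDOW) return null;
  return tf.tidy(() => {
    const input = tf.tensor3d([buffer]); // shape [1, WINDOW, FEATURES]
    return model.predict(input) as tf.Tensor; // gesture probabilities
  });
}
```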
I found the general "temporal self-similarity matrix" (TSM) concept from RepNet gave me lots of ideas about this kind of temporal problem, and I've been using it, along with some alignment ideas from CTC (Connectionist Temporal Classification), in my own experiments.
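The TSM itself is cheap to compute once you have per-frame embeddings. A minimal sketch, following the RepNet paper's description (negative squared Euclidean distance between frame embeddings, then a row-wise softmax); the temperature default here is a placeholder, and where the embeddings come from is up to you:

```ts
import * as tf from '@tensorflow/tfjs';

// Temporal self-similarity matrix: for per-frame embeddings E of shape [T, D],
// S[i][j] is the negative squared distance between frames i and j, softmaxed per row.
function selfSimilarity(embeddings: tf.Tensor2D, temperature = 1): tf.Tensor2D {
  return tf.tidy(() => {
    const sqNorms = embeddings.square().sum(1, true);        // [T, 1]
    const dots = embeddings.matMul(embeddings, false, true); // E * E^T -> [T, T]
    // ||ei - ej||^2 = ||ei||^2 + ||ej||^2 - 2 * ei.ej (broadcast across rows/cols)
    const dist2 = sqNorms.add(sqNorms.transpose()).sub(dots.mul(2));
    return tf.softmax(dist2.neg().div(temperature)) as tf.Tensor2D;
  });
}
```

Repetitive motion shows up as a periodic banding pattern in this matrix, which is what makes it such a nice input representation for temporal models.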