MoveNet / p5.js video crop size

The MoveNet model is not (yet) available in ml5.js; however, MoveNet can still be used directly.

When using a video stream on mobile, detection is very slow unless the input size is reduced. Reducing the input size works great in the demo, but when I reduce the size of the input video in p5.js, only the reduced input region seems to be used, almost as if the input video were cropped.

Looking at detector.js, there is an initCropRegion function that initially sets the region. I wonder if this is using the wrong input size and therefore feeding a cropped region to the model?

Example here: p5.js Web Editor

(Try moving your nose past half way…)

It seems you are working with two different sizes in your code: one is 1280x720 and the other is 640x480. In order to render to the correct position on the larger canvas you will need to scale all resulting points.

Assuming the input video is 640x480 from getUserMedia, you can take the coordinates that come back and convert them to percentage offsets instead. For example, if the nose point is at 320,240 it can be represented as 0.5,0.5, which you then simply multiply by the new width/height, e.g. 1280x720, to get an enlarged coordinate of 640,360 - and it should then overlap correctly on your new video size.
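For instance, a minimal sketch of that conversion (scalePoint here is just an illustrative helper name, not part of p5 or the pose-detection API):

function scalePoint(kp, videoW, videoH, canvasW, canvasH) {
  // Convert the keypoint to percentage offsets of the video frame...
  const px = kp.x / videoW;  // e.g. 320 / 640 = 0.5
  const py = kp.y / videoH;  // e.g. 240 / 480 = 0.5
  // ...then scale those percentages up to the target canvas size.
  return { x: px * canvasW, y: py * canvasH };
}

// Usage: a nose keypoint from a 640x480 capture drawn on a 1280x720 canvas.
let nose = scalePoint({ x: 320, y: 240 }, 640, 480, 1280, 720);
circle(nose.x, nose.y, 16); // draws at 640, 360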

Yep, exactly. The first set of variables is my canvas size and the second is the video size, and I am doing this conversion. If you set the video input size to 1280x720 (or whatever your camera is), the code works. The scaling also works with a smaller size, but only in the clipped zone…

When the video input size is set to something smaller, e.g. 360x180, only this portion of the video is used as input. It is as if the clipped portion has been used rather than the scaled input.

The video DOM element has the correct size.

If I change your videoReady() function to sample the actual video width/height and assign those to your global wv/hv vars, it works for me.

async function videoReady() {
  console.log("Capture loaded... or has it?");
  console.log("Capture: " + video.width + ", " + video.height);
  console.log("Video element: " + video.elt.videoWidth + ", " + video.elt.videoHeight);
  
  wv = video.elt.videoWidth;
  hv = video.elt.videoHeight;
  
  console.log("video ready");
  await getPoses();
}

In other news, though, this p5 code is running very slowly, which is odd as I usually get much higher FPS on the old laptop I am using right now (e.g. 45 FPS, which I just confirmed with the original MoveNet demo code). There may be some code here that is not efficient, which I'd advise checking for - however, the points are indeed rendered at the correct position now at least.

Ah, I found why your code was running slow: you were using setTimeout with a 100ms delay. Be sure to use requestAnimationFrame for production (when not debugging) to get buttery smooth performance, like this:

async function getPoses() {
  if (detector) {
    poses = await detector.estimatePoses(video.elt);
  }

  //console.log(poses);
  requestAnimationFrame(getPoses);
}

Hi, thanks for the clarification. I was wondering, isn't there a built-in library for displaying the skeleton overlay?

Another question: I tried to run this code in a mobile browser on a smartphone with the front camera active, but I didn't seem to get any detection. I was under the impression that detection was independent of whether the front or back camera is used (see the note after the sketch below).

Sample p5.js sketch snippet:

let detector;
let poses;
let video;
let wv=1280; //the default input size of the camera
let hv=720;
let w=640; //the size of the canvas
let h=480;
let f=2; //the amount we want to scale down the video input
var capture ;

async function init() {
  const detectorConfig = {
    modelType: poseDetection.movenet.modelType.SINGLEPOSE_LIGHTNING,
  };
  detector = await poseDetection.createDetector(
    poseDetection.SupportedModels.MoveNet, // MoveNet, matching the modelType above
    detectorConfig
  );
  //detector.initCropRegion(w,h); //not callable

}

async function videoReady() {
  console.log("Capture loaded... or has it?");
  console.log("Capture: " + video.width + ", " + video.height);
  console.log("Video element: " + video.elt.videoWidth + ", " + video.elt.videoHeight);
  
  wv = video.elt.videoWidth;
  hv = video.elt.videoHeight;
  
  //console.log("video ready");
  await getPoses();
}

function setup() {
  init(); // start loading the detector; getPoses() checks detector before using it
  createCanvas(displayWidth, displayHeight);
  var constraints = {
    audio: false,
    video: {
      facingMode: {
        exact: "environment"
      }
    }    
    //video: {
      //facingMode: "user"
    //} 
  };
  capture = createCapture(constraints, videoReady);
  video = capture; // the detection code below reads the capture from the video variable

  capture.hide();
}


async function getPoses() {
  if (detector) {
    poses = await detector.estimatePoses(video.elt);
  }

  console.log(poses);
  requestAnimationFrame(getPoses);
}



function draw() {
  image(capture, 0, 0);
}
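(Regarding the front-camera question above: the constraints in this sketch use facingMode exact: "environment", which requests the back camera and is rejected outright on a device where only the front camera matches. A minimal sketch of constraints that prefer the front camera instead - standard getUserMedia options, nothing MoveNet-specific:)

// Prefer the front-facing camera; "ideal" falls back gracefully
// where "exact" would fail with an OverconstrainedError.
var frontConstraints = {
  audio: false,
  video: {
    facingMode: { ideal: "user" }
  }
};
capture = createCapture(frontConstraints, videoReady);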

Thanks Jason, this makes a big difference to the speed (I realised setTimeout was not ideal :grimacing:).

The problem still persists after assigning the width/height in videoReady. I think it just appears to work for you because the capture size is equal to or greater than the canvas (e.g. 1280/2). Try it with a smaller capture size, e.g. 360x180 or f=4, and it doesn't work.

I have tested on Chrome, Safari, Firefox.

Try moving your head to the right half of the screen and you'll see what I mean…

Example here https://imgur.com/a/ogaSb92 with 360x180 capture size.

The machine learning happens on the raw data of the video frame from the camera, I believe (which on my laptop defaults to 640x480), not on the CSS-resized video, which is just cosmetic. When you grab the data from the video element itself, I think it gives you the camera data (640x480). Therefore the coordinates coming back will be in that coordinate space, and you need to assume that space for any transformation you do afterwards when drawing circles on the canvas. Thus you should have some math that figures out the percentage x,y based on the true video width/height, which you can then multiply by the canvas width/height to find the new coordinates to render at.

Also, it seems p5 does some rendering magic of its own - it draws the video frame to the canvas and then the dots on top of that, which is not terribly efficient, as you are sampling the video frame twice. You can instead absolutely position the canvas on top of the video element and draw only the circles over the already-playing video, saving you from pushing twice the video pixel data around and leaving you to worry only about rendering dots to the canvas based on its rendered size.

f does not need to exist.

This is how I would do it:

let detector;
let poses;
let video;
let wv=1280; //the default input size of the camera
let hv=720;
let w=640; //the size of the canvas
let h=480;

async function init() {
  const detectorConfig = {
    modelType: poseDetection.movenet.modelType.SINGLEPOSE_LIGHTNING,
  };
  detector = await poseDetection.createDetector(
    poseDetection.SupportedModels.MoveNet, // MoveNet, matching the modelType above
    detectorConfig
  );
}

async function videoReady() {
  console.log("Capture loaded... or has it?");
  console.log("Capture: " + video.width + ", " + video.height);
  console.log("Video element: " + video.elt.videoWidth + ", " +   
             video.elt.videoHeight);
  
  wv = video.elt.videoWidth;
  hv = video.elt.videoHeight;
  
  console.log("video ready");
  await getPoses();
}

async function setup() {
  video = createCapture(VIDEO, videoReady);
  let cnv = createCanvas(w, h);
  cnv.style('position', 'absolute');
  await init();
}

async function getPoses() {
  if (detector) {
    poses = await detector.estimatePoses(video.elt);
  }

  //console.log(poses);
  requestAnimationFrame(getPoses);
}

function draw() {
  if (poses && poses.length > 0) {
    clear();
    text('nose x: ' + poses[0].keypoints[0].x.toFixed(2) + ' ' + (100 * poses[0].keypoints[0].x / video.width).toFixed(1) + '%', 10,10);
    text('nose y: ' + poses[0].keypoints[0].y.toFixed(2) + ' ' + (100 * poses[0].keypoints[0].y / video.height).toFixed(1) + '%', 10,20);
    text('nose s: ' + poses[0].keypoints[0].score.toFixed(2), 10,30);
    for (let kp of poses[0].keypoints) {
      let { x, y, score } = kp;
      x = x/wv;
      y = y/hv;
      
      if (score > 0.5) {
        fill(255);
        stroke(0);
        strokeWeight(4);
        circle(x*w,y*h, 16);
      }
    }
  }
}

Thanks Jason. My understanding was that on a mobile device, where the camera is far higher resolution than 480p, the video needs to be scaled down for it to be performant (as is possible in the settings of the MoveNet demo).

I was trying to resize the video using video.size, which worked previously with ml5.js and PoseNet; I was under the impression this reduced the input frame size that the model uses. With my PoseNet projects on mobile, if the video size was not reduced they ran at less than 5 FPS.

With your approach, how would I go about reducing the input size (e.g. for a mobile device)?

Thanks
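(For reference, one way to ask the camera itself for a smaller frame - independent of any model-side resizing - is via getUserMedia width/height constraints, which p5's createCapture accepts; a minimal sketch, with 320x240 purely as an example:)

// Request a lower-resolution capture up front so fewer pixels are
// pushed around per frame; "ideal" lets the browser pick the closest
// resolution the camera actually supports.
var smallConstraints = {
  audio: false,
  video: {
    width: { ideal: 320 },
    height: { ideal: 240 }
  }
};
video = createCapture(smallConstraints, videoReady);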

If you are using our pre-made models like the one above, we do all this resizing for you, as the input must fit the model's supported input Tensor sizes correctly to use the ML model behind the scenes. To make these pre-made models easy to use, we handle all of that.

Of course, if you are using a raw TFJS model yourself (loading in the model.json and *.bin files manually) rather than our nice JS wrapper classes, then you would need to do this resizing and conversion to Tensors yourself to perform an inference - but with the above there is no need, as we handle it all for you.

To resize images I recommend checking out the TensorFlow.js API, which has functions for this purpose: TensorFlow.js API. Note the differences between:

tf.image.resizeBilinear

tf.image.resizeNearestNeighbor

As this can affect your image data. See this image for a quick visual comparison of what the different types of resize do to an image:

[Image: visual comparison of how the different resize methods affect an image]
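For completeness, a rough sketch of what that manual resize might look like when driving a raw model yourself (the video.elt frame source and the 192x192 target size are assumptions for illustration; the pre-made wrapper above needs none of this):

// Grab the current video frame and resize it before inference
// (only needed when feeding a raw TFJS model directly).
const frame = tf.browser.fromPixels(video.elt);                   // [height, width, 3]
const smooth = tf.image.resizeBilinear(frame, [192, 192]);        // interpolates between pixels
const blocky = tf.image.resizeNearestNeighbor(frame, [192, 192]); // copies the nearest pixel

// A raw model would then typically expect a batched tensor, e.g.:
// const input = smooth.expandDims(0);

// Free GPU memory once done with the tensors.
frame.dispose();
smooth.dispose();
blocky.dispose();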

Gant’s new TensorFlow.js book goes into a lot more detail on these methods if you are curious to learn more: Learning TensorFlow.js [Book]
