TF.js for robust object tracking in the wild?

Hi there,

I am almost entirely new to TF and to ML in general. I am a software developer and mostly build web applications. As an improvement to one project, it would be great to be able to detect and track first a passport and later a face on the user's webcam or smartphone camera. This is part of an identification process, and it would help the system take pictures at the right moment with optimal framing.

I have already played a little with the Teachable Machine demo and got the impression that this kind of approach can run in real time, even in the browser, on a normal end-user computer or smartphone.
I know that Teachable Machine does image classification rather than tracking, and I am not sure which approach would be best.
Basically, I want a person to show the front side of their passport to the webcam and take a photo, then show the rear side and take a photo. After that, they show their own face and take a photo, followed by another photo of the face with some parallax, so that a photo of a face can be distinguished from a real 3D object.
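
To make the flow above more concrete, here is a rough sketch of how I imagine the capture sequence, written as a small state machine. It deliberately contains no TF.js calls, because I do not know yet which model or API would produce the detections. The `Detection` shape, the step names, the `nextStep` function, and the 0.8 threshold are all placeholders I made up for illustration.

```typescript
// Hypothetical capture steps, in the order described above.
type Step = "passport-front" | "passport-back" | "face" | "face-parallax" | "done";

const ORDER: Step[] = ["passport-front", "passport-back", "face", "face-parallax", "done"];

// Hypothetical per-frame result from whatever model ends up being used.
interface Detection {
  label: "passport-front" | "passport-back" | "face" | "none";
  score: number; // confidence between 0 and 1
}

// Which label the detector must report for a given step.
function expectedLabel(step: Step): Detection["label"] {
  if (step === "face-parallax") return "face"; // second face shot, different angle
  if (step === "done") return "none";
  return step;
}

// Move on (i.e. "take a photo") only when the expected object is seen
// with enough confidence; otherwise stay on the current step.
function nextStep(step: Step, d: Detection, threshold = 0.8): Step {
  if (step === "done") return step;
  if (d.label === expectedLabel(step) && d.score >= threshold) {
    return ORDER[ORDER.indexOf(step) + 1];
  }
  return step;
}
```

In a real page, `nextStep` would be called once per video frame with the model output, and a photo would be taken whenever the returned step differs from the current one. The parallax check itself (comparing the two face photos) is not covered by this sketch.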

Regarding this idea, I have several questions that I would like to ask somebody who knows TF.js well:

  1. Which method should I go for: object tracking or image classification? (And what kind of training data would be required?)
  2. What footprint, in terms of download size for the end user, should I expect to end up with?
  3. What technologies, processes, and tools do I have to learn to realize this?