In a webcam stream, model has a lag problem

I am working on an app to teach kids animal names.

The way it works is that there is a camera stream that looks at what the kid is pointing at, and the app says the animal to them (speaking the name as well as the animal's sound). For example, if a kid points the phone at a dog, the app says “dog, woof”.

The model that I am using has a latency of around 2 seconds per prediction. I can’t show the code, but here is how it (kind of) works:

1. Process the image: resizing, normalization.
2. Run the prediction.
3. Speak the prediction.
4. If the prediction came quickly, wait (3 − prediction latency) seconds.
5. Get the next available frame, not the second frame in the sequence.
6. Repeat.
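Roughly, those steps could be sketched like this in Python (the camera, model, and speak calls are stand-ins of my own naming, not the real app code):

```python
import time

FRAME_BUDGET = 3.0  # seconds between spoken predictions

def remaining_wait(elapsed, budget=FRAME_BUDGET):
    # Step 4: if the prediction came back quickly, sleep out the rest
    # of the budget so predictions are spaced about `budget` apart.
    return max(0.0, budget - elapsed)

def run_once(camera, model, speak):
    # Step 5: take the newest available frame, not the next queued one.
    frame = camera.latest_frame()          # stand-in camera API
    start = time.monotonic()
    image = model.preprocess(frame)        # step 1: resize + normalize
    label = model.predict(image)           # step 2: inference
    speak(label)                           # step 3: say name + sound
    time.sleep(remaining_wait(time.monotonic() - start))
```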

A little problem I found: I point the camera at a dog, it says the prediction correctly, then I move the camera. The problem is that it says “dog” again before saying the correct prediction for the new scene.

Why does this happen, and how can I fix it?

Thanks!

Hi,

When it says dog again, did the model predict dog on a random image, or is it a delay in your pipeline?

Are you using TFLite?
It would help to give more information on what you're using.
For image classification, 2 seconds per prediction is usually too much. Which model are you using? If it's custom, how did you customize it?


Wait, sorry. 2 seconds isn't the prediction time; it's between 200-500 ms, but there is extra time needed to say the prediction.

I’m using a TFLite model that uses the YOLOv5s architecture. There are around 20 animals that I am detecting, and I found that YOLO worked a lot better than SSD MobileNet or simple image classification with EfficientNet.

Not sure what you mean by random image, but no, it's the next available frame.
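As an aside, one common way to guarantee “next available frame” semantics is a one-slot buffer that newer frames overwrite, so the predictor never picks up a stale queued frame. A minimal sketch (the names are mine, not the app's):

```python
from collections import deque

# One-slot buffer: appending when full silently drops the old frame,
# so whatever is in the buffer is always the newest frame seen.
frame_buffer = deque(maxlen=1)

def on_frame(frame):
    # Called for every camera frame; overwrites any unconsumed older frame.
    frame_buffer.append(frame)

def take_latest():
    # Returns the newest frame, or None if nothing arrived since last take.
    return frame_buffer.pop() if frame_buffer else None
```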

Here’s a more detailed pipeline (pseudocode):

predLoaded = true
camera.onNextFrame = predict

async predict(image) {
   if not predLoaded then return;   // drop frames while a prediction is in flight
   predLoaded = false               // mark busy before any await

   delay = 3000ms
   image = await preprocess(image)
   prediction = await model.predict(image)
   prediction = await nms(prediction)
   speak(prediction)

   combinedTime = time(preprocess) + time(prediction) + time(nms)

   if combinedTime < delay {
      await wait(delay - combinedTime)
   }

   predLoaded = true
}
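For what it's worth, that predLoaded guard can be written as runnable Python with asyncio. One detail matters: the flag has to be flipped to busy before the first await, otherwise several queued frames can all pass the check and start overlapping predictions. A sketch with stubbed model and speech calls (the real TFLite/TTS APIs are assumptions):

```python
import asyncio
import time

busy = False  # guard: skip incoming frames while a prediction is in flight

async def predict(image, model, speak, budget=3.0):
    global busy
    if busy:
        return            # drop this frame; a prediction is still running
    busy = True           # flip the flag BEFORE any await, or frames overlap
    try:
        start = time.monotonic()
        pred = await model.predict(image)   # stand-in model API
        await speak(pred)                   # stand-in text-to-speech call
        elapsed = time.monotonic() - start
        if elapsed < budget:
            # Prediction came back quickly: wait out the rest of the budget.
            await asyncio.sleep(budget - elapsed)
    finally:
        busy = False      # ready for the next available frame
```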

Sorry if it is a little confusing

Actually, I just checked: the combined prediction time, including saying the animal name, is around 5 seconds, but on extremely fast devices it can get below the 3 s threshold.