The main goal of batch inference is to reduce the per-image inference time when processing many images at once.
Say I have a large image (2560x1440) that I want to run through a model with an input size of 640x480. Historically, the large input image has been squished down to fit the 640x480 input size. While this works, it's not optimal for use cases that need to detect small objects, because those objects get squished into even smaller pixel representations and can become impossible to detect.
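For reference, a minimal sketch of the squish approach, assuming OpenCV for loading and resizing (the path is a placeholder):

```python
import cv2

# Placeholder path; stands in for the full-resolution 2560x1440 frame.
frame = cv2.imread("frame.jpg")  # shape: (1440, 2560, 3)

# Squish the whole frame down to the model's 640x480 input.
# Note cv2.resize takes (width, height). Small objects shrink
# ~4x in each dimension and may become undetectable.
squished = cv2.resize(frame, (640, 480), interpolation=cv2.INTER_AREA)
```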
Let's stick with the small-objects use case.
One solution to increase the likelihood of detecting these small objects is to break the 2560x1440 image into smaller pieces: 12 separate 640x480 tiles, arranged as a 4x3 grid. Each tile fits nicely into the model, but there are 12 of them!
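A minimal tiling sketch with NumPy, assuming the frame divides evenly (2560/640 = 4 columns, 1440/480 = 3 rows):

```python
import numpy as np

TILE_W, TILE_H = 640, 480

def tile_frame(frame: np.ndarray) -> np.ndarray:
    """Split a (1440, 2560, 3) frame into a (12, 480, 640, 3) stack of tiles."""
    rows = frame.shape[0] // TILE_H  # 1440 / 480 = 3
    cols = frame.shape[1] // TILE_W  # 2560 / 640 = 4
    tiles = [
        frame[r * TILE_H:(r + 1) * TILE_H, c * TILE_W:(c + 1) * TILE_W]
        for r in range(rows)
        for c in range(cols)
    ]
    return np.stack(tiles)

tiles = tile_frame(np.zeros((1440, 2560, 3), dtype=np.uint8))  # dummy frame
```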
If we were to just place the inference code in a for loop and run it once per tile, it would take 12 times as long as the original squished example and certainly cause latency issues.
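The naive version looks something like this (`run_inference` is a hypothetical stand-in for whichever framework's forward pass):

```python
# One model invocation per tile: ~12x the latency of a single call.
results = [run_inference(tile) for tile in tiles]
```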
Enter batch inference.
Instead of looping over the tiles and running inference on each one individually, we can resize the input tensor to accept a batch of 12 images and save time by executing the inference call only once.
This^^ is how it should work. I've seen it work first-hand in this PyTorch example.
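For illustration, a minimal sketch of what the batched call looks like in PyTorch (the model here is a stand-in, not the actual example):

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for a real detector with a 640x480 input size.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
model.eval()

# tiles: the (12, 480, 640, 3) uint8 stack from the tiling step.
tiles = np.zeros((12, 480, 640, 3), dtype=np.uint8)
with torch.no_grad():
    # Convert to a (12, 3, 480, 640) float batch in [0, 1].
    batch = torch.from_numpy(tiles).permute(0, 3, 1, 2).float() / 255.0
    outputs = model(batch)  # one forward pass covers all 12 tiles
```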
When I go to do the same thing with TFLite, I am greeted with an inference time that scales linearly with the number of images in the batch. There are no savings at all here.
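Roughly what my TFLite attempt looks like; the model path is a placeholder, and `resize_tensor_input` is the documented way to change the batch dimension:

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path

# Grow the input from (1, 480, 640, 3) to a batch of 12, then reallocate.
input_index = interpreter.get_input_details()[0]["index"]
interpreter.resize_tensor_input(input_index, [12, 480, 640, 3])
interpreter.allocate_tensors()

batch = np.zeros((12, 480, 640, 3), dtype=np.float32)  # stand-in for real tiles
interpreter.set_tensor(input_index, batch)
interpreter.invoke()  # for me, this scales ~linearly with batch size

output_index = interpreter.get_output_details()[0]["index"]
outputs = interpreter.get_tensor(output_index)
```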
So my question is: has anyone been able to implement batch processing using the TFLite API and actually witness time savings during inference?