Model Maker TF Lite Slow Inference

I’m seeing slow object detection inference times on models trained using efficientdet_lite0.

I’m using the TF Lite Model Maker example notebook for object detection with a custom dataset and am seeing inference times of 1.5-2 seconds on my MacBook Pro (single thread, no GPU). I can bring this down to around 0.75s with num_threads set to 4, but this is still much greater than the 37ms latency the notebook mentions. I thought it could be caused by overhead loading the model, but subsequent calls to the interpreter.invoke method yield similar performance. My perf measurement is very basic, using time.perf_counter() either side of the invoke call. I’m quite new to all this so I feel like I’m doing something obviously wrong? Or am I missing something with post-training quantization?
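For reference, the thread count mentioned above is set when the interpreter is constructed. This is a minimal sketch, assuming the exported model is loaded with tf.lite.Interpreter; the helper name build_interpreter and the model path are hypothetical and not taken from the notebook.

```python
def build_interpreter(model_path, num_threads=4):
    """Create a TF Lite interpreter that runs on the given number of CPU threads.

    model_path is a hypothetical path to the exported .tflite file.
    """
    # Imported inside the function so the helper can be defined
    # even on a machine where TensorFlow is not installed yet.
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(
        model_path=model_path,
        num_threads=num_threads,
    )
    # Tensors must be allocated before set_tensor() / invoke() are called.
    interpreter.allocate_tensors()
    return interpreter
```

Usage would look like interpreter = build_interpreter("model.tflite", num_threads=4), where "model.tflite" stands in for wherever the notebook exported the model.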

In Google colab I’m seeing similar performance with the default notebook using the dataset provided in the notebook.

Hi Nick,

In the Colab test you did, is the runtime using a GPU? This can make a big difference.

I left the default settings for the notebook, which I believe has the runtime set to GPU, but I’ll double check that setting and rerun anyway. Even so, I was expecting latency running the TFLite model on my MacBook (quad core i7, 16 GB RAM) with 4 threads to be comparable to the benchmarks listed in the notebook for Pixel 4?

I’m going to try building TF from source rather than installing via pip, as I know that should improve performance. I believe the benchmarks in the notebook are for the integer quantized model, so I’ll also try that.

For the non-integer quantized model, running on TF installed via pip with 4 threads, does 0.6-0.75s per inference sound reasonable or is it likely I’ve messed something up along the way?

FYI the only bit of the notebook I changed is the detect_objects function:

import time
start = time.perf_counter()
interpreter.invoke()
end = time.perf_counter()
print(end - start)
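A single timed call can be noisy, so a slightly more robust version of the measurement above discards a few warm-up calls and then reports statistics over several runs. This is only a sketch: time_calls and summarize are hypothetical helper names, and the warm-up and run counts are arbitrary.

```python
import statistics
import time


def time_calls(fn, warmup=3, runs=20):
    """Return per-call latencies in seconds for fn(), measured after warm-up calls."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Convert a list of latencies in seconds into millisecond statistics."""
    return {
        "mean_ms": 1000 * statistics.mean(latencies),
        "median_ms": 1000 * statistics.median(latencies),
        "min_ms": 1000 * min(latencies),
    }

# In the notebook this would be called as:
#   print(summarize(time_calls(interpreter.invoke)))
```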

Edit: Tried with integer quantized model in colab with similar (slightly worse) results. I also tried running a separate call to invoke before timing in case there was some sort of model loading overhead but that didn’t change anything. Also using that model I get an error IndexError: index 25 is out of bounds for axis 0 with size 25 in the detect_objects method because the output_tensor[3] (which I assume is num_detections) has more items than the score output tensor. Easily fixed but just something I noted.
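One way to avoid the IndexError described above is to clamp the loop bound to the shortest output array instead of trusting the reported count. The sketch below is an assumption about how the results are assembled, not the notebook's actual code: collect_detections, its argument names, and the result keys are hypothetical.

```python
def collect_detections(boxes, classes, scores, count, threshold=0.5):
    """Build a list of detections without indexing past any output array.

    count is the number of detections reported by the model; it is clamped
    to the length of the shortest array so a mismatch cannot raise IndexError.
    """
    n = min(int(count), len(boxes), len(classes), len(scores))
    results = []
    for i in range(n):
        # Keep only detections at or above the score threshold.
        if scores[i] >= threshold:
            results.append({
                "bounding_box": boxes[i],
                "class_id": classes[i],
                "score": scores[i],
            })
    return results
```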

TFLite doesn’t build with OpenCL GPU support on macOS, and generally the standard TF runtime is better on desktop.
If you still need to use TFLite for testing you could try with the CPU XNNPACK delegate:

Edit:
See more at

For GPU you could subscribe also to:

Thanks! My MacBook doesn’t have a GPU so I’m focusing on performance on 4 CPUs like the EfficientDet benchmarks - I’ll try with the recommendations around optimising though the perf would need to improve by ~10x to match. Is it realistic to expect comparable performance to those benchmarks in the notebook?

My end goal will be to run on a device like a Raspberry Pi 4, Jetson Nano or something similar (min 4 CPUs, possibly GPU will be available).

Try with XNNPACK delegate.

You can also plan to use a Coral device/accelerator. See the benchmarks at:

Yeah I’ll check that out - I was considering getting the Coral Dev Board Mini.

I didn’t realise the Pixel 4 had an Edge TPU, which makes sense now as the benchmarks on the Coral website match up with the notebook.

Might be worth adding that to the table footnote as to me (in my naivety) it seemed to imply that latency was achievable on CPU only. Thanks for all the help!!

If you still need a Raspberry Pi you can also check something like:

Might be worth adding that to the table footnote as to me (in my naivety) it seemed to imply that latency was achievable on CPU only.

I think that it is achievable on CPU, as you can see in detail in the benchmark section of:

https://tfhub.dev/tensorflow/lite-model/efficientdet/lite0/detection/metadata/1

I think your problem on macOS is that the x86 experience is generally optimized for the standard TF runtime. You can still try to get good performance from TFLite with XNNPACK, provided the ops your model needs are already covered by the XNNPACK delegate, as it has op fusion and SSE/AVX x86 kernels.

Thanks for those links. I couldn’t figure out how to pass XNNPACK in the list of delegates in Python, but I’ve rebuilt TF with tflite_with_xnnpack=true. With float16 quantization I’m able to get comparable performance. Int8 quantization didn’t yield any performance increase in my basic test. Accuracy did seem to suffer, though that is likely because the models were trained on a very small training set. I think I’ve got enough to go on now though - thank you for your help!

Yes, this is the recommended way for desktop.

See the current limits and required flags in https://github.com/google/XNNPACK/issues/999

You can comment there for additional technical questions related to int8 support.
