Model Maker TF Lite Slow Inference

Nick_J · September 2, 2021, 5:24pm

I’m seeing slow object detection inference times on models trained using the efficientdet_lite0 architecture.

I’m using the TF Lite Model Maker example notebook for object detection with a custom dataset and am seeing inference times of 1.5-2 seconds on my MacBook Pro (single thread, no GPU). I can bring this down to around 0.75s with num_threads set to 4, but this is still much greater than the 37ms latency the notebook mentions. I thought it could be caused by overhead from loading the model, but subsequent calls to the interpreter.invoke method yield similar performance. My perf measurement is very basic, using time.perf_counter() either side of the invoke call. I’m quite new to all this, so I feel like I’m doing something obviously wrong. Or am I missing something with post-training quantization?

In Google Colab I’m seeing similar performance with the default notebook using the dataset provided in the notebook.
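A single `perf_counter()` pair around one `invoke()` call can be noisy, and the first call may include one-off setup cost. A minimal sketch of a slightly more robust measurement follows, assuming a hypothetical helper named `benchmark` (not part of TensorFlow) that accepts any zero-argument callable, such as `interpreter.invoke`:

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Time a zero-argument callable; return per-call latencies in seconds.

    `benchmark` is a hypothetical helper, not a TensorFlow API.
    """
    # Warm-up calls are discarded so one-off setup cost is not measured.
    for _ in range(warmup):
        fn()

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return (median, minimum, maximum) latency in milliseconds."""
    ms = [t * 1000.0 for t in latencies]
    return statistics.median(ms), min(ms), max(ms)


if __name__ == "__main__":
    # Stand-in workload; in the notebook this would be `interpreter.invoke`.
    median_ms, min_ms, max_ms = summarize(benchmark(lambda: sum(range(100_000))))
    print(f"median {median_ms:.2f} ms (min {min_ms:.2f}, max {max_ms:.2f})")
```

Reporting the median over several runs makes it easier to compare thread counts or quantization settings without a single slow call skewing the result.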

lgusm · September 4, 2021, 4:39pm

Hi Nick,

In the Colab test you did, is the runtime using a GPU? This can make a big difference.

Nick_J · September 4, 2021, 6:11pm

I left the default settings for the notebook, which I believe has the runtime set to GPU, but I’ll double-check that setting and rerun anyway. Even so, I was expecting the TFLite model running on my MacBook (quad-core i7, 16 GB RAM) with 4 threads to show latency comparable to the Pixel 4 benchmarks listed in the notebook.

I’m going to try building TF from source rather than installing via pip, as I know that should improve performance. I believe the benchmarks in the notebook are for the integer quantized model, so I’ll also try that.

For the non-integer quantized model, running on TF installed via pip with 4 threads, does 0.6-0.75s per inference sound reasonable or is it likely I’ve messed something up along the way?

FYI, the only bit of the notebook I changed is the detect_objects function:

import time
start = time.perf_counter()
interpreter.invoke()
end = time.perf_counter()
print(end - start)

Edit: Tried with the integer quantized model in Colab with similar (slightly worse) results. I also tried running a separate call to invoke before timing in case there was some sort of model loading overhead, but that didn’t change anything. Also, using that model I get an error, IndexError: index 25 is out of bounds for axis 0 with size 25, in the detect_objects method because output_tensor[3] (which I assume is num_detections) reports more items than the score output tensor contains. Easily fixed, but just something I noted.
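The `IndexError` above is consistent with the reported detection count exceeding the length of the score array. A minimal sketch of a defensive loop follows, assuming the outputs have already been read into plain sequences. `collect_detections` and its parameter names are hypothetical, and the notebook's own `detect_objects` may be structured differently:

```python
def collect_detections(boxes, classes, scores, count, threshold=0.5):
    """Return detections above `threshold` without indexing past any output.

    Hypothetical helper: `count` is the value the model reports as the number
    of detections, which may exceed the length of the other outputs.
    """
    # Clamp to the shortest output so a too-large count cannot raise IndexError.
    limit = min(int(count), len(boxes), len(classes), len(scores))

    results = []
    for i in range(limit):
        if scores[i] >= threshold:
            results.append({
                "bounding_box": boxes[i],
                "class_id": classes[i],
                "score": scores[i],
            })
    return results
```

Clamping to the shortest output treats the reported count as an upper bound rather than a guarantee, which sidesteps the mismatch without changing results when the count is correct.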

Bhack · September 5, 2021, 1:37am

TFLite doesn’t build with OpenCL GPU support on macOS, and generally the standard TF runtime is better on desktop.
If you still need to use TFLite for testing, you could try the CPU XNNPACK delegate:

Edit: See more at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/xnnpack/README.md

# XNNPACK backend for TensorFlow Lite

XNNPACK is a highly optimized library of neural network inference operators for
ARM, x86, and WebAssembly architectures in Android, iOS, Windows, Linux, macOS,
and Emscripten environments. This document describes how to use the XNNPACK
library as an inference engine for TensorFlow Lite.

## Using XNNPACK engine with TensorFlow Lite interpreter

XNNPACK integrates with TensorFlow Lite interpreter through the delegation
mechanism. TensorFlow Lite supports several methods to enable XNNPACK
for floating-point inference.

### Enable XNNPACK via Java API on Android (recommended on Android)

Pre-built [nightly TensorFlow Lite binaries for Android](https://www.tensorflow.org/lite/guide/android#use_the_tensorflow_lite_aar_from_mavencentral)
include XNNPACK, albeit it is disabled by default. Use the `setUseXNNPACK`
method in `Interpreter.Options` class to enable it:

This file has been truncated.

Bhack · September 5, 2021, 2:26pm

For GPU, you could also subscribe to:

github.com/tensorflow/tensorflow

Error when building tflite 2.3.0-rc0 metal delegate on macOS

opened 06:39PM - 02 Jul 20 UTC

closed 10:32AM - 10 Nov 21 UTC

nkjassal

stat:awaiting response type:build/install stalled comp:lite TF 2.3

<em>Please make sure that this is a build/installation issue. As per our [GitHub… Policy](https://github.com/tensorflow/tensorflow/blob/master/ISSUES.md), we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template</em>

**System information**

- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 10.15.2
- TensorFlow installed from (source or binary): source
- TensorFlow version: 2.3.0-rc0
- Python version: 3.6.4
- Installed using virtualenv? pip? conda?: virtualenv
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: 3.1.0
- GPU model and memory: AMD Radeon R9 M370X 2 GB

**Describe the problem**

The metal delegate dylib for macOS fails to build using the recommended command from the BUILD file (`tensorflow/lite/delegates/gpu/BUILD:L181`).

**Provide the exact sequence of commands / steps that you executed before running into the problem**

From `tensorflow-2.3.0-rc0/`:

```
bazel build -c opt --copt -Os --copt -DTFLITE_GPU_BINARY_RELEASE --copt -fvisibility=default --linkopt -s --strip always --cxxopt=-std=c++14 --apple_platform_type=macos //tensorflow/lite/delegates/gpu:tensorflow_lite_gpu_dylib
```

**Any other info / logs**

Running the above command produces error log:

```
bazel build -c opt --copt -Os --copt -DTFLITE_GPU_BINARY_RELEASE --copt -fvisibility=default --linkopt -s --strip always --cxxopt=-std=c++14 --apple_platform_type=macos //tensorflow/lite/delegates/gpu:tensorflow_lite_gpu_dylib
INFO: Options provided by the client: Inherited 'common' options: --isatty=1 --terminal_columns=174
INFO: Reading rc options for 'build' from /Users/njassal/dev/tensorflow/tensorflow-2.3.0-rc0/.bazelrc: Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /Users/njassal/dev/tensorflow/tensorflow-2.3.0-rc0/.bazelrc: 'build' options: --apple_platform_type=macos --define framework_shared_object=true --define open_source_build=true --java_toolchain=//third_party/toolchains/java:tf_java_toolchain --host_java_toolchain=//third_party/toolchains/java:tf_java_toolchain --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --noincompatible_prohibit_aapt1 --enable_platform_specific_config --config=v2
INFO: Reading rc options for 'build' from /Users/njassal/dev/tensorflow/tensorflow-2.3.0-rc0/.tf_configure.bazelrc: 'build' options: --action_env PYTHON_BIN_PATH=/Users/njassal/.virtualenvs/tensorflow-2.2.0/bin/python3 --action_env PYTHON_LIB_PATH=/Users/njassal/.virtualenvs/tensorflow-2.2.0/lib/python3.6/site-packages --python_path=/Users/njassal/.virtualenvs/tensorflow-2.2.0/bin/python3 --config=xla --action_env TF_CONFIGURE_IOS=1
INFO: Found applicable config definition build:v2 in file /Users/njassal/dev/tensorflow/tensorflow-2.3.0-rc0/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:xla in file /Users/njassal/dev/tensorflow/tensorflow-2.3.0-rc0/.bazelrc: --action_env=TF_ENABLE_XLA=1 --define=with_xla_support=true
INFO: Found applicable config definition build:macos in file /Users/njassal/dev/tensorflow/tensorflow-2.3.0-rc0/.bazelrc: --copt=-w --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --cxxopt=-std=c++14 --host_cxxopt=-std=c++14
DEBUG: Rule 'io_bazel_rules_docker' indicated that a canonical reproducible form can be obtained by modifying arguments shallow_since = "1556410077 -0400"
DEBUG: Repository io_bazel_rules_docker instantiated at: no stack (--record_rule_instantiation_callstack not enabled) Repository rule git_repository defined at: /private/var/tmp/_bazel_njassal/c07bdfc7f101779d38a8eaaebedd6122/external/bazel_tools/tools/build_defs/repo/git.bzl:195:18: in <toplevel>
INFO: Repository eigen_archive instantiated at: no stack (--record_rule_instantiation_callstack not enabled) Repository rule tf_http_archive defined at: /Users/njassal/dev/tensorflow/tensorflow-2.3.0-rc0/third_party/repo.bzl:134:19: in <toplevel>
ERROR: /private/var/tmp/_bazel_njassal/c07bdfc7f101779d38a8eaaebedd6122/external/cpuinfo/BUILD.bazel:96:1: Configurable attribute "srcs" doesn't match this configuration (would a default condition help?).
Conditions checked: @cpuinfo//:linux_x86_64 @cpuinfo//:linux_arm @cpuinfo//:linux_armhf @cpuinfo//:linux_armv7a @cpuinfo//:linux_armeabi @cpuinfo//:linux_aarch64 @cpuinfo//:macos_x86_64 @cpuinfo//:windows_x86_64 @cpuinfo//:android_armv7 @cpuinfo//:android_arm64 @cpuinfo//:android_x86 @cpuinfo//:android_x86_64 @cpuinfo//:ios_x86_64 @cpuinfo//:ios_x86 @cpuinfo//:ios_armv7 @cpuinfo//:ios_arm64 @cpuinfo//:ios_arm64e @cpuinfo//:watchos_x86_64 @cpuinfo//:watchos_x86 @cpuinfo//:watchos_armv7k @cpuinfo//:watchos_arm64_32 @cpuinfo//:tvos_x86_64 @cpuinfo//:tvos_arm64
WARNING: Download from https://mirror.bazel.build/github.com/Maratyszcza/FP16/archive/4dfe081cf6bcd15db339cf2680b9281b8451eeb3.zip failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException GET returned 404 Not Found
ERROR: Analysis of target '//tensorflow/lite/delegates/gpu:tensorflow_lite_gpu_dylib' failed; build aborted: /private/var/tmp/_bazel_njassal/c07bdfc7f101779d38a8eaaebedd6122/external/cpuinfo/BUILD.bazel:96:1: Configurable attribute "srcs" doesn't match this configuration (would a default condition help?).
Conditions checked: @cpuinfo//:linux_x86_64 @cpuinfo//:linux_arm @cpuinfo//:linux_armhf @cpuinfo//:linux_armv7a @cpuinfo//:linux_armeabi @cpuinfo//:linux_aarch64 @cpuinfo//:macos_x86_64 @cpuinfo//:windows_x86_64 @cpuinfo//:android_armv7 @cpuinfo//:android_arm64 @cpuinfo//:android_x86 @cpuinfo//:android_x86_64 @cpuinfo//:ios_x86_64 @cpuinfo//:ios_x86 @cpuinfo//:ios_armv7 @cpuinfo//:ios_arm64 @cpuinfo//:ios_arm64e @cpuinfo//:watchos_x86_64 @cpuinfo//:watchos_x86 @cpuinfo//:watchos_armv7k @cpuinfo//:watchos_arm64_32 @cpuinfo//:tvos_x86_64 @cpuinfo//:tvos_arm64
INFO: Elapsed time: 0.123s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 0 targets configured)
```

Nick_J · September 5, 2021, 3:37pm

Thanks! My MacBook doesn’t have a GPU, so I’m focusing on performance on 4 CPUs like the EfficientDet benchmarks. I’ll try the optimisation recommendations, though the perf would need to improve by ~10x to match. Is it realistic to expect comparable performance to those benchmarks in the notebook?

My end goal is to run on a device like a Raspberry Pi 4, Jetson Nano or something similar (min 4 CPUs, possibly with a GPU available).
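For reference, the gap between the timings quoted earlier in the thread and the notebook's 37 ms figure can be worked out directly. The sketch below uses only numbers already mentioned in this thread; the names `required_speedup` and `measured` are illustrative:

```python
# Latencies quoted in this thread, in seconds.
NOTEBOOK_LATENCY_S = 0.037  # the 37 ms figure from the notebook

measured = {
    "1 thread, best case": 1.5,
    "1 thread, worst case": 2.0,
    "4 threads, best case": 0.6,
    "4 threads, worst case": 0.75,
}


def required_speedup(measured_s, target_s=NOTEBOOK_LATENCY_S):
    """Factor by which `measured_s` must shrink to reach `target_s`."""
    return measured_s / target_s


for label, seconds in measured.items():
    print(f"{label}: about {required_speedup(seconds):.0f}x slower than 37 ms")
```

With the 4-thread timings this works out to roughly 16x to 20x, somewhat more than the ~10x estimated above.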

Bhack · September 5, 2021, 4:25pm

Try with the XNNPACK delegate.

Bhack · September 5, 2021, 10:28pm

You can also plan to use a Coral device/accelerator. See the benchmarks at:

Nick_J · September 6, 2021, 6:06am

Yeah, I’ll check that out - I was considering getting the Coral Dev Board Mini.

I didn’t realise the Pixel 4 had an Edge TPU, which makes sense now as the benchmarks on the Coral website match up with the notebook.

Might be worth adding that to the table footnote as to me (in my naivety) it seemed to imply that latency was achievable on CPU only. Thanks for all the help!!

Bhack · September 6, 2021, 7:44am

If you still need a Raspberry Pi, you can also check something like:

> Might be worth adding that to the table footnote as to me (in my naivety) it seemed to imply that latency was achievable on CPU only.

I think that it is achievable on CPU, as you can see in detail from the benchmark section in:

https://tfhub.dev/tensorflow/lite-model/efficientdet/lite0/detection/metadata/1

I think your problem on macOS is that the x86 experience is generally optimized for the standard TF runtime, but you can still try to get good performance from TFLite with XNNPACK, provided that what your model needs is already covered by the XNNPACK delegate, as it has op fusion and SSE/AVX x86 kernels.

Nick_J · September 6, 2021, 9:01am

Thanks for those links. I couldn’t figure out how to pass XNNPACK in the list of delegates in Python, but I’ve rebuilt TF with tflite_with_xnnpack=true. With float16 quantization I’m able to get comparable performance. Int8 quantization didn’t yield any performance increase in my basic test, and accuracy did seem to suffer, though that is likely because the models were trained on a very small training set. I think I’ve got enough to go on now though - thank you for your help!

Bhack · September 6, 2021, 9:17am

Yes, this is the recommended way on desktop.

See the current limits and required flags in "support for quantized tflite models" (Issue #999, google/XNNPACK on GitHub).

You can comment there for additional technical questions related to int8 support.