TF Lite 2.5 vs 2.7: setNumThreads(-1) behaves differently?

Hi,

I’ve been comparing the performance (inference time) of my models between TF Lite 2.5 and 2.7. With TF Lite 2.5, I found that setting the number of threads to -1 via setNumThreads worked well on average: the performance matched using around 4 threads.

However, when I recently started working with TF Lite 2.7 and again set the number of threads to -1, the inference time matched using 1 thread. Is this expected? I have only tested on Android so far.
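For reference, my comparison is a plain wall-clock timing of repeated runs. Since TF Lite itself isn’t needed to show the shape of the measurement, here is a self-contained sketch (the workload is dummy busy-work standing in for interpreter.run(), and all names are my own):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: times a dummy parallel workload at different thread
// counts, the same way one would compare inference latency after
// setNumThreads(n). The workload is plain Java, not TF Lite inference.
public class ThreadTimingSketch {
    // Run `tasks` chunks of busy work on `numThreads` threads; return millis.
    static long timeWorkload(int numThreads, int tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        long start = System.nanoTime();
        List<Future<Double>> futures = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            futures.add(pool.submit(() -> {
                double acc = 0;
                for (int k = 0; k < 2_000_000; k++) acc += Math.sqrt(k);
                return acc;
            }));
        }
        for (Future<Double> f : futures) f.get(); // wait for completion
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("1 thread:  " + timeWorkload(1, 8) + " ms");
        System.out.println("4 threads: " + timeWorkload(4, 8) + " ms");
    }
}
```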

Thanks,
Uvin

Hi Uvin,

Can you try the same benchmark with the 2.8 version released last week?

And there’s also a change to this API specifically in the release notes: Release TensorFlow 2.8.0 · tensorflow/tensorflow · GitHub

One factor that I think is related is that XNNPack support was enabled by default for the C++ API; I believe that happened around the TF Lite 2.7 timeframe.

Looking at the source code, I see that num_threads == -1 is treated as single-threaded for XNNPack:

Whereas for Eigen, which is used if XNNPack isn’t enabled, the default is to use 4 threads, and passing num_threads == -1 keeps the default:

So, I suspect some of the operations in your model were previously implemented using Eigen, but with TF Lite 2.7 are now using XNNPack by default, and so you now get 1 thread by default rather than 4.
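To make the suspected difference concrete, here is a small self-contained sketch. This is not the actual TF Lite implementation; the method name and constants are my own, and it only models the defaults described above:

```java
// Models the thread-count resolution described above (NOT TF Lite code):
// XNNPack treats num_threads == -1 as single-threaded, while the Eigen
// path keeps its default of 4 threads.
public class ThreadResolutionSketch {
    static final int EIGEN_DEFAULT_THREADS = 4;   // assumed Eigen default
    static final int XNNPACK_DEFAULT_THREADS = 1; // assumed XNNPack default

    // An explicit positive request wins; -1 falls back to the
    // backend-specific default.
    static int resolveThreadCount(int requested, boolean xnnpackEnabled) {
        if (requested > 0) {
            return requested;
        }
        return xnnpackEnabled ? XNNPACK_DEFAULT_THREADS : EIGEN_DEFAULT_THREADS;
    }

    public static void main(String[] args) {
        // -1 with XNNPack (TF Lite 2.7+) resolves to 1 thread;
        // -1 with Eigen (TF Lite 2.5) resolves to 4 threads.
        System.out.println("XNNPack: " + resolveThreadCount(-1, true));
        System.out.println("Eigen:   " + resolveThreadCount(-1, false));
    }
}
```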

The documentation leaves the effect of num_threads == -1 deliberately underspecified:

  /// If set to the value -1, the number of threads used
  /// will be implementation-defined and platform-dependent.

I suspect that the intent was that -1 should correspond to a reasonable number of threads that is likely to give good performance. But your mileage may vary, as they say.

My advice: if multithreading is critical to the performance of your app, try calling setNumThreads(4) rather than setNumThreads(-1).
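If you’d rather not hard-code 4 everywhere, one option is a small helper that derives an explicit count and passes it to setNumThreads. This is just a sketch of one reasonable policy, not anything TF Lite provides; the cap of 4 mirrors the old Eigen default:

```java
// Sketch: derive an explicit thread count instead of relying on -1.
// availableProcessors() gives a rough upper bound on useful CPU
// parallelism; the cap of 4 mirrors the old Eigen default.
public class ThreadCountChooser {
    static int chooseThreads() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Math.max(1, Math.min(4, cores));
    }

    public static void main(String[] args) {
        // In an app, pass this value to Interpreter.Options.setNumThreads(...).
        System.out.println("Using " + chooseThreads() + " threads");
    }
}
```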
