Low performance of convolution with AOT

Hi everybody,

I am currently benchmarking model inference with ONNX, the TensorFlow C++ API, and ahead-of-time (AOT) compilation.
The benchmark itself uses std::chrono to measure the runtime. To reduce fluctuation, I run 500 calls of each network. The inputs are just randomly generated floats, since I'm not interested in the actual predictions.
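For reference, the timing loop looks roughly like the sketch below. This is a minimal illustration, not my exact code: `make_random_input`, `mean_runtime_us` and the `run_model` callback are hypothetical names standing in for one inference call of whichever backend is measured, and the warm-up call and the fixed seed are assumptions added for the sketch.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Fill a buffer with uniformly distributed random floats between 0 and 1.
// The fixed seed keeps runs reproducible; the values themselves do not matter.
std::vector<float> make_random_input(std::size_t n, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> data(n);
    for (float& x : data) x = dist(rng);
    return data;
}

// Call `run_model` `runs` times and return the mean wall-clock time per call
// in microseconds. `run_model` is a hypothetical stand-in for a single
// inference call (ONNX, TensorFlow C++ or the AOT-compiled model).
double mean_runtime_us(const std::function<void()>& run_model, int runs = 500) {
    assert(runs > 0);
    run_model();  // warm-up call (an assumption of this sketch), not timed
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) run_model();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / runs;
}
```

Each backend is then wrapped in a lambda and passed to `mean_runtime_us`, for example `mean_runtime_us([&] { model.Run(); })`, where `model.Run()` is a placeholder for the backend's actual inference call.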

Benchmark for DNN:
The benchmark results for simple feedforward networks are somewhat comparable. All inference approaches are at least within the same order of magnitude.

Benchmark for CNN:
For a very simple CNN, the AOT model takes much longer to run (a factor of 100).

The CNN is fairly simple: the model takes 32 inputs with 1 channel, the kernel size ranges from 1 to 4, and the follow-up feedforward network is of decent size (10 layers and 128 units).

My question is: why does the AOT network perform so badly? Is there a way to prevent this, for example by setting certain XLA flags?

After looking around, I found out that this is a fairly well-known problem, at least on GPU:
See here.

I have the feeling that AOT tries to be “clever” and maps the convolution kernel in a way that results in a very large number of operations.

For example, it might reserve a buffer for each movement of the kernel window. This could at least explain why I found a very large buffer for the filters in the header of the AOT model (10x as large as that of a dense layer).
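To make the suspicion concrete, here is a small sketch of what I mean by "one buffer per window position". It is purely an illustration of the hypothesis, not a claim about what the AOT compiler actually generates: the kernel is copied once per output position into a mostly-zero matrix, and the convolution becomes a dense matrix-vector product. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct "valid" 1D convolution (cross-correlation, as in most DL frameworks).
// Only the k kernel weights are stored.
std::vector<float> conv1d_direct(const std::vector<float>& x,
                                 const std::vector<float>& w) {
    assert(!w.empty() && x.size() >= w.size());
    const std::size_t out_len = x.size() - w.size() + 1;
    std::vector<float> y(out_len, 0.0f);
    for (std::size_t i = 0; i < out_len; ++i)
        for (std::size_t j = 0; j < w.size(); ++j)
            y[i] += w[j] * x[i + j];
    return y;
}

// Hypothetical unrolled form: one shifted copy of the kernel per window
// position, stored as a row-major (out_len x in_len) matrix of mostly zeros.
std::vector<float> unroll_kernel(const std::vector<float>& w,
                                 std::size_t in_len) {
    assert(!w.empty() && in_len >= w.size());
    const std::size_t out_len = in_len - w.size() + 1;
    std::vector<float> m(out_len * in_len, 0.0f);
    for (std::size_t i = 0; i < out_len; ++i)
        for (std::size_t j = 0; j < w.size(); ++j)
            m[i * in_len + i + j] = w[j];
    return m;
}

// Same convolution expressed as a dense matrix-vector product over the
// unrolled matrix: out_len * in_len multiply-adds instead of out_len * k.
std::vector<float> conv1d_unrolled(const std::vector<float>& m,
                                   const std::vector<float>& x) {
    assert(!x.empty() && m.size() % x.size() == 0);
    const std::size_t in_len = x.size();
    const std::size_t out_len = m.size() / in_len;
    std::vector<float> y(out_len, 0.0f);
    for (std::size_t i = 0; i < out_len; ++i)
        for (std::size_t c = 0; c < in_len; ++c)
            y[i] += m[i * in_len + c] * x[c];
    return y;
}
```

With 32 inputs and a kernel of size 4 there are 29 window positions, so the unrolled matrix holds 29 x 32 = 928 floats instead of 4 weights, and the product needs 928 multiply-adds instead of 116. If something along these lines happens per filter, it would inflate both the filter buffer and the operation count.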

Thanks a lot for your time.
I really appreciate it.

PS: If you are interested, I can also upload the benchmark plots and a picture of the buffers.