Where is max_pool_with_argmax actually implemented?

Hello,

In the current TensorFlow 2.8.0, nn.max_pool_with_argmax is defined in nn_ops.py as max_pool_with_argmax_v2, and its docstring says:

The indices returned are always in [0, height) x [0, width) before flattening, even if padding is involved and the mathematically correct answer is outside (either negative or too large). This is a bug, but fixing it is difficult to do in a safe backwards compatible way, especially due to flattening.
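To make the flattening concrete, here is a minimal pure-Python sketch of how I read the documented index layout. It is not TensorFlow's implementation: the helper max_pool_with_argmax_2d is a hypothetical name of my own, and it assumes a single image with one channel, VALID padding, and include_batch_in_index=False, so the documented formula (y * width + x) * channels + c reduces to y * width + x. Because it only uses VALID padding, it does not reproduce the padding issue the docstring mentions.

```python
def max_pool_with_argmax_2d(image, ksize, stride):
    """VALID max pooling over a single-channel 2D list.

    Hypothetical helper for illustration only; this is NOT how TensorFlow
    implements the op. Returns (pooled, argmax), where each argmax entry is
    the flattened index y * width + x of the window maximum.
    """
    height, width = len(image), len(image[0])
    pooled, argmax = [], []
    for top in range(0, height - ksize + 1, stride):
        pooled_row, argmax_row = [], []
        for left in range(0, width - ksize + 1, stride):
            best_val, best_idx = None, None
            # Scan the window and remember the position of the largest value.
            for y in range(top, top + ksize):
                for x in range(left, left + ksize):
                    if best_val is None or image[y][x] > best_val:
                        best_val = image[y][x]
                        best_idx = y * width + x  # flatten (y, x)
            pooled_row.append(best_val)
            argmax_row.append(best_idx)
        pooled.append(pooled_row)
        argmax.append(argmax_row)
    return pooled, argmax


image = [
    [1, 2, 0, 7],
    [3, 4, 1, 0],
    [0, 1, 9, 2],
    [2, 0, 3, 5],
]
pooled, argmax = max_pool_with_argmax_2d(image, ksize=2, stride=2)
print(pooled)   # [[4, 7], [2, 9]]
print(argmax)   # [[5, 3], [12, 10]]
```

For example, the value 9 sits at row 2, column 2 of a width-4 image, so its flattened index is 2 * 4 + 2 = 10.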

To understand what is actually going on, I wanted to look at how the ‘eager’ version of this operation is implemented. I saw that max_pool_with_argmax in gen_nn_ops.py calls TFE_Py_FastPathExecute with the argument "MaxPoolWithArgmax". Now I am stuck: where is this call dispatched to? Where is the code that actually executes the max pooling?

I think these are the CPU and GPU C++/CUDA kernels: