How to generate a GPU kernel and execute it during an HLO optimization pass

I am trying to do autotuning during the graph optimization phase, inspired by "A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers". However, I am having trouble generating a GPU kernel from a single HloInstruction and executing it. From what I have seen, I think it is possible to execute a kernel during the graph optimization phase, but I cannot find out how to do it.

My questions are:

  1. Is there a convenient way to generate a GPU kernel from a single HloInstruction?

  2. Aside from ExecuteKernelOnStream, is there an easier way to run the kernel?

  3. On what abstraction does the stream executor run the kernel?

Thank you!