Performance overhead of TensorFlow custom ops

Hi all, I have written some TensorFlow custom ops that replace stock TensorFlow kernels. When I benchmark the kernels in isolation in a C++ test bench, their performance is comparable to that of the stock TensorFlow kernels. However, when I load them through the custom op framework and run them in TensorFlow Serving, they become a lot slower. Does anybody here have any ideas why this could be?

Have you also benchmarked your kernel in isolation inside TF Serving, or inside a model?
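
In case it is useful, here is a rough sketch of the kind of side-by-side timing I have in mind. It is only a generic harness, not anything TensorFlow-specific: `benchmark`, `stock_op`, and `custom_op` are hypothetical names, and the two op functions are placeholders that you would replace with calls that run the stock kernel and your custom kernel through the same path (for example the same model or the same serving request). The warm-up loop is there so that one-time costs do not get counted as per-call kernel time.

```python
import statistics
import time


def benchmark(fn, warmup=10, iters=100):
    """Return the median wall-clock time per call of fn, in seconds."""
    # Warm-up calls absorb one-time costs (lazy initialization, allocation,
    # caching) so they are not attributed to the steady-state call time.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to occasional outliers than the mean.
    return statistics.median(samples)


# Placeholder workloads (assumptions for illustration only). Replace each
# with a call that executes the stock op or the custom op in the same setting.
def stock_op():
    return sum(i * i for i in range(1000))


def custom_op():
    return sum(i * i for i in range(1000))


stock_time = benchmark(stock_op)
custom_time = benchmark(custom_op)
print(f"stock:  {stock_time * 1e6:.1f} us/call")
print(f"custom: {custom_time * 1e6:.1f} us/call")
print(f"ratio:  {custom_time / stock_time:.2f}x")
```

If the ratio is close to 1 here but much worse end to end, that would suggest the extra time is being spent around the kernel rather than in it.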