No registered '_FusedBatchNormEx' OpKernel in graph mode

While writing a pluggable device, I have been testing it with eager execution all along and was able to run a few models, including ResNet50.
Now that I am trying to use graph mode, I run into this issue:

2022-08-08 09:51:26.143724: W tensorflow/core/grappler/utils/graph_view.cc:849] No registered '_FusedBatchNormEx' OpKernel for PPU devices compatible with node {{node resnet50/conv5_block3_2_bn/FusedBatchNormV3}}
        .  Registered:  device='XLA_CPU_JIT'; U in [DT_FLOAT]; T in [DT_FLOAT, DT_BFLOAT16, DT_HALF]
  device='CPU'; T in [DT_BFLOAT16]; U in [DT_FLOAT]
  device='CPU'; T in [DT_FLOAT]; U in [DT_FLOAT]

(PPU is the name of my device).

This looks to me like Grappler performs optimisations such as fusion, and my plugin doesn’t implement the resulting fused kernels.

A few questions come to mind:

  1. What kernels should I implement, or what should I refer to in order to implement the kernels that Grappler produces? In particular, as an example, _FusedBatchNormEx doesn’t seem to have any documentation.
  2. How can I deactivate some optimisations in order to get it to run?
    Referring to the tutorial on pluggable devices, it seems that I can deactivate some optimisations in TF_InitGraph, for example params->optimizer_configs->remapping = TF_TriState_Off;. However, when implementing TF_InitGraph I have to provide an optimizer optimize function, and at this stage of my development I don’t have any custom optimisation to provide and don’t know how to write a dummy function that would be correct.

Thank you for your interest.

@penporn might be able to help.

I’m no expert, but for disabling optimizations, is there anything in this doc that helps?

Thanks for tagging me, @markdaoust!

Hi @slai-nick,

Yes, _FusedBatchNormEx was fused by the grappler remapper pass.

  1. What kernels should I implement, or what should I refer to in order to implement the kernels that Grappler produces?

According to the comment in the remapper pass, _FusedBatchNormEx supports two fusion patterns:

  1. FusedBatchNorm + <Activation>
  2. FusedBatchNorm + SideInput + <Activation>

You can see an example implementation for the GPU device here. (I got this from the kernel registration).
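
For the registration side, here is a minimal sketch of what registering a _FusedBatchNormEx kernel for your device could look like through the C kernel API (tensorflow/c/kernels.h). The PPU_FusedBatchNormEx_* functions are hypothetical placeholders for your actual implementation, and only the float/float type combination is registered:

#include "tensorflow/c/kernels.h"
#include "tensorflow/c/tf_status.h"

/* Hypothetical hooks; the real work (batch norm + side input + activation)
   goes in the compute function. */
static void* PPU_FusedBatchNormEx_Create(TF_OpKernelConstruction* ctx) {
  /* Read attributes such as epsilon, activation_mode, num_side_inputs here. */
  return NULL;
}

static void PPU_FusedBatchNormEx_Compute(void* kernel, TF_OpKernelContext* ctx) {
  /* Fetch inputs from ctx and run the fused computation on the device. */
}

static void PPU_FusedBatchNormEx_Delete(void* kernel) {}

void RegisterFusedBatchNormExKernel(void) {
  TF_Status* status = TF_NewStatus();
  TF_KernelBuilder* builder = TF_NewKernelBuilder(
      "_FusedBatchNormEx", "PPU", &PPU_FusedBatchNormEx_Create,
      &PPU_FusedBatchNormEx_Compute, &PPU_FusedBatchNormEx_Delete);
  /* Constrain the T and U type attributes, mirroring the CPU/GPU registrations. */
  TF_KernelBuilder_TypeConstraint(builder, "T", TF_FLOAT, status);
  TF_KernelBuilder_TypeConstraint(builder, "U", TF_FLOAT, status);
  TF_RegisterKernelBuilder("_FusedBatchNormEx", builder, status);
  TF_DeleteStatus(status);
}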

  2. How can I deactivate some optimisations in order to get it to run?

Please see the example in the Grappler C API RFC. It shows how to disable the remapper pass:

void TF_InitGraphPlugin(TP_OptimizerRegistrationParams* params, TF_Status* status) {
  params->device_type = "GPU";
  // Define some flags indicating whether existing optimizers should be turned on/off
  params->configs->remapping = TF_TriState_Off;
  ...

The tutorial also has example plug-in code.

  • Example TF_InitGraph function. (The name should be TF_InitGraph and not TF_InitGraphPlugin.)
  • Example Optimize function. (A minimal pass-through sketch follows below.)
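
If you don’t have a real pass to contribute yet, the optimize function can simply be a pass-through that copies the serialized input graph to the output buffer. A minimal sketch, assuming the struct layout from tensorflow/c/experimental/grappler/grappler.h (PPU_Optimize and DeallocateGraphBuffer are hypothetical names):

#include <stdlib.h>
#include <string.h>

#include "tensorflow/c/c_api.h"
#include "tensorflow/c/experimental/grappler/grappler.h"

static void DeallocateGraphBuffer(void* data, size_t length) { free(data); }

/* Pass-through optimizer: returns the input GraphDef bytes unchanged. */
static void PPU_Optimize(void* optimizer, const TF_Buffer* graph_buf,
                         const TF_GrapplerItem* item,
                         TF_Buffer* optimized_graph_buf, TF_Status* status) {
  void* copy = malloc(graph_buf->length);
  memcpy(copy, graph_buf->data, graph_buf->length);
  optimized_graph_buf->data = copy;
  optimized_graph_buf->length = graph_buf->length;
  optimized_graph_buf->data_deallocator = DeallocateGraphBuffer;
  TF_SetStatus(status, TF_OK, "");
}

void TF_InitGraph(TP_OptimizerRegistrationParams* params, TF_Status* status) {
  params->struct_size = TP_OPTIMIZER_REGISTRATION_PARAMS_STRUCT_SIZE;
  params->device_type = "PPU";
  /* Turn the remapper off so _FusedBatchNormEx is never created. */
  params->optimizer_configs->remapping = TF_TriState_Off;
  /* Register the do-nothing optimizer. */
  params->optimizer->struct_size = TP_OPTIMIZER_STRUCT_SIZE;
  params->optimizer->optimize_func = PPU_Optimize;
}

With remapping set to TF_TriState_Off, the graph should keep the plain FusedBatchNormV3 nodes you already support in eager mode.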

For a real example, Microsoft has open-sourced their TF-DirectML plug-in code here.