Performance difference of fused Batch Norm op vs standard

Given automatic op fusion via XLA, I wonder whether there is still any performance difference between the fused batch norm op (the `fused` option in Keras, or `tf.compat.v1.nn.fused_batch_norm`) and a vanilla batch norm implementation (the standard Keras implementation, or a custom one built from elementary ops).
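To make the comparison concrete, here is a minimal micro-benchmark sketch of the two variants under XLA. It assumes `tf.compat.v1.nn.fused_batch_norm` for the fused path and a hand-rolled composition of `tf.nn.moments` and `tf.nn.batch_normalization` for the vanilla path; the shapes, iteration count, and epsilon are arbitrary illustrative choices, not from any official benchmark.

```python
import time
import tensorflow as tf

C = 128
x = tf.random.normal([64, 32, 32, C])
scale = tf.ones([C])
offset = tf.zeros([C])
EPS = 1e-3

# Fused variant: a single monolithic kernel (here additionally JIT-compiled).
@tf.function(jit_compile=True)
def fused(inp):
    y, _, _ = tf.compat.v1.nn.fused_batch_norm(
        inp, scale, offset, epsilon=EPS, is_training=True)
    return y

# Vanilla variant: composed from elementary ops, relying on XLA to fuse them.
@tf.function(jit_compile=True)
def vanilla(inp):
    mean, var = tf.nn.moments(inp, axes=[0, 1, 2])
    return tf.nn.batch_normalization(inp, mean, var, offset, scale, EPS)

def bench(fn, n=50):
    fn(x)  # warm-up: trace + XLA compile
    t0 = time.perf_counter()
    for _ in range(n):
        fn(x)
    return (time.perf_counter() - t0) / n

if __name__ == "__main__":
    print(f"fused:   {bench(fused) * 1e3:.3f} ms/step")
    print(f"vanilla: {bench(vanilla) * 1e3:.3f} ms/step")
```

The interesting question is whether XLA's automatic fusion of the vanilla path closes the gap to the hand-fused kernel; without `jit_compile=True`, the vanilla path runs as several separate ops and is typically slower.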

We gathered some benchmark numbers a long time ago here, but those were for some TF 1 version and, I believe, without XLA, so they are probably no longer relevant.