Gradient accumulation - strange behaviour

Hello, I have created a simple way of adding gradient accumulation (GA) support to a Keras model, which simply overrides the train_step method. I have also added support for mixed precision and adaptive gradient clipping.
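For context, the general pattern I have in mind looks roughly like the sketch below. This is a stripped-down illustration only (no mixed precision or clipping), and the class and variable names (`GAModel`, `accum_steps`, the accumulator variables) are just placeholders; my actual code is further down under "Current implementation".

```python
import tensorflow as tf


class GAModel(tf.keras.Model):
    """Sketch of gradient accumulation via a train_step override.

    Intended to be constructed functionally, e.g.
    GAModel(inputs=inp, outputs=out, accum_steps=4), so the trainable
    variables already exist when the accumulators are created.
    """

    def __init__(self, *args, accum_steps=4, **kwargs):
        super().__init__(*args, **kwargs)
        self.accum_steps = tf.constant(accum_steps, dtype=tf.int32)
        self._accum_counter = tf.Variable(0, dtype=tf.int32, trainable=False)
        # One non-trainable accumulator per trainable weight.
        self._accum_grads = [
            tf.Variable(tf.zeros_like(v), trainable=False)
            for v in self.trainable_variables
        ]

    def train_step(self, data):
        x, y = data
        self._accum_counter.assign_add(1)

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            # Scale the loss so the summed gradients average out
            # over the accumulation window.
            loss = self.compiled_loss(y, y_pred) / tf.cast(
                self.accum_steps, tf.float32
            )

        grads = tape.gradient(loss, self.trainable_variables)
        for acc, g in zip(self._accum_grads, grads):
            acc.assign_add(g)

        # Only step the optimizer every `accum_steps` micro-batches.
        tf.cond(
            tf.equal(self._accum_counter, self.accum_steps),
            true_fn=self._apply_accumulated,
            false_fn=lambda: None,
        )

        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def _apply_accumulated(self):
        self.optimizer.apply_gradients(
            zip(self._accum_grads, self.trainable_variables)
        )
        # Reset the accumulators and the counter for the next window.
        for acc in self._accum_grads:
            acc.assign(tf.zeros_like(acc))
        self._accum_counter.assign(0)
```

The idea is that accumulating over `accum_steps` micro-batches and applying the averaged gradients once should behave roughly like training with an effective batch size of `accum_steps` times the per-step batch size.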

We have been using GA in various projects, and mechanically it seems to work. However, in simple benchmarks we don't really see the benefit of the larger effective batch size that gradient accumulation should give us.

Also, with mixed precision enabled we see all sorts of strange behaviour. So I was wondering whether there might be something wrong with the implementation. Can anyone spot any bugs that could explain this?

Current implementation: