Gradient accumulation - strange behaviour

Hello, I have created a simple way of adding gradient accumulation (GA) support to a Keras model, which simply overrides the train_step method. I have also added support for mixed precision and adaptive gradient clipping.
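For context, the general pattern I have in mind looks roughly like the sketch below. This is a stripped-down illustration only (no mixed precision or clipping), and the class and variable names (`GAModel`, `accum_steps`, the accumulator variables) are just placeholders; my actual code is further down under "Current implementation".

```python
import tensorflow as tf


class GAModel(tf.keras.Model):
    """Sketch of gradient accumulation via a train_step override.

    Intended to be constructed functionally, e.g.
    GAModel(inputs=inp, outputs=out, accum_steps=4), so the trainable
    variables already exist when the accumulators are created.
    """

    def __init__(self, *args, accum_steps=4, **kwargs):
        super().__init__(*args, **kwargs)
        self.accum_steps = tf.constant(accum_steps, dtype=tf.int32)
        self._accum_counter = tf.Variable(0, dtype=tf.int32, trainable=False)
        # One non-trainable accumulator per trainable weight.
        self._accum_grads = [
            tf.Variable(tf.zeros_like(v), trainable=False)
            for v in self.trainable_variables
        ]

    def train_step(self, data):
        x, y = data
        self._accum_counter.assign_add(1)

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            # Scale the loss so the summed gradients average out
            # over the accumulation window.
            loss = self.compiled_loss(y, y_pred) / tf.cast(
                self.accum_steps, tf.float32
            )

        grads = tape.gradient(loss, self.trainable_variables)
        for acc, g in zip(self._accum_grads, grads):
            acc.assign_add(g)

        # Only step the optimizer every `accum_steps` micro-batches.
        tf.cond(
            tf.equal(self._accum_counter, self.accum_steps),
            true_fn=self._apply_accumulated,
            false_fn=lambda: None,
        )

        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def _apply_accumulated(self):
        self.optimizer.apply_gradients(
            zip(self._accum_grads, self.trainable_variables)
        )
        # Reset the accumulators and the counter for the next window.
        for acc in self._accum_grads:
            acc.assign(tf.zeros_like(acc))
        self._accum_counter.assign(0)
```

The idea is that accumulating over `accum_steps` micro-batches and applying the averaged gradients once should behave roughly like training with an effective batch size of `accum_steps` times the per-step batch size.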

We have been using GA in various projects, and mechanically it seems to work. However, in simple benchmarks we don't really see the benefit of the larger effective batch size that gradient accumulation should give us.

Also, with mixed precision enabled we see all sorts of strange behaviour. So I was wondering whether there might be something wrong with the implementation. Can anyone spot any bugs that could explain this?

Current implementation: