My new code walkthrough on Keras.io covers Gradient Centralization, a simple trick that can markedly speed up model convergence and is implemented in Keras in literally 10 lines of code. It can both accelerate training and improve the final generalization performance of DNNs.
Further, the code example also shows the improvements from using Gradient Centralization while training on @Laurence_Moroney 's Horses v Humans dataset.
This was also my first time contributing to Keras examples and many thanks to @fchollet and @Sayak_Paul for helping all along!
Interesting! That worked really well in the example.
I’m not quite getting the intuition for what this does or why it works. What’s your understanding? I think I understand why you keep the last axis, and what this would do with SGD. But it’s less clear when applied through one of these more complex optimizers. Can you summarize your understanding of it (without using the word “Lipschitzness”)?
Thanks so much for taking a look at this example!
Here is my intuition, after reading the paper, for what this does and why it works in the first place, especially in the context of this example.
As I understand it, GC computes the mean of each column of the gradient matrix and subtracts that mean from the column, so that every column has zero mean. (This is unlike the first thing we would probably think of, normalizing the gradients, which does not work very well.) As for how this works, we could put it intuitively (without notation):
centralized gradient = gradient of L with respect to W (the standard term) - mean of (gradient of L with respect to W)
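A minimal NumPy sketch of that operation (the function name is mine, not from the example; the Keras example applies the same idea inside an optimizer):

```python
import numpy as np

def centralize_gradient(grad):
    # Subtract the mean over all axes except the last, so that each
    # column (output unit) of the gradient matrix has zero mean.
    # 1-D gradients (e.g. biases) are left untouched.
    if grad.ndim > 1:
        axes = tuple(range(grad.ndim - 1))
        grad = grad - grad.mean(axis=axes, keepdims=True)
    return grad

g = np.array([[1.0, 2.0],
              [3.0, 4.0]])
gc = centralize_gradient(g)
print(gc)               # [[-1. -1.]
                        #  [ 1.  1.]]
print(gc.mean(axis=0))  # [0. 0.] — each column now has zero mean
```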
We can see that the modified gradient can be viewed as the original gradient projected onto a hyperplane whose normal is the unit vector along the all-ones direction (with the same dimension as a column of the weight matrix). Similar to the intuition behind batch normalization, this constrains the weights to a hyperplane. So it regularizes the weight space and improves generalization, especially when there are fewer examples (as shown in the example). And this should work across all kinds of optimizers, right?
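This projection view can be checked numerically: subtracting the mean is exactly applying the projector onto the hyperplane orthogonal to the all-ones direction. A small sketch (the vectors here are arbitrary illustrations, not from the example):

```python
import numpy as np

n = 4
e = np.ones((n, 1)) / np.sqrt(n)   # unit vector along the all-ones direction
P = np.eye(n) - e @ e.T            # projector onto the hyperplane with normal e

g = np.array([2.0, -1.0, 3.0, 0.5])  # one column of a gradient matrix

# Projection and mean subtraction give the same result,
# and the centralized gradient is orthogonal to e.
print(np.allclose(P @ g, g - g.mean()))      # True
print(np.allclose(e.T @ (P @ g), 0.0))       # True
```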
The paper also talks about regularizing the output feature space: if we use GC to update weights with SGD-based optimizers, then for a feature x and the shifted feature x + some constant, the paper derives that the resulting change in the output activation depends only on that constant and the mean of the initial weights, not on the final weights. So if the mean of the initial weights is very close to 0, the output feature space ends up more robust to such variations in the training data. This should also hold for more complex optimizers derived from SGD.
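The mechanism behind this is that every centralized update has zero mean, so the mean of the weights never moves during training. A small NumPy sanity check of that invariance (the random gradients are just stand-ins for real loss gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # one column of a weight matrix
initial_mean = w.mean()

for _ in range(100):
    g = rng.normal(size=5)      # stand-in for a loss gradient
    g = g - g.mean()            # gradient centralization
    w = w - 0.1 * g             # plain SGD step

# The weight mean is unchanged, so the output shift for an input
# x + c*1 versus x is c * sum(w) = c * n * mean(initial weights),
# fixed throughout training.
print(abs(w.mean() - initial_mean))  # ~0 (up to float error)
```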
Apart from this, in my opinion another major aspect shown in section 4.2 of the original paper is the comparison of the original loss with the constrained loss, showing how the optimization landscape is smoothed and training time reduced; however, they derive this directly via Lipschitzness, unlike the other properties, for which I shared my geometric understanding.