Compute gradients across two layers using gradients calculated from a previous layer using tf.gradients or tf.GradientTape

I want to use the gradients of one layer to calculate the gradients of the layer that comes before it.

My motivation: when I tried model parallelism with tf.device, I found that backpropagation was running on the CPU. The entire backward pass only ran on a chosen tf.device once I wrapped the GradientTape call (the part that actually computes the gradients) in the tf.device context manager. Since the model is split across devices, I want the backprop of each partition to execute on the device where that partition is placed.

Ideally, I would like to find out a method with which this oversimplified pseudocode is possible.

with tf.device(device_3):
    grad_3 = tf.gradients(loss, trainable_vars_of_partition_3)

with tf.device(device_2):
    grad_2 = tf.gradients(grad_3, trainable_vars_of_partition_2)

with tf.device(device_1):
    grad_1 = tf.gradients(grad_2, trainable_vars_of_partition_1)

grads = concat(grad_1, grad_2, grad_3)

If something like this exists then I would be overjoyed if you could point me in the right direction.

Unfortunately, I could not find anything as simple as this. The next best approach I could think of was using the gradients of one layer to find the gradients of the layer that comes before it. Using the chain rule and backpropagation, I feel this should be possible.

I created this toy example; solving it is the first step towards the final goal.

Let’s say we have a model with 3 dense layers and no activation functions. x and y are defined as follows:

x = tf.concat([tf.random.uniform([1, 10], minval=0, maxval=0.25),
               tf.random.uniform([1, 10], minval=0.25, maxval=0.5),
               tf.random.uniform([1, 10], minval=0.5, maxval=0.75),
               tf.random.uniform([1, 10], minval=0.75, maxval=1.),
               ], axis=0)

y = tf.constant(0., shape=[4, 1])

d1 = tf.keras.layers.Dense(5, name='d1') 
d2 = tf.keras.layers.Dense(2, name='d2') 
d3 = tf.keras.layers.Dense(1, name='d3') 

I am using a tf.function in this toy example, but an answer using GradientTape in eager mode would also be appreciated.

@tf.function
def tf_func(x, y, d1, d2, d3):
    # Short aliases for these functions keep the code neater and more readable.
    g = tf.gradients
    rs = tf.reduce_sum
    rm = tf.reduce_mean

    o1 = d1(x)
    o2 = d2(o1)
    o3 = d3(o2)

    l = tf.reduce_mean(tf.square(o3 - y))
    
    w3, w2, w1 = d3.trainable_variables, d2.trainable_variables, d1.trainable_variables

    tf.print('actual grads' + '=' * 80)

    dl_dw3 = g(l, w3)
    
    dl_dw2 = g(l, w2)
    tf.print('dl_dw2: \n', dl_dw2)

    dl_dw1 = g(l, w1)   

    tf.print()
    tf.print()
    
    tf.print('reference grads' + '=' * 80)
    dl_do1 = g(l, o1)
    dl_do2 = g(l, o2)
    tf.print('dl_do2: \n', dl_do2)
    dl_do3 = g(l, o3)

    dl_dw1 = g(l, w1)
    dl_dw2 = g(l, w2)
    dl_dw3 = g(l, w3)

    do3_do2 = g(o3, o2)
    do2_do1 = g(o2, o1)

    do3_dw3 = g(o3, w3)
    do2_dw2 = g(o2, w2)
    do1_dw1 = g(o1, w1)


    tf.print('testing chain_rule method' + '=' * 80)
    
    # Added a 't' before derivatives to differentiate between ref_grads and grads obtained using chain rule

    tdl_do3 = g(l, o3) # same as ref_grads

    tdo3_dw3 = g(o3, w3) # same as ref_grads
    tdl_dw3 = [rm(tdl_do3) * tdo3_dw3[0], rm(tdl_do3) * tdo3_dw3[1]] # same as actual grads

    tdo3_do2 = g(o3, o2) # same as ref_grads

    tdl_do2 = tdo3_do2 * rm(tdl_do3, axis=0)  # same as ref_grads
    tf.print('tdl_do2: \n', tdl_do2)

    tdo2_dw2 = g(o2, w2) 
    tf.print('tdo2_dw2: \n', tdo2_dw2)
    
    tdl_dw2 = [tdo2_dw2[0] * rm(tdl_do2, axis=[1]), tdo2_dw2[1] * rm(tdl_do2, axis=[1])]
    tf.print('tdl_dw2: \n', tdl_dw2)

    return None 


tf_func(x, y, d1, d2, d3)

The output was:

actual grads================================================================================
dl_dw2: 
 [[[-3.04819393 -1.30051827]
 [5.02123785 2.14232159]
 [-0.260933906 -0.111328]
 [5.87596226 2.50699162]
 [1.9655633 0.838611722]], [-4.69162369 -2.0016911]]


reference grads================================================================================
dl_do2: 
 [[[-0.43842113 -0.187053293]
 [-0.889310718 -0.379426271]
 [-1.41650343 -0.604354143]
 [-1.94738865 -0.830857456]]]


testing chain_rule method================================================================================
tdl_do2: 
 [[[-0.43842113 -0.187053293]
  [-0.889310718 -0.379426271]
  [-1.41650343 -0.604354143]
  [-1.94738865 -0.830857456]]]
tdo2_dw2: 
 [[[2.10966444 2.10966444]
 [-3.48670244 -3.48670244]
 [0.22972326 0.22972326]
 [-3.95618558 -3.95618558]
 [-1.3790133 -1.3790133]], [4 4]]
tdl_dw2: 
 [[[-2.47443795 -1.05572414]
 [4.08957386 1.74482536]
 [-0.26944378 -0.114958748]
 [4.64023352 1.97976542]
 [1.61745286 0.690089643]], [[-4.69162369 -2.0016911]]]

For some reason, the weight gradients in tdl_dw2 are slightly off from those in dl_dw2, even though the bias gradients are identical. I cannot figure out why.

The gradient of the loss w.r.t. w3 is as expected.

I used tf.reduce_mean to replicate what tf.gradients does internally, as far as I understand it. Please correct me if I am wrong.

From TensorFlow’s documentation:

gradients() adds ops to the graph to output the derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys and for x in xs.

tf.gradients constructs symbolic derivatives of sum of ys w.r.t. x in xs.
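
To see that behaviour in isolation, here is a tiny snippet (independent of the model above) showing that tf.gradients differentiates the sum of ys:

import tensorflow as tf

@tf.function
def sum_of_ys_demo():
    x = tf.constant([1.0, 2.0, 3.0])
    y = x ** 2                        # ys = [1, 4, 9]
    # tf.gradients differentiates sum(y) = x0^2 + x1^2 + x2^2,
    # so the result is [2*x0, 2*x1, 2*x2].
    return tf.gradients(y, x)[0]

print(sum_of_ys_demo())   # [2. 4. 6.]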

Any guidance or help will be greatly appreciated, thank you.

Some similar StackOverflow questions (there are many more):

  1. python - Compute gradients across two models - Stack Overflow
  2. python - Is it possible to acquire an intermediate gradient? (Tensorflow) - Stack Overflow
  3. automatic differentiation - Breaking TensorFlow gradient calculation into two (or more) parts - Stack Overflow

Here is a colab notebook with the code:

Do you think that your case is still within the scope of:

Hi Bhack,

Thank you for responding.

It looks like a great resource, and it might be quite helpful for me to try it out.

I am trying to implement pipeline (data + model) parallelism. I am using multithreading to do so, but it seems like GradientTape cannot follow all the operations, possibly because of the multithreading. I reckon that wrapping the parallelised function in a tf.function and using tf.gradients could solve this issue.

I will try out the implementation suggested in that issue and see if it can help me out.

Thank you once again.


@markdaoust Do you know who can share more hints on this? I don’t know who is subscribed to the tfcore tag.


There’s a lot going on in this question.

But you shouldn’t need to go that low level to do this.

Given Alex’s response I hope we can get that issue fixed eventually.

3 things that may help here in the meantime:

  1. GradientTape.gradient accepts an output_gradients argument. To run your gradient in stages you can do something like this:
with tf.GradientTape() as tape1:
    x = chunk1(input)

with tf.GradientTape() as tape2:
    tape2.watch(x)
    y = chunk2(x)
    loss = loss_fn(y, y_true)

# Gradients for chunk2, plus the gradient of the loss w.r.t. x.
g2, gx = tape2.gradient(loss, [chunk2.variables, x])

# Seed chunk1's backward pass with the incoming gradient gx.
g1 = tape1.gradient(x, chunk1.variables, output_gradients=gx)

I should add an example of this to the Advanced automatic differentiation guide

I’m going to guess that if you wrap your with tf.device around the with tf.GradientTape block, then the gradient operations go on that device.
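
Putting the staged tapes and the tf.device wrapping together, here is a rough end-to-end sketch of what I mean (untested, so take the actual placement with a grain of salt; chunk1, chunk2, the loss and the device strings are all placeholders):

import tensorflow as tf

# Placeholder pieces: swap in your own partitions, devices and loss.
chunk1 = tf.keras.layers.Dense(5)
chunk2 = tf.keras.layers.Dense(1)
loss_fn = tf.keras.losses.MeanSquaredError()
device_1, device_2 = '/CPU:0', '/CPU:0'

def staged_step(inputs, y_true):
    with tf.device(device_1), tf.GradientTape() as tape1:
        x = chunk1(inputs)                 # forward pass of partition 1

    with tf.device(device_2), tf.GradientTape() as tape2:
        tape2.watch(x)                     # x is a plain tensor, so watch it
        y = chunk2(x)                      # forward pass of partition 2
        loss = loss_fn(y_true, y)

    with tf.device(device_2):
        # Gradients of partition 2, plus the gradient flowing back into x.
        g2, gx = tape2.gradient(loss, [chunk2.trainable_variables, x])

    with tf.device(device_1):
        # Continue backprop through partition 1, seeded with gx.
        g1 = tape1.gradient(x, chunk1.trainable_variables, output_gradients=gx)

    return g1, g2

g1, g2 = staged_step(tf.random.uniform([4, 10]), tf.zeros([4, 1]))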

  2. Another thing that could work would be to use @tf.custom_gradient. I’m pretty sure you can use a with tf.device(), tf.GradientTape() as tape: in the custom gradient function (a rough sketch follows this list).

  3. For model parallelism, mesh approaches have been gaining popularity, where instead of assigning layers to devices you split each layer across devices. But I don’t have an off-the-shelf solution for that.
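
For the @tf.custom_gradient idea in 2., I haven’t tested this, but I’m imagining something along these lines (placeholder layer and device, with a reduce_sum at the end only so there is a scalar to differentiate):

import tensorflow as tf

chunk1 = tf.keras.layers.Dense(5)   # placeholder partition
device_1 = '/CPU:0'                 # placeholder device

@tf.custom_gradient
def chunk1_on_device(x):
    # Record this partition's forward pass on its own tape.
    with tf.device(device_1), tf.GradientTape() as tape:
        tape.watch(x)
        y = chunk1(x)

    def grad(upstream, variables=None):
        # Run this partition's backward pass on the same device.
        with tf.device(device_1):
            dx, dvars = tape.gradient(y, [x, variables],
                                      output_gradients=upstream)
        return dx, dvars

    return y, grad

# The outer tape sees chunk1_on_device as a single op whose backward
# pass is the custom grad function above.
with tf.GradientTape() as outer:
    out = tf.reduce_sum(chunk1_on_device(tf.random.uniform([4, 10])))
grads = outer.gradient(out, chunk1.trainable_variables)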


Thanks a ton! Solution 1 worked like a charm.

I tested it out on a TPU on MNIST with 4 partitions. If you would like to take a look at my toy example to see whether it could be of any use as an example, I’d love to share it.

I also wanted to ask whether the same is possible using tf.gradients. Is the grad_ys argument in its documentation equivalent to output_gradients?

tf.gradients?

Maybe this is the most important question.

In the issue @Bhack linked to, Allen mentioned tf.gradients’ colocate_gradients_with_ops option.

But that doesn’t exist in tf.gradients, only tf.compat.v1.gradients.

What’s going on?

Compare the code for tf.gradients to tf.compat.v1.gradients.

colocate_gradients_with_ops is not an option in TF2 because it is always set to True.
So maybe if you just use one big tf.gradients instead of tf.GradientTape, this issue just disappears.
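
So, very roughly, something like this is what I mean (untested sketch; part1/part2 and the device strings are placeholders):

import tensorflow as tf

part1 = tf.keras.layers.Dense(5)          # placeholder partitions
part2 = tf.keras.layers.Dense(1)
device_1, device_2 = '/CPU:0', '/CPU:0'   # placeholder devices

@tf.function
def one_big_gradients(x, y_true):
    with tf.device(device_1):
        h = part1(x)
    with tf.device(device_2):
        y = part2(h)
        loss = tf.reduce_mean(tf.square(y - y_true))

    # One gradient call for the whole model; in graph mode the backward
    # ops are colocated with the forward ops they differentiate, so each
    # partition's backprop should land on that partition's device.
    return tf.gradients(loss, part1.trainable_variables + part2.trainable_variables)

grads = one_big_gradients(tf.random.uniform([4, 10]), tf.zeros([4, 1]))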


From the tf.gradients documentation:

grad_ys is a list of tensors of the same length as ys that holds the initial gradients for each y in ys. When grad_ys is None, we fill in a tensor of '1's of the shape of y for each y in ys. A user can provide their own initial grad_ys to compute the derivatives using a different initial gradient for each y (e.g., if one wanted to weight the gradient differently for each value in each y).

So yes, it’s the same thing, but hopefully not necessary at all in this case…
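
If you do end up needing it, here is a tiny self-contained check (toy tensors, nothing from the model in this thread) showing grad_ys playing the same role as output_gradients:

import tensorflow as tf

@tf.function
def grad_ys_demo():
    x = tf.constant([1.0, 2.0, 3.0])
    h = 3.0 * x          # "first stage"
    y = h * h            # "second stage"

    # Backprop the second stage on its own, then seed the first stage
    # with the result, which is the same role output_gradients plays for a tape.
    dh = tf.gradients(y, h)[0]                     # dy/dh = 2h
    dx_staged = tf.gradients(h, x, grad_ys=dh)[0]

    dx_direct = tf.gradients(y, x)[0]              # single-call reference
    return dx_staged, dx_direct

staged, direct = grad_ys_demo()
print(staged, direct)   # both should be 18 * x = [18. 36. 54.]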
