innat
July 27, 2023, 12:03pm
#1
I am trying to translate a PyTorch implementation into TensorFlow and ran into some gradient-level issues. I already opened an issue with reproducible code here:
GitHub issue, opened 11:35AM - 15 Jul 23 UTC — labels: type:bug, comp:ops, comp:core, TF 2.12
### Issue type
Bug (reproduced with TensorFlow Nightly)
### TensorFlow version
tf 2.12
### Custom code
Yes
_(OS, mobile device, Python, Bazel, GCC, CUDA/cuDNN, and GPU details: no response.)_
### Current behavior?
I have PyTorch code that computes the gradient of a gradient w.r.t. some computation, and it works fine. When I translate it into TensorFlow, however, I get errors.
## Standalone code to reproduce the issue
Here is the reproducible code: [Gist](https://colab.research.google.com/drive/1GPhctZNrXynrCQ0qNbLyMDmuixQtC0fw?usp=sharing).
The Colab is small and quickly reproduces both runs. PyTorch runs as expected but TensorFlow doesn't. Below is the main spot to look at:
**Main Part**
In PyTorch,
```python
rand_model = Random()
model = Model()
rand_optim = torch.optim.SGD(
    rand_model.parameters(), lr=0.01
)
# Materialize the generator so the parameters can be reused below.
model_params = list(model.parameters())

loss_mod = model.forward(x)
loss_rand = model.forward(y)

model_grad = torch.autograd.grad(loss_mod, model_params)
rand_grad = torch.autograd.grad(
    loss_rand,
    model_params,
    create_graph=True,
)

loss = some_method(model_grad, rand_grad)
rand_model.zero_grad()
loss.backward()
rand_optim.step()
```
In PyTorch, the `create_graph=True` above is crucial.
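To illustrate why `create_graph=True` matters, here is a minimal scalar sketch (independent of the models above): the first `grad` call records its own backward pass, so the result can itself be differentiated again.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3  # dy/dx = 3x^2, d2y/dx2 = 6x

# create_graph=True records the backward pass itself, so `g` stays
# connected to `x`; without it, the second grad call below raises
# a "does not require grad" error.
(g,) = torch.autograd.grad(y, x, create_graph=True)
(gg,) = torch.autograd.grad(g, x)

print(float(g), float(gg))  # 27.0 18.0
```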
In TensorFlow, I tried
```python
ran_model = Random()
ran_optim = tf.keras.optimizers.SGD()
model = Model()
model.build(input_shape=(1, 784))
optim = tf.keras.optimizers.SGD(0.01)
model_params = model.trainable_variables

with tf.GradientTape(persistent=True) as tape:
    tape.watch(ran_model.trainable_variables)
    loss_mod = tf.reduce_mean(tf.math.log(model(x)[:, i]))
    loss_rand = tf.reduce_mean(tf.math.log(model(y)[:, i]))

grads_mod = tape.gradient(loss_mod, model_params)
grads_rand = tape.gradient(loss_rand, model_params)

loss = some_method(grads_mod, grads_rand)
ran_model_grads = tape.gradient(loss, ran_model.trainable_variables)
ran_optim.apply_gradients(
    zip(ran_model_grads, ran_model.trainable_variables)
)
```
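A likely cause, sketched on a toy scalar (nothing here is taken from the code above): a result of `tape.gradient` computed outside any recording tape is just a constant tensor, so differentiating through it yields `None`. Taking the inner gradient *inside* an outer tape keeps the connection, which is TensorFlow's analogue of `create_graph=True`.

```python
import tensorflow as tf

x = tf.Variable(3.0)

# Broken: the inner gradient is computed outside any outer tape,
# so `g` is a constant as far as further differentiation goes.
with tf.GradientTape() as inner:
    y = x ** 3
g = inner.gradient(y, x)          # 3*x^2 = 27
with tf.GradientTape() as outer:
    loss = g * 2.0                # built only from an untracked constant
print(outer.gradient(loss, x))    # None: loss is disconnected from x

# Working: take the inner gradient *inside* the outer tape so the
# grad-of-grad path is recorded.
with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        y = x ** 3
    g = inner.gradient(y, x)      # recorded by `outer`
    loss = g * 2.0                # loss = 6*x^2
print(outer.gradient(loss, x))    # d(6x^2)/dx = 12x = 36
```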
The `tf` code gives the following error.
```text
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-01562609cda8> in <cell line: 33>()
44 loss += tf.reduce_sum(tf.stack([a, b], axis=0))
45 ran_model_grads = tape.gradient(loss, ran_model.trainable_variables)
---> 46 ran_optim.apply_gradients(zip(ran_model_grads, ran_model.trainable_variables))
47
48
3 frames
/usr/local/lib/python3.10/dist-packages/keras/optimizers/utils.py in filter_empty_gradients(grads_and_vars)
75 if not filtered:
76 variable = ([v.name for _, v in grads_and_vars],)
---> 77 raise ValueError(
78 f"No gradients provided for any variable: {variable}. "
79 f"Provided `grads_and_vars` is {grads_and_vars}."
ValueError: No gradients provided for any variable: (['Variable:0'],). Provided `grads_and_vars` is ((None, <tf.Variable 'Variable:0' shape=(10, 1, 784) dtype=float32, numpy=
```
- This is probably because `loss` and `ran_model.trainable_variables` are not connected. As mentioned in the [autodiff guide](https://www.tensorflow.org/guide/autodiff),
> When a **target** is not connected to a **source**, the gradient will return `None`
- In PyTorch, `create_graph=True` makes it possible to compute the gradient of a gradient later on. I tried the nested-tape approach from the [grad-of-grad example](https://www.tensorflow.org/guide/advanced_autodiff#example_input_gradient_regularization), but it didn't work either (shown below). The reason is probably the same: source and target are not connected.
```python
for i in range(5):
    with tf.GradientTape() as tape1:
        loss_mod = tf.reduce_mean(tf.math.log(model(x)[:, i]))
    grads_mod = tape1.gradient(loss_mod, model_params)

    with tf.GradientTape() as tape3:
        with tf.GradientTape() as tape2:
            loss_rand = tf.reduce_mean(tf.math.log(model(y)[:, i]))
        grads_rand = tape2.gradient(loss_rand, model_params)
        loss = 0
        for a, b in zip(grads_mod, grads_rand):
            loss += tf.reduce_sum(tf.stack([a, b], axis=0))

    # ISSUE: this returns None for every variable.
    ran_model_grads = tape3.gradient(loss, ran_model.trainable_variables)
    ran_optim.apply_gradients(zip(ran_model_grads, ran_model.trainable_variables))
```
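For reference, here is a self-contained variant with tiny stand-in models (the layer sizes, and the assumption that `y` is produced by `ran_model`, are mine and may not match my actual setup) in which the nested tapes do produce non-`None` gradients. The two changes are that `y = ran_model(z)` is computed inside the outer tape, so a path from the loss to `ran_model`'s variables exists at all, and that the inner `tape.gradient` call also happens inside it:

```python
import tensorflow as tf

tf.random.set_seed(0)

# Hypothetical stand-ins for Model() and Random(); shapes are arbitrary.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
model.build((None, 8))
ran_model = tf.keras.Sequential([tf.keras.layers.Dense(8)])
ran_model.build((None, 8))
ran_optim = tf.keras.optimizers.SGD(0.01)

x = tf.random.uniform((4, 8))
z = tf.random.uniform((4, 8))
model_params = model.trainable_variables

# First-order gradients of the model loss (no second derivative needed).
with tf.GradientTape() as tape1:
    loss_mod = tf.reduce_mean(tf.math.log(model(x)[:, 0]))
grads_mod = tape1.gradient(loss_mod, model_params)

with tf.GradientTape() as tape_outer:
    y = ran_model(z)  # loss_rand must depend on ran_model's variables
    with tf.GradientTape() as tape_inner:
        loss_rand = tf.reduce_mean(tf.math.log(model(y)[:, 0]))
    # Taken *inside* tape_outer, so the grad-of-grad path is recorded.
    grads_rand = tape_inner.gradient(loss_rand, model_params)
    loss = tf.add_n([tf.reduce_sum(tf.stack([a, b], axis=0))
                     for a, b in zip(grads_mod, grads_rand)])

ran_model_grads = tape_outer.gradient(loss, ran_model.trainable_variables)
assert all(g is not None for g in ran_model_grads)  # connected now
ran_optim.apply_gradients(zip(ran_model_grads, ran_model.trainable_variables))
```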
How should this be structured in TensorFlow? Any suggestions or feedback would be highly appreciated. Thank you.