Error message for grouped convolution backprop on CPU is uninformative

kqlu4156 · December 4, 2021, 7:56am

Hello, I recently learned that gradients and backprop for grouped convolution are not supported on CPU, as discussed in the following GitHub threads:

github.com/keras-team/keras

group conv2d can't backprop properly

opened 02:00AM - 28 Nov 21 UTC

closed 02:58AM - 01 Dec 21 UTC

breadbread1984

type:support stat:awaiting response

Please go to TF Forum for help and support: https://discuss.tensorflow.org/ta…g/keras

If you open a GitHub issue, here is our policy: It must be a bug, a feature request, or a significant problem with the documentation (for small docs fixes please send a PR instead). The form below must be filled out.

**Here's why we have that policy:** Keras developers respond to issues. We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed. We want them to see you communicating an interesting problem, rather than being redirected to Stack Overflow.

**System information**

- Have I written custom code (as opposed to using a stock example script provided in Keras): no
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 20.04
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.7.0
- Python version: 3.8.0
- Bazel version (if compiling from source): n/a
- GPU model and memory: n/a
- Exact command to reproduce: n/a

You can collect some of this information using our environment capture script: https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with:
python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

v2.7.0-rc1-69-gc256c071bb2 2.7.0

**Describe the problem**

Describe the problem clearly here. Be sure to convey here why it's a bug in Keras or why the requested feature is needed.

**Describe the current behavior**

Group tf.keras.layers.Conv2D can forward propagate without problem but can't backward propagate properly.

**Describe the expected behavior**

Group Conv2D should back propagate properly.

**[Contributing](https://github.com/keras-team/keras/blob/master/CONTRIBUTING.md)**

- Do you want to contribute a PR? (yes/no): no
- If yes, please read [this page](https://github.com/keras-team/keras/blob/master/CONTRIBUTING.md) for instructions
- Briefly describe your candidate solution(if contributing):

**Standalone code to reproduce the issue**

Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

```python
import tensorflow as tf;

inputs = tf.random.normal(shape = (4,224,224,256));
conv2d = tf.keras.layers.Conv2D(256, (3,3), groups = 4, padding = 'same');
with tf.GradientTape() as tape:
  outputs = conv2d(inputs);
# NOTE: error occurs here
grads = tape.gradient(outputs, conv2d.trainable_variables);
```

**Source code / logs**

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

```shell
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    grads = tape.gradient(outputs, conv2d.trainable_weights);
  File "/home/xieyi/.local/lib/python3.8/site-packages/tensorflow/python/eager/backprop.py", line 1084, in gradient
    flat_grad = imperative_grad.imperative_grad(
  File "/home/xieyi/.local/lib/python3.8/site-packages/tensorflow/python/eager/imperative_grad.py", line 71, in imperative_grad
    return pywrap_tfe.TFE_Py_TapeGradient(
  File "/home/xieyi/.local/lib/python3.8/site-packages/tensorflow/python/eager/backprop.py", line 159, in _gradient_function
    return grad_fn(mock_op, *out_grads)
  File "/home/xieyi/.local/lib/python3.8/site-packages/tensorflow/python/ops/nn_grad.py", line 581, in _Conv2DGrad
    gen_nn_ops.conv2d_backprop_input(
  File "/home/xieyi/.local/lib/python3.8/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1247, in conv2d_backprop_input
    _ops.raise_from_not_ok_status(e, name)
  File "/home/xieyi/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7107, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: Computed input depth 256 doesn't match filter input depth 64 [Op:Conv2DBackpropInput]
```

github.com/tensorflow/tensorflow

Grouped convolutions generate seriously obscure errors on CPU

opened 03:23PM - 03 Sep 21 UTC

frgfm

stat:awaiting tensorflower type:bug comp:ops TF 2.5

Hello there :wave:

Today I ran into a cumbersome error that only happens when running on CPU instead of GPUs. I tracked the source of the error to grouped convolutions and managed to make a reproducible minimal snippet. I happened to suspect that it was because of grouped convolutions since I ran into some problems a few days ago with those using SavedModels but it's pure luck.

It would be good to improve the error message or even get this fixed if possible :pray:
Happy to help provided some directions!

**System information**

- Have I written custom code: yes, the code snippet
- OS Platform and Distribution: Linux Ubuntu 20.04
- TensorFlow installed from: binary, via pip
- TensorFlow version: 2.5.0
- Python version: 3.8
- CUDA/cuDNN version: CUDA 11.4 (cuDNN 8.2.0)
- GPU model and memory: NVIDIA GeForce RTX 2070 with Max-Q Design

**Describe the current behavior**

As of now, running the snippet further down below throws an error on CPU but not on GPU.

**Describe the expected behavior**

Simple:
- having a better error (pointing the lack of support of grouped convolutions on CPU)
- or even better, if that could get fixed :)

**Standalone code to reproduce the issue**

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

samples = tf.zeros((1, 256, 256, 3), dtype=tf.float32)

model = Sequential([layers.Conv2D(18, padding='same', kernel_size=3, groups=1), layers.GlobalAveragePooling2D(), layers.Dense(1)])
trouble_model = Sequential([layers.Conv2D(18, padding='same', kernel_size=3, groups=3), layers.GlobalAveragePooling2D(), layers.Dense(1)])

# Backprop on classic model
with tf.GradientTape() as tape:
    out = model(samples, training=True)
grads = tape.gradient(out, model.trainable_weights)

# Now with grouped conv
with tf.GradientTape() as tape:
    out = trouble_model(samples, training=True)
grads = tape.gradient(out, trouble_model.trainable_weights)
```

which runs successfully on GPU but on CPU throws the following:

```
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-1-e03a8706f9a2> in <module>
     19 with tf.GradientTape() as tape:
     20     out = trouble_model(samples, training=True)
---> 21 grads = tape.gradient(out, trouble_model.trainable_weights)

~/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/backprop.py in gradient(self, target, sources, output_gradients, unconnected_gradients)
   1072                           for x in nest.flatten(output_gradients)]
   1073
-> 1074     flat_grad = imperative_grad.imperative_grad(
   1075         self._tape,
   1076         flat_targets,

~/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/imperative_grad.py in imperative_grad(tape, target, sources, output_gradients, sources_raw, unconnected_gradients)
     69         "Unknown value for unconnected_gradients: %r" % unconnected_gradients)
     70
---> 71   return pywrap_tfe.TFE_Py_TapeGradient(
     72       tape._tape,  # pylint: disable=protected-access
     73       target,

~/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/backprop.py in _gradient_function(op_name, attr_tuple, num_inputs, inputs, outputs, out_grads, skip_input_indices, forward_pass_name_scope)
    157       gradient_name_scope += forward_pass_name_scope + "/"
    158     with ops.name_scope(gradient_name_scope):
--> 159       return grad_fn(mock_op, *out_grads)
    160   else:
    161     return grad_fn(mock_op, *out_grads)

~/miniconda3/lib/python3.8/site-packages/tensorflow/python/ops/nn_grad.py in _Conv2DGrad(op, grad)
    579   # in Eager mode.
    580   return [
--> 581       gen_nn_ops.conv2d_backprop_input(
    582           shape_0,
    583           op.inputs[1],

~/miniconda3/lib/python3.8/site-packages/tensorflow/python/ops/gen_nn_ops.py in conv2d_backprop_input(input_sizes, filter, out_backprop, strides, padding, use_cudnn_on_gpu, explicit_paddings, data_format, dilations, name)
   1245       return _result
   1246     except _core._NotOkStatusException as e:
-> 1247       _ops.raise_from_not_ok_status(e, name)
   1248     except _core._FallbackException:
   1249       pass

~/miniconda3/lib/python3.8/site-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
   6895   message = e.message + (" name: " + name if name is not None else "")
   6896   # pylint: disable=protected-access
-> 6897   six.raise_from(core._status_to_exception(e.code, message), None)
   6898   # pylint: enable=protected-access
   6899

~/miniconda3/lib/python3.8/site-packages/six.py in raise_from(value, from_value)

InvalidArgumentError: Computed input depth 3 doesn't match filter input depth 1 [Op:Conv2DBackpropInput]
```

Nothing in the documentation for grouped convolution indicated that I would run into this issue, and the error message I got when I ran the optimizer (copied below) was unhelpful for diagnosing the source of the problem.

InvalidArgumentError: Computed input depth 32 doesn't match filter input depth 4 [Op:Conv2DBackpropInput]

I did not expect to see a backprop operation for Conv2D when my model was only using Conv1D layers, and I only figured out the issue was related to grouping because I was very familiar with the parameters of the model I was running.
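For anyone else trying to decode this message, here is a minimal sketch of how the two depths seem to relate to the `groups` argument. It assumes that a grouped convolution kernel stores `in_channels // groups` input channels, which is consistent with all three error messages quoted in this thread. The helper name `infer_groups` is hypothetical and not part of TensorFlow.

```python
def infer_groups(computed_input_depth: int, filter_input_depth: int) -> int:
    """Guess the `groups` setting from the two depths in the error message.

    Assumption: the kernel holds in_channels // groups input channels,
    so groups = computed input depth / filter input depth.
    """
    if filter_input_depth <= 0 or computed_input_depth % filter_input_depth != 0:
        raise ValueError("depths are not consistent with a grouped convolution")
    return computed_input_depth // filter_input_depth


# Depth pairs taken from the three error messages quoted in this thread.
for computed, filt in [(256, 64), (3, 1), (32, 4)]:
    print(computed, filt, infer_groups(computed, filt))
```

The first two pairs give 4 and 3, matching the `groups=4` and `groups=3` used in the linked GitHub snippets. A ratio greater than 1 is a hint that a grouped layer is involved.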

I believe adding the following features would make it much easier for future developers to diagnose the issue:

- A more informative error message when you attempt to run backprop on a grouped convolution layer on CPU
- Possibly a warning when a grouped convolution layer is loaded on CPU at all
- A note in the documentation for convolution layers that gradients for grouped convolutions are only supported on GPU
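To illustrate the kind of warning I have in mind, here is a rough sketch of a user-side check. It is not an existing TensorFlow or Keras API: `warn_on_grouped_conv` is a hypothetical helper, and it assumes each layer exposes a `name` and, for convolution layers, a `groups` attribute. Stand-in objects are used so that the sketch runs without TensorFlow installed.

```python
import warnings
from types import SimpleNamespace


def warn_on_grouped_conv(layers, gpu_available):
    """Warn for each layer with groups > 1 when no GPU is available.

    Hypothetical helper, not part of TensorFlow or Keras.
    Returns the names of the flagged layers.
    """
    flagged = []
    for layer in layers:
        # Layers without a `groups` attribute are treated as ungrouped.
        groups = getattr(layer, "groups", 1)
        if groups > 1 and not gpu_available:
            warnings.warn(
                f"Layer '{layer.name}' uses groups={groups}; gradients for "
                "grouped convolutions may be unsupported on CPU in this "
                "TensorFlow version."
            )
            flagged.append(layer.name)
    return flagged


# Stand-in layer objects (assumed attribute names: `name`, `groups`).
demo_layers = [
    SimpleNamespace(name="conv_plain", groups=1),
    SimpleNamespace(name="conv_grouped", groups=4),
    SimpleNamespace(name="dense"),
]
print(warn_on_grouped_conv(demo_layers, gpu_available=False))
```

With a real model, one would presumably pass `model.layers` and `bool(tf.config.list_physical_devices('GPU'))` as the two arguments.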

Thank you.

chunduriv · April 14, 2023, 7:45am

@kqlu4156,

Gradients for grouped convolutions are supported on CPU after TensorFlow 2.9.

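import tensorflow as tf
print(tf.__version__)

# Input of shape (batch, steps, channels); the 128 channels are split across 4 groups.
inputs = tf.random.normal(shape=(4, 10, 128))
conv1d = tf.keras.layers.Conv1D(32, 3, groups=4, padding='same')

# Record the forward pass so that gradients can be computed afterwards.
with tf.GradientTape() as tape:
  outputs = conv1d(inputs)

# Compute the gradients once the tape context has closed.
grads_1d = tape.gradient(outputs, conv1d.trainable_variables)

print(grads_1d)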

Output:

2.12.0
[<tf.Tensor: shape=(3, 32, 32), dtype=float32, numpy=
array([[[ 5.6082282e+00,  5.6082282e+00,  5.6082282e+00, ...,
         -5.7246876e+00, -5.7246876e+00, -5.7246876e+00],
        [-5.2481537e+00, -5.2481537e+00, -5.2481537e+00, ...,
         -1.5736814e+00, -1.5736814e+00, -1.5736814e+00],
        [ 9.2298737e+00,  9.2298737e+00,  9.2298737e+00, ...,
         -7.0252337e+00, -7.0252337e+00, -7.0252337e+00],
        ...,
        [-8.7338948e-01, -8.7338948e-01, -8.7338948e-01, ...,
          7.0602713e+00,  7.0602713e+00,  7.0602713e+00],
        [ 1.1597365e+00,  1.1597365e+00,  1.1597365e+00, ...,
          1.1881893e+00,  1.1881893e+00,  1.1881893e+00],
        [-4.6490440e+00, -4.6490440e+00, -4.6490440e+00, ...,
         -1.7533340e+01, -1.7533340e+01, -1.7533340e+01]],

       [[ 8.1101370e-01,  8.1101370e-01,  8.1101370e-01, ...,
         -3.8097646e+00, -3.8097646e+00, -3.8097646e+00],
        [-4.1635637e+00, -4.1635637e+00, -4.1635637e+00, ...,
         -9.5894051e-01, -9.5894051e-01, -9.5894051e-01],
        [ 1.2245650e+01,  1.2245650e+01,  1.2245650e+01, ...,
         -6.6381464e+00, -6.6381464e+00, -6.6381464e+00],
        ...,
        [ 1.6338730e+00,  1.6338730e+00,  1.6338730e+00, ...,
          7.8838940e+00,  7.8838940e+00,  7.8838940e+00],
        [ 1.5161037e-02,  1.5161037e-02,  1.5161037e-02, ...,
          1.5670588e+00,  1.5670588e+00,  1.5670588e+00],
        [-3.5443621e+00, -3.5443621e+00, -3.5443621e+00, ...,
         -1.6055473e+01, -1.6055473e+01, -1.6055473e+01]],

       [[-1.3902726e+00, -1.3902726e+00, -1.3902726e+00, ...,
         -4.0715046e+00, -4.0715046e+00, -4.0715046e+00],
        [-6.2915697e+00, -6.2915697e+00, -6.2915697e+00, ...,
         -1.0634291e+00, -1.0634291e+00, -1.0634291e+00],
        [ 1.0099386e+01,  1.0099386e+01,  1.0099386e+01, ...,
         -5.7550478e+00, -5.7550478e+00, -5.7550478e+00],
        ...,
        [ 2.5703638e+00,  2.5703638e+00,  2.5703638e+00, ...,
          7.6247263e+00,  7.6247263e+00,  7.6247263e+00],
        [ 2.0062921e+00,  2.0062921e+00,  2.0062921e+00, ...,
          6.0164762e-01,  6.0164762e-01,  6.0164762e-01],
        [-3.8103375e+00, -3.8103375e+00, -3.8103375e+00, ...,
         -1.3640256e+01, -1.3640256e+01, -1.3640256e+01]]], dtype=float32)>, <tf.Tensor: shape=(32,), dtype=float32, numpy=
array([40., 40., 40., 40., 40., 40., 40., 40., 40., 40., 40., 40., 40.,
       40., 40., 40., 40., 40., 40., 40., 40., 40., 40., 40., 40., 40.,
       40., 40., 40., 40., 40., 40.], dtype=float32)>]
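As a quick sanity check, the shapes and the bias gradient above can be reproduced by hand. The sketch below is plain arithmetic, not TensorFlow code. It assumes the kernel layout `(kernel_size, in_channels // groups, filters)` and relies on `tape.gradient` summing over a non-scalar target, so each bias entry receives one unit of gradient per output position.

```python
# Values taken from the snippet above.
batch, steps, in_channels = 4, 10, 128   # inputs of shape (4, 10, 128)
filters, kernel_size, groups = 32, 3, 4  # Conv1D(32, 3, groups=4)

# Assumed kernel layout for a grouped convolution.
kernel_shape = (kernel_size, in_channels // groups, filters)

# padding='same' with the default stride keeps all 10 steps, and each
# bias entry is added to every (batch, step) position of its filter.
bias_grad = batch * steps

print(kernel_shape)
print(bias_grad)
```

This gives a kernel gradient of shape (3, 32, 32) and a bias gradient of 40 per filter, in line with the printed tensors.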

Please find the gist for reference.

Thank you!