Bug? TimeDistributed model compatibility

Just came across something rather strange. It seems like some architectures in keras.applications do not work directly with TimeDistributed.

For example, take the three architectures MobileNetV2, MobileNetV3Small, and ConvNeXtSmall. MobileNetV2 works, but the other two do not.

import keras
from keras.applications import MobileNetV2, MobileNetV3Small, ConvNeXtSmall

# input: a sequence of 8 frames of 224x224 RGB images
input_ = keras.layers.Input(shape=(8, 224, 224, 3))

# base_model = MobileNetV2(include_top=True, input_shape=(224, 224, 3))  # works
base_model = MobileNetV3Small(include_top=True, input_shape=(224, 224, 3))  # fails
# base_model = ConvNeXtSmall(include_top=True, input_shape=(224, 224, 3))  # fails

# apply the base model to every frame in the sequence
output = keras.layers.TimeDistributed(base_model)(input_)
model = keras.Model(inputs=input_, outputs=output)

Tested with Python 3.8.10 and keras-nightly==2.10.0.dev2022060507. I have also tested with older versions of TF/Keras, but since ConvNeXt is a new model application, I could only observe this behaviour for ConvNeXt in the nightly build. For MobileNetV3, however, I saw the same error with TF==2.8.0.

If you look at the error prompts below, you can see that Keras complains about two layers whose compute_output_shape cannot be resolved dynamically. I have tried enabling eager mode, but the behaviour is the same. Perhaps I am doing something wrong?
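For context, compute_output_shape is the method Keras calls to map an input shape to an output shape symbolically, without running any tensors; TimeDistributed relies on it to build the inner model's output spec. The two offending layers (TFOpLambda, which wraps raw TF ops, and ConvNeXt's LayerScale) apparently don't implement it. For an element-wise layer such as LayerScale the method would be trivial, since the output shape equals the input shape. A plain-Python sketch of the contract (hypothetical class, not the real Keras implementation):

```python
class LayerScaleSketch:
    """Hypothetical stand-in for ConvNeXt's LayerScale layer.

    LayerScale multiplies its input element-wise by a learned
    per-channel vector, so shape inference is the identity map.
    """

    def compute_output_shape(self, input_shape):
        # element-wise op: output shape is identical to input shape
        return input_shape


layer = LayerScaleSketch()
print(layer.compute_output_shape((None, 7, 7, 768)))  # (None, 7, 7, 768)
```

If this is indeed all that is missing, adding such a method to the layer definitions inside keras.applications might be the actual fix.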

Error prompt for MobileNetV3:

Traceback (most recent call last):
File ".\test_timedistributed.py", line 12, in
output = keras.layers.TimeDistributed(base_model)(input_)
File "C:\Users\47955\workspace\sandbox\venv\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\47955\workspace\sandbox\venv\lib\site-packages\keras\engine\base_layer.py", line 879, in compute_output_shape
raise NotImplementedError(
NotImplementedError: Exception encountered when calling layer "time_distributed" (type TimeDistributed).

Please run in eager mode or implement the compute_output_shape method on your layer (TFOpLambda).

Call arguments received by layer "time_distributed" (type TimeDistributed):
• inputs=tf.Tensor(shape=(None, 8, 224, 224, 3), dtype=float32)
• training=False
• mask=None

and for ConvNeXtSmall:

Traceback (most recent call last):
File ".\test_timedistributed.py", line 10, in
output = keras.layers.TimeDistributed(base_model)(input_)
File "C:\Users\47955\workspace\sandbox\venv\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\47955\workspace\sandbox\venv\lib\site-packages\keras\engine\base_layer.py", line 879, in compute_output_shape
raise NotImplementedError(
NotImplementedError: Exception encountered when calling layer "time_distributed" (type TimeDistributed).

Please run in eager mode or implement the compute_output_shape method on your layer (LayerScale).

Call arguments received by layer "time_distributed" (type TimeDistributed):
• inputs=tf.Tensor(shape=(None, 8, 224, 224, 3), dtype=float32)
• training=None
• mask=None

As this seems like a bug inside Keras, I have posted an issue in the keras repo.
However, if anyone spots a simple fix, please let me know.
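One possible workaround (a sketch I have not verified against these exact Keras versions) is to avoid TimeDistributed entirely: fold the time axis into the batch axis with a Reshape, apply the base model once, then unfold again. The dimension bookkeeping can be illustrated with NumPy, using a toy per-frame function in place of the base model:

```python
import numpy as np


def per_frame_model(x):
    # stand-in for base_model: maps (batch, 224, 224, 3) -> (batch, 1000);
    # here we just average over the spatial/channel axes to keep it cheap
    return x.mean(axis=(1, 2, 3))[:, None] * np.ones((1, 1000))


batch, time = 2, 8
frames = np.random.rand(batch, time, 224, 224, 3)

# fold time into batch, apply the model once, then unfold again
flat = frames.reshape(batch * time, 224, 224, 3)
out = per_frame_model(flat)           # (batch * time, 1000)
out = out.reshape(batch, time, 1000)  # (batch, time, 1000)

# equivalent to applying the model to each frame separately
ref = np.stack([per_frame_model(frames[:, t]) for t in range(time)], axis=1)
assert np.allclose(out, ref)
print(out.shape)  # (2, 8, 1000)
```

In Keras this would correspond to sandwiching the base model between two keras.layers.Reshape layers (or tf.reshape calls in a functional model), which sidesteps TimeDistributed's shape inference altogether. Note that a plain Reshape cannot handle a dynamic (None) time axis, so this sketch assumes a fixed sequence length, as in the repro above.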