We have a model that consumes multiple ragged tensors in a batch. Our model runs perfectly fine on a single GPU. But the moment we introduce distributed training, its evaluation fails.
Note that training in the distributed setting proceeds smoothly; it is only during evaluation that it fails. Since we cannot provide the original data and model, we are providing a minimal snippet in the following notebook that reproduces the issue. You can reproduce it either in this Colab or on a multi-GPU machine. We have verified on both and the issue persists.
More details are available here:
opened 11:35AM - 05 Nov 21 UTC
type:bug/performance
**System information**
- Have I written custom code (as opposed to using a stock example script provided in Keras): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Colab and Debian 10
- TensorFlow installed from (source or binary): Binary
- TensorFlow version (use command below): 2.6.0
- Python version:
- Bazel version (if compiling from source):
- GPU model and memory: V100 (16 GB)
- Exact command to reproduce:
**Describe the problem**
We have a model that consumes multiple ragged tensors in a batch. Our model runs perfectly fine on a single GPU. But the moment we introduce distributed training, its evaluation fails.
Note that training in the distributed setting proceeds smoothly; it is only during evaluation that it fails. Since we cannot provide the original data and model, we are providing a minimal snippet in the following notebook that reproduces the issue. You can reproduce it either in Colab or on a multi-GPU machine. We have verified on both and the issue persists.
**Describe the current behavior**
Model consuming RaggedTensors fails during evaluation in a distributed setting.
**Describe the expected behavior**
The model should run during evaluation without any errors when exposed to a distributed setting.
**[Contributing](https://github.com/keras-team/keras/blob/master/CONTRIBUTING.md)**
- Do you want to contribute a PR? (yes/no): No.
- If yes, please read [this page](https://github.com/keras-team/keras/blob/master/CONTRIBUTING.md) for instructions
- Briefly describe your candidate solution(if contributing):
**Standalone code to reproduce the issue**
Colab Notebook: https://colab.research.google.com/drive/1U9oeed5OMAH1KvN5T455kAsB2Nsh1-KF?usp=sharing
**Source code / logs**
```
ValueError: in user code:
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:1330 test_function *
return step_function(self, iterator)
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:1319 step_function **
data = next(iterator)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:693 __next__
return self.get_next()
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:744 get_next
self, self._strategy, return_per_replica=False)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:611 _get_next_as_optional
iterator._iterators[i].get_next_as_list()) # pylint: disable=protected-access
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:1990 get_next_as_list
strict=True,
/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206 wrapper
return target(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/deprecation.py:549 new_func
return func(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/control_flow_ops.py:1254 cond
return cond_v2.cond_v2(pred, true_fn, false_fn, name)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/cond_v2.py:95 cond_v2
op_return_value=pred)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py:1007 func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:1989 <lambda>
lambda: _dummy_tensor_fn(data.element_spec),
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:1853 _dummy_tensor_fn
return nest.map_structure(create_dummy_tensor, value_structure)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/nest.py:869 map_structure
structure[0], [func(*x) for x in entries],
/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/nest.py:869 <listcomp>
structure[0], [func(*x) for x in entries],
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:1849 create_dummy_tensor
dummy_tensor, (row_splits,) * spec._ragged_rank, validate=False)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206 wrapper
return target(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/ragged/ragged_tensor.py:745 from_nested_row_splits
result = cls.from_row_splits(result, splits, validate=validate)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206 wrapper
return target(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/ragged/ragged_tensor.py:454 from_row_splits
return cls._from_row_partition(values, row_partition, validate=validate)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/ragged/ragged_tensor.py:348 _from_row_partition
return cls(values=values, internal=True, row_partition=row_partition)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/ragged/ragged_tensor.py:294 __init__
values.shape.with_rank_at_least(1)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/tensor_shape.py:1078 with_rank_at_least
raise ValueError("Shape %s must have rank at least %d" % (self, rank))
ValueError: Shape () must have rank at least 1
```
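Reading the last frames of the traceback, the failure seems to come from `create_dummy_tensor` calling `RaggedTensor.from_nested_row_splits` with flat values whose shape is `()`. The snippet below is only a sketch of that final check in isolation, under the assumption that this reading is right. The scalar values, the `row_splits` contents, and the helper name `build_ragged_from_scalar` are made up for illustration and are not taken from `input_lib.py` or from our model.

```python
import tensorflow as tf


def build_ragged_from_scalar():
    """Try to build a RaggedTensor whose flat values are a scalar (shape ())."""
    # Hypothetical stand-ins for the dummy tensor and row splits in the traceback.
    scalar_values = tf.zeros([], dtype=tf.float32)   # rank 0, shape ()
    row_splits = tf.constant([0], dtype=tf.int64)    # a single split, i.e. zero rows
    # Same public call that appears in the traceback, with validate=False.
    return tf.RaggedTensor.from_nested_row_splits(
        scalar_values, (row_splits,), validate=False)


try:
    build_ragged_from_scalar()
    error_message = None
except ValueError as err:
    # RaggedTensor requires its values to have rank >= 1, so a scalar is rejected.
    error_message = str(err)

print(error_message)
```

If this raises the same `ValueError` as in the log, it would suggest the problem lies in how the dummy ragged batch is constructed for evaluation rather than in the model itself.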
Cc: @Nilabhra
Cc: @anon1529149