I am curious about how to set the device in TF. I want to implement a custom distributed data-parallel algorithm; for example, I want to split an input tensor x into three parts and transfer each part to a different device.
so basically, I want to
x0, x1, x2 = tf.split(x, num_or_size_splits=3, axis=1)
x0 = x0.to('device:0')
x1 = x1.to('device:1')
x2 = x2.to('device:2')
But this seems quite impossible in TF.
I found something about colocation_graph — should I use that?
lgusm
June 17, 2021, 4:14pm
#2
You can do that using the `with tf.device(...)` context manager.
Does it help?
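For example, a minimal sketch of explicit placement with `tf.device`. Three logical CPU devices are fabricated here (an assumption for illustration, so the snippet runs on any machine); with real accelerators you would use names like `"/device:GPU:0"` instead:

```python
import tensorflow as tf

# Fabricate three logical CPU devices so the example runs anywhere.
# This must happen before the TF runtime is initialized.
cpus = tf.config.list_physical_devices("CPU")
tf.config.set_logical_device_configuration(
    cpus[0], [tf.config.LogicalDeviceConfiguration()] * 3
)

x = tf.random.normal([6, 9])
x0, x1, x2 = tf.split(x, num_or_size_splits=3, axis=1)

parts = []
for i, part in enumerate([x0, x1, x2]):
    with tf.device(f"/device:CPU:{i}"):
        # tf.identity forces a copy of the slice onto the scoped device
        parts.append(tf.identity(part))
```

In eager mode, each `tf.identity` executes on the device named by the enclosing scope, so `parts[i]` lives on device `i`.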
Thanks for the reply, and sorry for the unclear question. The `with` context manager only works in Python, IMHO.
However, if I want to implement data parallelism, I would have to rewrite TF's default pass; in that case, how would I handle this in C++? Because as far as I know, TF's tensors do not carry device information.
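For reference, from the Python side a tensor's current placement can at least be inspected via the `.device` attribute in eager mode (whether that helps at the C++ graph level is a separate question). A quick check, pinning the tensor to the CPU so the result is deterministic:

```python
import tensorflow as tf

# Pin the constant to CPU:0 so the placement string is predictable,
# e.g. "/job:localhost/replica:0/task:0/device:CPU:0".
with tf.device("/device:CPU:0"):
    t = tf.constant([1.0, 2.0])
print(t.device)
```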
lgusm
June 18, 2021, 9:43am
#4
Humm, I don’t know.
Is this for the training step?
I lack the background, but maybe Distributed training with TensorFlow | TensorFlow Core can give some insights
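For context, that guide covers the built-in strategies. A minimal use of one of them, `MirroredStrategy` (which falls back to the CPU when no GPUs are present), looks roughly like this:

```python
import tensorflow as tf

# Variables created inside strategy.scope() are mirrored across
# the strategy's devices (all local GPUs, or the CPU if none).
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential(
        [tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)]
    )
    model.compile(optimizer="sgd", loss="mse")

print(strategy.num_replicas_in_sync)
```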
Bhack
June 18, 2021, 2:11pm
#5
Are you looking to create your own custom distribution strategy?
Because I don't think we officially support this:
opened 02:14PM - 09 Sep 19 UTC · closed 10:51AM - 25 Jun 20 UTC · type:feature · comp:dist-strat · TF 1.14
**System information**
- TensorFlow version (you are using): 1.14.0
- Are you willing to contribute it (Yes/No): Yes
**Describe the feature and the current behavior/state.**
As TensorFlow stands, there is no easy and intuitive way to implement a new distribution strategy. The ones available (MirroredStrategy, MultiWorkerMirroredStrategy, ...) work fine, but the code is very complex, and there isn't a tutorial/guide on how to develop a new one.
**Will this change the current api? How?**
The API could be restructured to ease the support of new distribution strategies. A tutorial/guide on how to develop one would also be appreciated.
**Who will benefit from this feature?**
Researchers who want to reduce the time spent training distributed tensorflow models
**Any Other info.**
Thanks for the reply! Yes, I am trying to create my own custom distribution strategy, but doing this in TF seems to cause a lot of trouble…