Parallelising a custom function in TensorFlow using graph execution

Hi

I’m trying to do sensitivity analysis (forward-mode autodiff) on a matrix and was hoping to parallelise the computations using TensorFlow. Here’s the code that I’m using to test whether something like this is possible in TF:

import tensorflow as tf

def _forward(X, dX, W1, W2):
    # First layer: value Z1 and its directional derivative (tangent) dZ1.
    Z1 = tf.matmul(X, tf.transpose(W1))
    dZ1 = tf.matmul(dX, tf.transpose(W1))

    # tanh activation and its derivative, broadcast over the tangent axis.
    A1 = tf.tanh(Z1)
    dA1 = tf.multiply(tf.expand_dims(1 - tf.square(A1), axis=1), dZ1)

    # Second layer: value and tangent.
    Z2 = tf.matmul(A1, tf.transpose(W2))
    dZ2 = tf.matmul(dA1, tf.transpose(W2))

    return Z2, tf.squeeze(dZ2, axis=-1)

In the code above, the evaluations of Z1 and dZ1 are independent of each other (and likewise for A1 and dA1, and so on), so I was hoping to run each pair of statements in parallel. I wrapped this function in tf.function and was hoping for a speedup compared with the standard way of computing gradients (forward pass plus backprop), because now I’d be running half of the calculations in parallel. However, I don’t see any speedup; both versions take the same time to execute.
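For reference, here is roughly how I’m timing the function (a minimal sketch; the shapes below are made-up placeholders, not my real data, and the first call is excluded so tracing isn’t counted):

import time
import tensorflow as tf

# Hypothetical shapes for illustration only.
X = tf.random.normal([256, 10])
dX = tf.random.normal([256, 10, 10])   # one tangent direction per input feature
W1 = tf.random.normal([64, 10])
W2 = tf.random.normal([1, 64])

fwd = tf.function(_forward)
fwd(X, dX, W1, W2)                      # warm-up call so tracing is not timed

start = time.perf_counter()
for _ in range(100):
    Z2, dZ2 = fwd(X, dX, W1, W2)
_ = dZ2.numpy()  # force the GPU work to finish before stopping the timer
print("avg ms per call:", (time.perf_counter() - start) * 1000 / 100)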

I don’t know if it’s possible to do what I’m trying here. Any help would be appreciated.

Thanks

Have you tried to JIT-compile it?

I just tried this and see no improvement (both functions take the same time to execute). Here are both functions:

@tf.function(jit_compile=True)
def _forward(X, dX, W1, W2):
    # Manual forward mode: propagate each value together with its tangent.
    Z1 = tf.matmul(X, tf.transpose(W1))
    dZ1 = tf.matmul(dX, tf.transpose(W1))

    A1 = tf.tanh(Z1)
    dA1 = tf.multiply(tf.expand_dims(1 - tf.square(A1), axis=1), dZ1)

    Z2 = tf.matmul(A1, tf.transpose(W2))
    dZ2 = tf.matmul(dA1, tf.transpose(W2))

    return Z2, tf.squeeze(dZ2, axis=-1)

@tf.function(jit_compile=True)
def forward(X, W1, W2):
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(X)  # needed if X is a plain tensor rather than a tf.Variable

        Z1 = tf.matmul(X, tf.transpose(W1))

        A1 = tf.tanh(Z1)

        Z2 = tf.matmul(A1, tf.transpose(W2))

    dZ2 = tape.gradient(Z2, X)
    return Z2, dZ2
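As a sanity check on my approach, here is a minimal sketch of the same forward pass using TF’s built-in forward-mode API (tf.autodiff.ForwardAccumulator). Note that this assumes a single tangent direction dX with the same shape as X, which is not exactly what my code above does, so it may not apply directly:

@tf.function
def forward_jvp(X, dX, W1, W2):
    # Built-in forward mode: dX is assumed to have the same shape as X here.
    with tf.autodiff.ForwardAccumulator(primals=X, tangents=dX) as acc:
        Z1 = tf.matmul(X, tf.transpose(W1))
        A1 = tf.tanh(Z1)
        Z2 = tf.matmul(A1, tf.transpose(W2))
    return Z2, acc.jvp(Z2)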

I’ve always used PyTorch for my work, so I might not be doing things properly.

Thanks

Can you log the device placement, if you can isolate this function with a dummy input?

https://www.tensorflow.org/api_docs/python/tf/debugging/set_log_device_placement
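Something like this (a rough sketch; the shapes are placeholders), so the log shows which device each op lands on:

import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # ideally set before any ops run

# Placeholder dummy inputs just to trigger one traced call.
X = tf.random.normal([256, 10])
dX = tf.random.normal([256, 10, 10])
W1 = tf.random.normal([64, 10])
W2 = tf.random.normal([1, 64])

Z2, dZ2 = _forward(X, dX, W1, W2)  # device placement of each op is printed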

You can control some multi-thread parallelism with:

Module: tf.config.threading  |  TensorFlow Core v2.8.0
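For example (just a sketch; these mainly control CPU op scheduling and should be called before TensorFlow starts executing ops):

import tensorflow as tf

# Number of threads used to run independent ops in parallel (inter-op).
tf.config.threading.set_inter_op_parallelism_threads(4)
# Number of threads used inside individual ops such as matmul (intra-op).
tf.config.threading.set_intra_op_parallelism_threads(8)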

I can’t figure out how to make use of these. I don’t understand how I’m supposed to apply them here.

P.S. I’m using a GPU, so I don’t understand how this would help.

Oh, for GPU I suppose that the single-stream execution design is still valid:

The new runtime (TFRT) could probably also handle multiple streams:

https://groups.google.com/a/tensorflow.org/g/tfrt/c/gTfSwZexVQk

But as it also seems to depend on the compiler, we need to ask other team members. /cc @markdaoust @Mehdi_AMINI @Jacques_Pienaar

In the meantime, see more at:

Yes, I think that is a good point: if all the computations are assigned to the GPU and all are executed on a single stream, then you won’t get a speedup from additional parallelism. What I’d do here is normally a bit low level (that’s the area I normally work in :slight_smile: ): dump the graph to see if there are any unexpected (control) edges that are inhibiting parallel execution (these days I’d dump the GraphDef, convert it to TFG with the tfg-translate tool, and then look at the output, as it is readable; before TFG I’d pipe it through to a Graphviz file), and then run with vmodule=executor=1 to see exactly what’s being run and where (it can produce a lot of output even for small graphs).
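For the graph dump, something along these lines (a sketch; the shapes are placeholders, and the tfg-translate step itself is a separate command-line tool):

import tensorflow as tf

# Trace the function for a concrete set of input signatures (placeholder shapes).
concrete = _forward.get_concrete_function(
    tf.TensorSpec([256, 10]), tf.TensorSpec([256, 10, 10]),
    tf.TensorSpec([64, 10]), tf.TensorSpec([1, 64]))

# Write the GraphDef to disk; it can then be inspected directly,
# fed to tfg-translate, or rendered via Graphviz.
tf.io.write_graph(concrete.graph.as_graph_def(), "/tmp", "forward_graph.pbtxt")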

I don’t know who from the TFRT team is on here to ping for comment; let me check :slight_smile:


Do you think that it could be visualized with:

TF still has a single compute stream per session (transfers and NCCL ops each use a different stream). This is not trivial to change because the GPU memory allocator implicitly synchronizes all allocations on that compute stream.

We are implementing multi-stream support in the HLO to TFRT compiler, but that’s not ready yet.


@Jacques_Pienaar @csigg Thank you for your replies.

It is a little bit hard today, with all the compiler and runtime pieces that are still works in progress and moving around, to really understand what kind of code is actually going to be produced and scheduled on the hardware.

I think the gap is still too large between the people working every day on compilers and runtimes and the people who are just trying to figure out performance (or a performance gap) on their high-level, compositional API path.

I hope that we can improve the current situation by reducing this gap, with more usability-oriented documentation and tools on the high-level (user) side of the spectrum.

If not, it will be really hard to interact on a common playground when it comes to performance.

Yes and no: it can be, but that visualization is meant more for ML practitioners and can elide control edges that obscure the model structure but are important for folks debugging scheduling questions.


You are correct. There is an ongoing workstream around performance predictability (well, predictability in general) here. But adjacent to that is what you mention: the communication gap that exists today at the technical level.
