Parallelising a custom function in TensorFlow using graph execution

Hi

I’m trying to do sensitivity analysis (forward-mode autodiff) on a matrix and was hoping to parallelise the computations using TensorFlow. Here’s the code that I’m using to test whether something like this is possible in TF:

import tensorflow as tf

def _forward(X, dX, W1, W2):
    # First layer: value Z1 and its directional derivative (tangent) dZ1.
    Z1 = tf.matmul(X, tf.transpose(W1))
    dZ1 = tf.matmul(dX, tf.transpose(W1))

    # tanh activation and its derivative, broadcast over the tangent axis.
    A1 = tf.tanh(Z1)
    dA1 = tf.multiply(tf.expand_dims(1 - tf.square(A1), axis=1), dZ1)

    # Second layer: value and tangent.
    Z2 = tf.matmul(A1, tf.transpose(W2))
    dZ2 = tf.matmul(dA1, tf.transpose(W2))

    return Z2, tf.squeeze(dZ2, axis=-1)

In the code above, the evaluations of Z1 and dZ1 are independent of each other (and likewise for A1 and dA1, and so on), so I was hoping to run each pair of statements in parallel. I wrapped this function in tf.function and was hoping for a speedup compared with the standard way of computing gradients (forward pass plus backprop), because now I’d be running half of the calculations in parallel. However, I don’t see any speedup; both versions take the same time to execute.
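For reference, here is roughly how I’m timing the function (a minimal sketch; the shapes below are made-up placeholders, not my real data, and the first call is excluded so tracing isn’t counted):

import time
import tensorflow as tf

# Hypothetical shapes for illustration only.
X = tf.random.normal([256, 10])
dX = tf.random.normal([256, 10, 10])   # one tangent direction per input feature
W1 = tf.random.normal([64, 10])
W2 = tf.random.normal([1, 64])

fwd = tf.function(_forward)
fwd(X, dX, W1, W2)                      # warm-up call so tracing is not timed

start = time.perf_counter()
for _ in range(100):
    Z2, dZ2 = fwd(X, dX, W1, W2)
_ = dZ2.numpy()  # force the GPU work to finish before stopping the timer
print("avg ms per call:", (time.perf_counter() - start) * 1000 / 100)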

I don’t know if it’s possible to do what I’m trying here. Any help would be appreciated.

Thanks

Have you tried to JIT-compile it?

I just tried this and see no improvement (both functions take the same time to execute). Here are both functions:

@tf.function(jit_compile=True)
def _forward(X, dX, W1, W2):
    # Manual forward mode: propagate each value together with its tangent.
    Z1 = tf.matmul(X, tf.transpose(W1))
    dZ1 = tf.matmul(dX, tf.transpose(W1))

    A1 = tf.tanh(Z1)
    dA1 = tf.multiply(tf.expand_dims(1 - tf.square(A1), axis=1), dZ1)

    Z2 = tf.matmul(A1, tf.transpose(W2))
    dZ2 = tf.matmul(dA1, tf.transpose(W2))

    return Z2, tf.squeeze(dZ2, axis=-1)

@tf.function(jit_compile=True)
def forward(X, W1, W2):
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(X)  # needed if X is a plain tensor rather than a tf.Variable

        Z1 = tf.matmul(X, tf.transpose(W1))

        A1 = tf.tanh(Z1)

        Z2 = tf.matmul(A1, tf.transpose(W2))

    dZ2 = tape.gradient(Z2, X)
    return Z2, dZ2
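As a sanity check on my approach, here is a minimal sketch of the same forward pass using TF’s built-in forward-mode API (tf.autodiff.ForwardAccumulator). Note that this assumes a single tangent direction dX with the same shape as X, which is not exactly what my code above does, so it may not apply directly:

@tf.function
def forward_jvp(X, dX, W1, W2):
    # Built-in forward mode: dX is assumed to have the same shape as X here.
    with tf.autodiff.ForwardAccumulator(primals=X, tangents=dX) as acc:
        Z1 = tf.matmul(X, tf.transpose(W1))
        A1 = tf.tanh(Z1)
        Z2 = tf.matmul(A1, tf.transpose(W2))
    return Z2, acc.jvp(Z2)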

I’ve always used PyTorch for my work, so I might not be doing things properly.

Thanks

Can you log the device placement, if you can isolate this function with a dummy input?

https://www.tensorflow.org/api_docs/python/tf/debugging/set_log_device_placement
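Something like this (a rough sketch; the shapes are placeholders), so the log shows which device each op lands on:

import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # ideally set before any ops run

# Placeholder dummy inputs just to trigger one traced call.
X = tf.random.normal([256, 10])
dX = tf.random.normal([256, 10, 10])
W1 = tf.random.normal([64, 10])
W2 = tf.random.normal([1, 64])

Z2, dZ2 = _forward(X, dX, W1, W2)  # device placement of each op is printed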

You can control some multi-thread parallelism with:

Module: tf.config.threading  |  TensorFlow Core v2.8.0
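For example (just a sketch; these mainly control CPU op scheduling and should be called before TensorFlow starts executing ops):

import tensorflow as tf

# Number of threads used to run independent ops in parallel (inter-op).
tf.config.threading.set_inter_op_parallelism_threads(4)
# Number of threads used inside individual ops such as matmul (intra-op).
tf.config.threading.set_intra_op_parallelism_threads(8)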

I can’t figure out how to make use of these. I don’t understand how I’m supposed to apply them here.

P.S. I’m using a GPU, so I don’t understand how this would help.

Oh, for GPU I suppose that the single-stream execution design is still valid:

The new runtime (TFRT) could probably also handle multiple streams:

https://groups.google.com/a/tensorflow.org/g/tfrt/c/gTfSwZexVQk

But as it also seems to depend on the compiler, we need to ask other team members. /cc @markdaoust @Mehdi_AMINI @Jacques_Pienaar

In the meantime, see more at:

Yes, I think that is a good point: if all the computations are assigned to the GPU and all are executed on a single stream, then you won’t get a speedup from additional parallelism. What I’d do here is normally a bit low level (that’s the area I normally work in :slight_smile: ): dump the graph to see if there are any unexpected (control) edges that are inhibiting parallel execution (these days I’d dump the GraphDef, convert it to TFG with the tfg-translate tool, and then look at the output, as it is readable; before TFG I’d pipe it through to a Graphviz file), and then run with vmodule=executor=1 to see exactly what’s being run and where (it can produce a lot of output even for small graphs).
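For the graph dump, something along these lines (a sketch; the shapes are placeholders, and the tfg-translate step itself is a separate command-line tool):

import tensorflow as tf

# Trace the function for a concrete set of input signatures (placeholder shapes).
concrete = _forward.get_concrete_function(
    tf.TensorSpec([256, 10]), tf.TensorSpec([256, 10, 10]),
    tf.TensorSpec([64, 10]), tf.TensorSpec([1, 64]))

# Write the GraphDef to disk; it can then be inspected directly,
# fed to tfg-translate, or rendered via Graphviz.
tf.io.write_graph(concrete.graph.as_graph_def(), "/tmp", "forward_graph.pbtxt")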

I don’t know who from the TFRT team is on here to ping for comment; let me check :slight_smile:


Do you think that it could be visualized with:

TF still has a single compute stream per session (transfers and NCCL ops each use a different stream). This is not trivial to change because the GPU memory allocator implicitly synchronizes all allocations on that compute stream.

We are implementing multi-stream support in the HLO to TFRT compiler, but that’s not ready yet.


@Jacques_Pienaar @csigg Thank you for your replies.

It is a little bit hard today, with all the compiler and runtime pieces that are still works in progress and moving around, to really understand what kind of code is actually going to be produced and scheduled on the hardware.

I think the gap is still too large between the people working every day on compilers and runtimes and the people who are just trying to figure out performance (or a performance gap) on their high-level, compositional API path.

I hope that we can improve the current situation by reducing this gap, with more usability-oriented documentation and tools on the high-level (user) side of the spectrum.

If not, it will be really hard to interact on a common playground when it comes to performance.

Yes and no: it can be, but that visualization is meant more for ML practitioners and can elide control edges that obscure the model structure but are important for folks debugging scheduling questions.


You are correct. There is an ongoing workstream around performance predictability (well, predictability in general) here. But adjacent to that is what you mention: the communication gap that exists today at the technical level.
