Detailed CUDA Implementation of TensorFlow

dl_xiaocaiji · November 20, 2021, 12:52am

We are investigating how to deploy TensorFlow 2 with custom ops and deep learning models in our product. Our current understanding is that a session is where a TensorFlow graph is executed, and that the graph may include both built-in deep learning ops and self-defined (custom) ops. To choose the best software architecture for our product, we would like to know how a TensorFlow session is implemented on top of CUDA. Specifically, we would like to understand stream management, memory copies, and compute execution within a session. Going deeper, we would also like to understand the implementation at the OS level, e.g., running different sessions in a single process versus in multiple processes. Is there a TF2 document that covers these questions?

Bhack · November 22, 2021, 4:08pm

I suppose that something is going to change with the new TensorFlow runtime:

GitHub issue in github.com/tensorflow/tensorflow: "How to create new cuda stream in custom op", opened by chaaland at 10:37PM - 10 Feb 20 UTC, closed 07:03AM - 03 Apr 20 UTC. Labels: stat:awaiting tensorflower, type:support, comp:gpu.

I've written a custom op for the GPU that takes X with shape `(n,p)` and Y with shape `(m,p)` and returns Z with shape `(n,m)`. In the backward pass, dX and dY are independent computations, so I'd like to launch two separate CUDA kernels and have them run in separate streams. I know the cuDNN RNN is capable of running kernels in parallel, but I cannot find the source code for how to do this. It seems the only usable stream is the one provided by the OpKernelContext. Matrix multiplication must have a similar parallelism when doing the backward pass, but I have not found anywhere in the source that makes use of multiple streams. This is essentially a reopening of [this issue](https://github.com/tensorflow/tensorflow/issues/6675), as there doesn't appear to be a public answer.
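
For readers unfamiliar with the idea in the issue, here is a minimal standalone CUDA sketch of launching two independent gradient kernels on separate streams. It is not TensorFlow code and does not use the OpKernelContext stream. It assumes, purely for illustration, that the forward op is Z = X·Yᵀ, so that dX = dZ·Y and dY = dZᵀ·X; the issue only gives the shapes. The kernel names `grad_x` and `grad_y`, the helper `to_device`, and the sizes are all hypothetical.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                           \
  do {                                                             \
    cudaError_t err_ = (call);                                     \
    if (err_ != cudaSuccess) {                                     \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                   cudaGetErrorString(err_), __FILE__, __LINE__);  \
      std::exit(1);                                                \
    }                                                              \
  } while (0)

// dX[i][k] = sum_j dZ[i][j] * Y[j][k]; dX is n x p (assumed forward: Z = X * Y^T).
__global__ void grad_x(const float* dZ, const float* Y, float* dX, int n, int m, int p) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n * p) return;
  int i = idx / p, k = idx % p;
  float acc = 0.f;
  for (int j = 0; j < m; ++j) acc += dZ[i * m + j] * Y[j * p + k];
  dX[idx] = acc;
}

// dY[j][k] = sum_i dZ[i][j] * X[i][k]; dY is m x p.
__global__ void grad_y(const float* dZ, const float* X, float* dY, int n, int m, int p) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= m * p) return;
  int j = idx / p, k = idx % p;
  float acc = 0.f;
  for (int i = 0; i < n; ++i) acc += dZ[i * m + j] * X[i * p + k];
  dY[idx] = acc;
}

// Allocate a device buffer and fill it from a host vector (blocking copy).
static float* to_device(const std::vector<float>& h) {
  float* d = nullptr;
  CUDA_CHECK(cudaMalloc(&d, h.size() * sizeof(float)));
  CUDA_CHECK(cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice));
  return d;
}

int main() {
  const int n = 64, m = 48, p = 32;  // arbitrary illustrative sizes
  std::vector<float> X(n * p), Y(m * p), dZ(n * m);
  for (int i = 0; i < n * p; ++i) X[i] = 0.01f * (i % 7);
  for (int i = 0; i < m * p; ++i) Y[i] = 0.02f * (i % 5);
  for (int i = 0; i < n * m; ++i) dZ[i] = 0.03f * (i % 3);

  float *X_d = to_device(X), *Y_d = to_device(Y), *dZ_d = to_device(dZ);
  float *dX_d = nullptr, *dY_d = nullptr;
  CUDA_CHECK(cudaMalloc(&dX_d, n * p * sizeof(float)));
  CUDA_CHECK(cudaMalloc(&dY_d, m * p * sizeof(float)));
  CUDA_CHECK(cudaDeviceSynchronize());  // make sure all inputs are on the device

  // One stream per gradient: the two kernels have no ordering between them,
  // so the GPU is allowed (but not guaranteed) to overlap them.
  cudaStream_t sx, sy;
  CUDA_CHECK(cudaStreamCreate(&sx));
  CUDA_CHECK(cudaStreamCreate(&sy));
  const int threads = 256;
  grad_x<<<(n * p + threads - 1) / threads, threads, 0, sx>>>(dZ_d, Y_d, dX_d, n, m, p);
  grad_y<<<(m * p + threads - 1) / threads, threads, 0, sy>>>(dZ_d, X_d, dY_d, n, m, p);
  CUDA_CHECK(cudaGetLastError());
  CUDA_CHECK(cudaStreamSynchronize(sx));
  CUDA_CHECK(cudaStreamSynchronize(sy));

  std::vector<float> dX(n * p), dY(m * p);
  CUDA_CHECK(cudaMemcpy(dX.data(), dX_d, dX.size() * sizeof(float), cudaMemcpyDeviceToHost));
  CUDA_CHECK(cudaMemcpy(dY.data(), dY_d, dY.size() * sizeof(float), cudaMemcpyDeviceToHost));

  // Compare both gradients against a straightforward CPU reference.
  float max_err = 0.f;
  for (int i = 0; i < n; ++i)
    for (int k = 0; k < p; ++k) {
      float ref = 0.f;
      for (int j = 0; j < m; ++j) ref += dZ[i * m + j] * Y[j * p + k];
      max_err = std::fmax(max_err, std::fabs(ref - dX[i * p + k]));
    }
  for (int j = 0; j < m; ++j)
    for (int k = 0; k < p; ++k) {
      float ref = 0.f;
      for (int i = 0; i < n; ++i) ref += dZ[i * m + j] * X[i * p + k];
      max_err = std::fmax(max_err, std::fabs(ref - dY[j * p + k]));
    }
  std::printf("max abs error vs CPU reference: %g\n", max_err);

  CUDA_CHECK(cudaStreamDestroy(sx));
  CUDA_CHECK(cudaStreamDestroy(sy));
  cudaFree(X_d); cudaFree(Y_d); cudaFree(dZ_d); cudaFree(dX_d); cudaFree(dY_d);
  return max_err < 1e-4f ? 0 : 1;
}
```

This should build with something like `nvcc two_streams.cu -o two_streams` (the file name is just an example). Inside a real TensorFlow custom op, any extra streams would also have to be ordered against the stream TensorFlow itself uses for the op's inputs and outputs, for example with CUDA events. The sketch does not cover that part, which is the part the issue is actually asking about.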

dl_xiaocaiji · November 24, 2021, 7:44pm

Thanks for the reply. I am wondering whether there is an efficient way to understand TensorFlow thoroughly at the source-code level, e.g., whether a diagram exists that shows the relationships among the classes in TF. I will probably need to follow the TensorFlow implementation closely to track architecture-level changes from older versions to newer ones.

Bhack · November 24, 2021, 9:53pm

You can visualize some dependency graphs between sub-components with:

More generally, for the internals, we already have a thread at:

https://tensorflow-prod.ospodiscourse.com/t/how-to-dive-deep-into-tensorflow-internals/4250

You can try to post there if you are looking for something more.