Detailed CUDA Implementation of TensorFlow

We are investigating how to deploy TensorFlow 2 for custom ops and deep learning in our product. Our current understanding is that a session is where a TensorFlow graph is executed, and that the graph may include both built-in deep learning ops and self-defined (custom) ops. To choose the best software architecture for our product, we would like to know how a TensorFlow session is implemented on top of CUDA. Specifically, we would like to understand how a session handles stream management, memory copies, and compute execution. Going deeper, we would like to understand the implementation at the OS level, e.g., what happens when different sessions run in a single process versus in multiple processes. Is there a particular TF2 document that covers these questions?

I suppose that some of this is going to change with the new TensorFlow runtime:

Thanks for the reply. I am wondering whether there is an efficient way to understand TensorFlow thoroughly at the source-code level, e.g., whether there is a diagram that shows the relationships among the classes in TF. I would probably need to keep following the TensorFlow implementation to track architecture-level changes from older versions to newer ones.

You can visualize some dependency graphs between sub-components with:

More generally, for the internals, we already have a thread at:

You can try to post there if you are looking for something more.