I’m doing some research on Unified Memory Management on Multi-GPUs system and trying to compare the performance with explicit copy on some real ML workloads.
The benefits from Unified Memory are
- Allow memory oversubscription
- Improve programmability, programmers don’t need to worry about data placement and movement
I found there’s a switch per_process_gpu_memory_fraction to turn on Unified Memory in tensorflow. For distributed training on multi GPUs, I used tf.distribute.MirroredStrategy API. But from profiling result, it seems that tensorflow just leverage Unified Memory to overcome memory oversubscription, there are still explicit memory copies between GPU and CPU, or GPU and GPU.
I’m wondering if there’s a way to train on multi GPUs and fully explore the power of Unified Memory, like letting memory system manage the data, in tensorflow.
- TensorFlow version (you are using): 2.4
- CUDA version: 11.0
- cudnn version: 8.0