Multi-GPU training with Unified Memory

Yechen_Liu · December 10, 2021, 2:51am

Hi everyone,

I’m doing some research on Unified Memory management on multi-GPU systems and trying to compare its performance with explicit copies on some real ML workloads.

The benefits of Unified Memory are:

It allows memory oversubscription.
It improves programmability: programmers don’t need to worry about data placement and movement.

I found a switch, per_process_gpu_memory_fraction, that turns on Unified Memory in TensorFlow. For distributed training on multiple GPUs, I used the tf.distribute.MirroredStrategy API. But the profiling results suggest that TensorFlow only uses Unified Memory to handle memory oversubscription; there are still explicit memory copies between GPU and CPU, or between GPUs.
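For readers who want to reproduce this setup, here is a minimal sketch of how the switch is typically set. It assumes the TF 1.x-compatible `tf.compat.v1.GPUOptions` and `tf.compat.v1.ConfigProto` classes, where a `per_process_gpu_memory_fraction` greater than 1.0 asks TensorFlow to allocate through CUDA Unified Memory. The helper name `make_oversubscribed_config` and the default of 2.0 are hypothetical, chosen only for illustration.

```python
def make_oversubscribed_config(fraction=2.0):
    """Build a ConfigProto that requests more than 100% of GPU memory.

    Assumption: a fraction above 1.0 makes TensorFlow allocate through
    CUDA Unified Memory, so host RAM can back the overflow.
    """
    if fraction <= 1.0:
        raise ValueError("fraction must be > 1.0 to oversubscribe GPU memory")

    # Imported lazily so the argument check above works without TensorFlow.
    import tensorflow as tf

    gpu_options = tf.compat.v1.GPUOptions(
        per_process_gpu_memory_fraction=fraction
    )
    return tf.compat.v1.ConfigProto(gpu_options=gpu_options)
```

The returned config would be passed to `tf.compat.v1.Session(config=...)` before any GPU is initialized. This only changes how memory is allocated; it does not by itself remove the explicit copies seen in the profile.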

I’m wondering whether there’s a way in TensorFlow to train on multiple GPUs and fully exploit Unified Memory, for example by letting the memory system manage the data.

System information

TensorFlow version (you are using): 2.4
CUDA version: 11.0
cuDNN version: 8.0

Thanks

Bhack · December 10, 2021, 1:20pm

Have you tried these environment variables on TF 2.7?

github.com/tensorflow/tensorflow

[PJRT] Allow GPU memory oversubscription when unified memory is enabled.

committed 11:20AM - 09 Jul 21 UTC

tensorflower-gardener

+5 -2

With this CL, we can enable GPU memory oversubscription via env flags. For example, `TF_FORCE_UNIFIED_MEMORY=1 XLA_PYTHON_CLIENT_MEM_FRACTION=8.0` provides 8x the GPU memory to the program. The 'extra' memory is physically located on the other GPU devices and the host's RAM, with swapping done transparently by CUDA.

PiperOrigin-RevId: 383819164
Change-Id: Id139d3184d3a62983c1e86bf95ca4078a08db4f4
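As a rough sketch of how these flags might be applied, the Python helper below builds a child-process environment with both variables set and launches a training script under it. The names `unified_memory_env`, `run_with_unified_memory`, and the script path are hypothetical. The commit is tagged [PJRT], so whether a particular TensorFlow training path honors the flags is an assumption that should be checked by profiling.

```python
import os
import subprocess
import sys


def unified_memory_env(mem_fraction=8.0, base_env=None):
    """Return a copy of the environment with the two flags from the commit set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_FORCE_UNIFIED_MEMORY"] = "1"
    # A value above 1.0 requests more memory than one GPU physically has.
    env["XLA_PYTHON_CLIENT_MEM_FRACTION"] = str(mem_fraction)
    return env


def run_with_unified_memory(script, mem_fraction=8.0):
    """Run a (hypothetical) training script in a child process with the flags set."""
    return subprocess.run(
        [sys.executable, script],
        env=unified_memory_env(mem_fraction),
        check=True,
    )
```

Setting the variables for a child process, rather than mutating `os.environ` after the framework is imported, avoids the case where the flags are read before they are set.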