I have a cluster of two machines: the first has an RTX 2080 Ti and the second an RTX 3060. I can train a model using tf.distribute.MultiWorkerMirroredStrategy, but, contrary to what I expected, training on a single GPU can be faster. More importantly, the batch does not seem to be split between the two GPUs: with a certain batch size I can train on a single GPU, but I get an error when I use both GPUs. Does anyone know why?
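For context, here is a minimal sketch of how I understand the batching is supposed to work with MultiWorkerMirroredStrategy: the dataset is batched with the *global* batch size, and each replica should then receive global/num_replicas samples per step. The hostnames, ports, and the `per_replica_batch` helper below are hypothetical, just to illustrate the setup I expected:

```python
import json
import os

def per_replica_batch(global_batch_size: int, num_replicas: int) -> int:
    """Hypothetical helper: with MultiWorkerMirroredStrategy the dataset is
    batched with the GLOBAL batch size; each replica is then expected to
    process global_batch_size / num_replicas samples per step."""
    if global_batch_size % num_replicas != 0:
        raise ValueError("global batch size must divide evenly across replicas")
    return global_batch_size // num_replicas

# Hypothetical TF_CONFIG for a two-machine cluster like mine,
# one worker (one GPU) per machine; index is 1 on the second machine.
tf_config = {
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

# What I expected: a global batch of 64 means each GPU sees 32 samples.
print(per_replica_batch(64, 2))  # → 32
```

Given this, I assumed a batch size that fits on one GPU would also fit when split across two, but that is not what I observe.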