HierarchicalCopyAllReduce is extremely slow

I found that HierarchicalCopyAllReduce is much slower than NcclAllReduce. Related issue: "issues of multi-Gpus training" (Issue #971, google/automl on GitHub). Any ideas?
