HierarchicalCopyAllReduce is extremely slow

I found that HierarchicalCopyAllReduce is much slower than NcclAllReduce (related: https://github.com/google/automl/issues/971). Any ideas?
