Integrating Distributed Matrix Multiplication into TensorFlow

Would it be possible to integrate a distributed matrix-matrix multiplication algorithm like COSMA into TensorFlow?

COSMA is a communication-optimal, GPU-accelerated algorithm for matrix multiplication that is already used in some HPC applications with great performance results! It has been ported to both NVIDIA and AMD GPUs and can take advantage of fast GPU-to-GPU interconnects like NVLink through NCCL or GPU-aware MPI.

I am one of the main developers of COSMA and would be happy to integrate it into TensorFlow.

Let me know your thoughts on that!