1. Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning
2. Communication-optimal parallel algorithm for Strassen's matrix multiplication
3. Neil Band. 2020. MemFlow: Memory-Aware Distributed Deep Learning. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2883--2885.
4. Zhenkun Cai, Xiao Yan, Yidi Wu, Kaihao Ma, James Cheng, and Fan Yu. 2021. DGCL: An efficient communication library for distributed GNN training. In Proceedings of the Sixteenth European Conference on Computer Systems. 130--144.
5. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication