Benefits of Adding Hardware Support for Broadcast and Reduce Operations in MPSoC Applications

Author:

Peng Yuanxi (1), Saldaña Manuel (2), Madill Christopher A. (2), Zou Xiaofeng (1), Chow Paul (3)

Affiliation:

1. National University of Defense Technology, Hunan, P. R. China

2. ArchES Computing Systems, Toronto, ON, Canada

3. University of Toronto, ON, Canada

Abstract

MPI has been used as a parallel programming model for supercomputers and clusters and recently in MultiProcessor Systems-on-Chip (MPSoC). One component of MPI is collective communication and its performance is key for certain parallel applications to achieve good speedups. Previous work showed that, with synthetic communication-only benchmarks, communication improvements of up to 11.4-fold and 22-fold for broadcast and reduce operations, respectively, can be achieved by providing hardware support at the network level in a Network-on-Chip (NoC). However, these numbers do not provide a good estimation of the advantage for actual applications, as there are other factors that affect performance besides communications, such as computation. To this end, we extend our previous work by evaluating the impact of hardware support over a set of five parallel application kernels of varying computation-to-communication ratios. By introducing some useful computation to the performance evaluation, we obtain more representative results of the benefits of adding hardware support for broadcast and reduce operations. The experiments show that applications with lower computation-to-communication ratios benefit the most from hardware support as they highly depend on efficient collective communications to achieve better scalability. We also extend our work by doing more analysis on clock frequency, resource usage, power, and energy. The results show reasonable scalability for resource utilization and power in the network interfaces as the number of channels increases and that, even though more power is dissipated in the network interfaces due to the added hardware, the total energy used can still be less if the actual speedup is sufficient. The application kernels are executed in a 24-embedded-processor system distributed across four FPGAs.
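The two qualitative claims in the abstract can be illustrated with a small back-of-the-envelope model. This is only a sketch under stated assumptions, not the authors' evaluation method: it assumes that hardware support speeds up only the communication part of a run, that computation time is unchanged, and that energy is average power times execution time. The function names, the example ratios, and the 1.2x power increase are hypothetical; only the 11.4-fold broadcast figure comes from the abstract.

```python
def app_speedup(comp_to_comm_ratio, comm_speedup):
    """Overall speedup when only the communication part is accelerated.

    comp_to_comm_ratio: computation time divided by communication time in
        the software-only baseline (hypothetical input, not from the paper).
    comm_speedup: factor by which collective communication gets faster.
    """
    baseline = comp_to_comm_ratio + 1.0              # communication time normalized to 1
    accelerated = comp_to_comm_ratio + 1.0 / comm_speedup
    return baseline / accelerated


def energy_ratio(power_increase, speedup):
    """Energy with hardware support relative to the baseline.

    Assumes E = P * t with a constant average power, so the ratio is
    (power_increase * P * t / speedup) / (P * t).
    """
    return power_increase / speedup


if __name__ == "__main__":
    # 11.4 is the broadcast improvement quoted in the abstract;
    # the ratios and the 1.2x power increase are made-up illustration values.
    for ratio in (0.1, 1.0, 10.0):
        s = app_speedup(ratio, comm_speedup=11.4)
        e = energy_ratio(power_increase=1.2, speedup=s)
        print(f"ratio={ratio:5.1f}  speedup={s:5.2f}  relative energy={e:4.2f}")
```

Under this model, a kernel with a low computation-to-communication ratio approaches the full communication speedup, while a computation-heavy kernel gains little. Energy drops below the baseline only when the application speedup exceeds the factor by which power increased.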

Funder

Aeronautical Science Foundation of China

Key Project of NUDT

Natural Sciences and Engineering Research Council of Canada

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Cited by 4 articles.

1. Collective Communication on FPGA Clusters with Static Scheduling;ACM SIGARCH Computer Architecture News;2017-01-11

2. Finding Space-Time Stream Permutations for Minimum Memory and Latency;2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM);2016-05

3. CORDIC-Based Enhanced Systolic Array Architecture for QR Decomposition;ACM Transactions on Reconfigurable Technology and Systems;2016-02-03

4. An Enhanced Adaptive Recoding Rotation CORDIC;ACM Transactions on Reconfigurable Technology and Systems;2015-11-24
