PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs

Author:

Moazin Khatti (1), Xingyu Tian (1), Ahmad Sedigh Baroughi (1), Akhil Raj Baranwal (1), Yuze Chi (2), Licheng Guo (2), Jason Cong (2), Zhenman Fang (1)

Affiliation:

1. School of Engineering Science, Simon Fraser University, Canada

2. Computer Science Department, University of California, Los Angeles, United States

Abstract

In recent years, the adoption of FPGAs in datacenters has increased, and a growing number of users choose High-Level Synthesis (HLS) as their preferred programming method. While HLS simplifies FPGA programming, a notable challenge arises when scaling designs up for modern datacenter FPGAs that comprise multiple dies: the extra delays introduced by die crossings and routing congestion can significantly degrade the frequency of large designs on these boards. Due to the gap between HLS design and physical design, it is difficult for HLS programmers to analyze and identify the root causes and to fix their HLS designs for better timing closure. Recent efforts have addressed these issues by employing coarse-grained floorplanning and pipelining strategies on task-parallel HLS designs, where multiple tasks run concurrently and communicate through FIFO stream channels. However, many applications are not streaming-friendly, and many existing accelerator designs rely heavily on buffer-channel-based communication between tasks. In this work, we take a step further and support a task-parallel programming model in which tasks can communicate via both FIFO stream channels and buffer channels. To achieve this goal, we design and implement the PASTA framework, which takes a large task-parallel HLS design as input and automatically generates a high-frequency FPGA accelerator via HLS and physical-design co-optimization. Our framework introduces a latency-insensitive buffer channel design that supports memory partitioning and ping-pong buffering while remaining compatible with vendor HLS tools. On the frontend, we provide an easy-to-use programming model for the proposed buffer channel; on the backend, we implement efficient placement and pipelining strategies for it.
To validate the effectiveness of our framework, we test it on four widely used Rodinia HLS benchmarks and two real-world accelerator designs, showing an average frequency improvement of 25% and peak improvements of up to 89% on AMD/Xilinx Alveo U280 boards, compared to Vitis HLS baselines.

Publisher

Association for Computing Machinery (ACM)

