Affiliations:
1. School of Engineering Science, Simon Fraser University, Canada
2. Computer Science Department, University of California, Los Angeles, United States
Abstract
In recent years, the adoption of FPGAs in datacenters has increased, with a growing number of users choosing High-Level Synthesis (HLS) as their preferred programming method. While HLS simplifies FPGA programming, a notable challenge arises when scaling up designs for modern datacenter FPGAs that comprise multiple dies. The extra delays introduced by die crossings and routing congestion can significantly degrade the achievable frequency of large designs on these FPGAs. Because of the gap between HLS design and physical design, it is difficult for HLS programmers to analyze and identify the root causes and to fix their HLS designs for better timing closure. Recent efforts address these issues by applying coarse-grained floorplanning and pipelining strategies to task-parallel HLS designs, in which multiple tasks run concurrently and communicate through FIFO stream channels. However, many applications are not streaming-friendly, and many existing accelerator designs rely heavily on buffer-channel-based communication between tasks.
In this work, we take a step further and support a task-parallel programming model in which tasks can communicate via both FIFO stream channels and buffer channels. To achieve this goal, we design and implement the PASTA framework, which takes a large task-parallel HLS design as input and automatically generates a high-frequency FPGA accelerator via HLS and physical-design co-optimization. Our framework introduces a latency-insensitive buffer channel design that supports memory partitioning and ping-pong buffering while remaining compatible with vendor HLS tools. On the frontend, we provide an easy-to-use programming model for the proposed buffer channel; on the backend, we implement efficient placement and pipelining strategies for it. To validate the effectiveness of our framework, we evaluate it on four widely used Rodinia HLS benchmarks and two real-world accelerator designs, showing an average frequency improvement of 25% and peak improvements of up to 89% on AMD/Xilinx Alveo U280 boards compared to Vitis HLS baselines.
Publisher
Association for Computing Machinery (ACM)