Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators

Published: 2023-03 Issue: 2 Volume: 20 Page: 1-26
ISSN:1544-3566
Container-title:ACM Transactions on Architecture and Code Optimization
language:en
Short-container-title:ACM Trans. Archit. Code Optim.

Author:

Liu Qiaoyi¹, Setter Jeff¹, Huff Dillon¹, Strange Maxwell¹, Feng Kathleen¹, Horowitz Mark¹, Raina Priyanka¹, Kjolstad Fredrik¹

Affiliation:

1. Stanford University, Stanford, CA, USA

Abstract

Image processing and machine learning applications benefit tremendously from hardware acceleration. Existing compilers target either FPGAs, which sacrifice power and performance for programmability, or ASICs, which become obsolete as applications change. Programmable domain-specific accelerators, such as coarse-grained reconfigurable arrays (CGRAs), have emerged as a promising middle-ground, but they have traditionally been difficult compiler targets since they use a different memory abstraction. In contrast to CPUs and GPUs, the memory hierarchies of domain-specific accelerators use push memories: memories that send input data streams to computation kernels or to higher or lower levels in the memory hierarchy and store the resulting output data streams. To address the compilation challenge caused by push memories, we propose that the representation of these memories in the compiler be altered to directly represent them by combining storage with address generation and control logic in a single structure: a unified buffer. The unified buffer abstraction enables the compiler to separate generic push memory optimizations from the mapping to specific memory implementations in the backend. This separation allows our compiler to map high-level Halide applications to different CGRA memory designs, including some with a ready-valid interface. The separation also opens the opportunity for optimizing push memory elements on reconfigurable arrays. Our optimized memory implementation, the Physical Unified Buffer, uses a wide-fetch, single-port SRAM macro with built-in address generation logic to implement a buffer with two read and two write ports. It is 18% smaller and consumes 31% less energy than a physical buffer implementation using a dual-port memory that only supports two ports. Finally, our system evaluation shows that enabling a compiler to support CGRAs leads to performance and energy benefits. Over a wide range of image processing and machine learning applications, our CGRA achieves 4.7× better runtime and 3.5× better energy-efficiency compared to an FPGA.
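
The idea of a push memory, with storage, address generation, and control held in one structure, can be illustrated with a small software model. The sketch below is not the paper's compiler representation or hardware design; the names `AffinePort` and `PushBuffer`, the cycle-based schedule, and the write-before-read ordering within a cycle are all assumptions made for illustration. It streams a 3-wide sliding window out of a 4-entry circular buffer without the consumer ever issuing an address.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AffinePort:
    """Hypothetical port: active for `count` consecutive cycles from `start`.

    On its i-th access it touches address (offset + stride * i) % capacity.
    """
    start: int
    count: int
    offset: int = 0
    stride: int = 1

    def address(self, cycle: int, capacity: int) -> Optional[int]:
        i = cycle - self.start
        if 0 <= i < self.count:
            return (self.offset + self.stride * i) % capacity
        return None  # the port is idle on this cycle


class PushBuffer:
    """Illustrative model: storage plus per-port address generation and control."""

    def __init__(self, capacity: int, write_port: AffinePort,
                 read_ports: List[AffinePort]) -> None:
        self.mem = [0] * capacity
        self.write_port = write_port
        self.read_ports = read_ports

    def run(self, stream: List[int], cycles: int) -> List[List[int]]:
        source = iter(stream)
        outputs: List[List[int]] = [[] for _ in self.read_ports]
        for cycle in range(cycles):
            # Assumption: within a cycle the write lands before any read.
            waddr = self.write_port.address(cycle, len(self.mem))
            if waddr is not None:
                self.mem[waddr] = next(source)
            # Each read port pushes a value downstream when its schedule fires.
            for port, out in zip(self.read_ports, outputs):
                raddr = port.address(cycle, len(self.mem))
                if raddr is not None:
                    out.append(self.mem[raddr])
        return outputs


# Eight input values; three read ports emit the taps of a 3-wide window.
stream = [10, 11, 12, 13, 14, 15, 16, 17]
buf = PushBuffer(
    capacity=4,
    write_port=AffinePort(start=0, count=8),
    read_ports=[AffinePort(start=2, count=6, offset=k) for k in range(3)],
)
outputs = buf.run(stream, cycles=8)
windows = list(zip(*outputs))  # one tuple per window position
```

Here the consumer only receives data: each port's schedule (`start`, `count`) and address pattern (`offset`, `stride`) are fixed when the buffer is configured, which is the property that distinguishes a push memory from a cache or scratchpad addressed by the compute kernel.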

Funder

DARPA’s DSSoC

Stanford AHA Center

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3572908

Reference: 57 articles.

1. Learning to optimize Halide with tree search and random programs

2. BilRC: An Execution Triggered Coarse Grained Reconfigurable Architecture

3. A practical automatic polyhedral parallelizer and locality optimizer

4. Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. 2011. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’11). Association for Computing Machinery, New York, NY, 33–36.

5. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks

Cited by 6 articles.

1. Accelerated Inference for Thyroid Nodule Recognition in Ultrasound Imaging Using FPGA;2024-08-16

2. PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs;ACM Transactions on Reconfigurable Technology and Systems;2024-08-05

3. HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis;Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1;2024-04-17

4. SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for Halide;IEEE Access;2024

5. Hardware Design of Lightweight Binary Classification Algorithms for Small-Size Images on FPGA;IEEE Access;2024