Abstract
Sparse tensor algorithms are becoming widespread, particularly in deep learning, graph and data analytics, and scientific computing. Current high-performance broad-domain architectures, such as GPUs, often suffer from memory-system inefficiencies, moving too much data or moving it too far through the memory hierarchy. To increase performance and efficiency, proposed domain-specific accelerators tailor their architectures to the data needs of a narrow application domain, but as a result they cannot serve the wide range of applications that mix sparse and dense computation.
This article proposes Symphony, a hybrid programmable/specialized architecture that focuses on orchestrating data throughout the memory hierarchy, simultaneously reducing both the amount of unnecessary data moved and the distance data travels. Key elements of the Symphony architecture include (1) specialized reconfigurable units aimed not only at roofline floating-point computations but also at supporting data orchestration features such as address generation, data filtering, and sparse metadata processing; and (2) distribution of computation resources (both programmable and specialized) throughout the on-chip memory hierarchy. We demonstrate that Symphony can match non-programmable ASIC performance on sparse tensor algebra and deliver 31× better runtime and 44× better energy than a comparably provisioned GPU on these applications.
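To make the data-orchestration role concrete, below is a minimal software sketch of the kind of sparse-metadata processing such units offload from the floating-point datapath: intersecting the coordinate (metadata) streams of two compressed sparse vectors so that only matching nonzeros ever reach the multipliers. This is an illustration written in Python under assumed sorted-coordinate inputs; the function and variable names are hypothetical and do not reflect Symphony's actual interface.

def sparse_dot(coords_a, vals_a, coords_b, vals_b):
    """Dot product of two sparse vectors given in sorted coordinate form.

    Illustrative only: the two-pointer intersection over coordinate
    metadata is the filtering step that, in an architecture like
    Symphony, would run near memory so unmatched values never travel
    up the hierarchy.
    """
    i = j = 0
    acc = 0.0
    while i < len(coords_a) and j < len(coords_b):
        if coords_a[i] == coords_b[j]:
            # Only coordinates present in both operands reach the multiplier.
            acc += vals_a[i] * vals_b[j]
            i += 1
            j += 1
        elif coords_a[i] < coords_b[j]:
            i += 1
        else:
            j += 1
    return acc

# Example: only coordinate 4 is shared, so a single multiply is performed.
print(sparse_dot([1, 4, 7], [2.0, 3.0, 5.0], [0, 4, 9], [1.0, 10.0, 2.0]))  # 30.0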
Publisher
Association for Computing Machinery (ACM)
Cited by
2 articles.
1. Mind the Gap: Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024-06-29.
2. WASP: Exploiting GPU Pipeline Parallelism with Hardware-Accelerated Automatic Warp Specialization. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024-03-02.