PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

Authors:

Siyuan Ma¹, Kaustubh Mhatre², Jian Weng³, Bagus Hanindhito⁴, Zhengrong Wang⁵, Tony Nowatzki⁵, Lizy John⁴, Aman Arora²

Affiliations:

1. Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, United States

2. Arizona State University, Tempe, United States

3. King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

4. The University of Texas at Austin, Austin, United States

5. University of California, Los Angeles, Los Angeles, United States

Abstract

Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerator architectures targeting parallel workloads such as Deep Learning (DL), because it achieves massive data parallelism at low area overhead and provides orders-of-magnitude savings in data movement by placing computational resources closer to the data. While many PIM architectures have been proposed, improvements are needed in communicating intermediate results to consumer kernels, in communication between tiles at scale, in reduction operations, and in efficiently performing bit-serial operations with constants. We present PIMSAB, a scalable architecture that provides a spatially aware communication network for efficient intra-tile and inter-tile data movement, along with efficient computation support for otherwise inefficient bit-serial compute patterns. The architecture consists of a massive hierarchical array of compute-enabled SRAMs (CRAMs), co-designed with a compiler to achieve high utilization. The key novelties of the architecture are (1) efficient support for spatially-aware communication, via a local H-tree network for reductions, explicit hardware for shuffling operands, and systolic broadcasting, and (2) exploitation of the divisible nature of bit-serial computation through adaptive precision and efficient handling of constant operations. These innovations are integrated into a tensor-expression-based programming framework (including a compiler for easy programmability) that gives the programmer simple control over optimizations for mapping programs into massively parallel binaries for millions of PIM processing elements. Compared against a similarly provisioned modern Tensor Core GPU (NVIDIA A100), across common DL kernels and end-to-end DL networks (ResNet-18 and BERT), PIMSAB outperforms the GPU by 4.80× and reduces energy by 3.76×. We also compare PIMSAB with a similarly provisioned state-of-the-art SRAM PIM (Duality Cache) and DRAM PIM (SIMDRAM), and observe speedups of 3.7× and 3.88×, respectively.
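To make the bit-serial and adaptive-precision idea in the abstract concrete, the sketch below is a minimal, illustrative Python model of element-parallel bit-serial addition. It is not PIMSAB's implementation or programming interface; the function and parameter names are hypothetical. It only shows why the number of compute steps scales with operand precision, which is the property that adaptive precision exploits.

```python
# Illustrative sketch only (assumed model, not the paper's code): a bit-serial
# adder processes one bit position per step, for every vector element in parallel.
import numpy as np

def bit_serial_add(a, b, precision):
    """Add two unsigned integer vectors one bit-plane at a time.

    `precision` is the number of bit positions processed; with adaptive
    precision, a smaller value means proportionally fewer compute steps.
    """
    a = np.asarray(a, dtype=np.uint32)
    b = np.asarray(b, dtype=np.uint32)
    result = np.zeros_like(a)
    carry = np.zeros_like(a)
    for bit in range(precision):            # one "step" per bit position
        a_bit = (a >> bit) & 1              # bit-plane of operand A (all elements at once)
        b_bit = (b >> bit) & 1              # bit-plane of operand B
        s = a_bit ^ b_bit ^ carry           # full-adder sum, element-parallel
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        result |= s << bit                  # write the sum bit-plane back
    return result

# 3+5, 100+27, 250+4 -> [8, 127, 254]; 8-bit operands take 8 steps,
# 4-bit operands would take only 4.
print(bit_serial_add([3, 100, 250], [5, 27, 4], precision=8))
```

Because each bit position costs one pass over the array, truncating operands to fewer bits reduces the step count proportionally. When one operand is a known constant, some per-bit steps can be skipped or simplified, which is the kind of saving the abstract refers to as efficient handling of constant operations.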

Publisher

Association for Computing Machinery (ACM)

