Accelerating attention mechanism on FPGAs based on efficient reconfigurable systolic array

Authors:

Ye Wenhua¹, Zhou Xu², Zhou Joey TianYi³, Chen Cen⁴, Li Kenli²

Affiliation:

1. College of Information Science and Engineering, Hunan University, China, and China Electronics Technology Group Corporation 36th Research Institute, China

2. College of Information Science and Engineering, Hunan University, China

3. Centre for Frontier AI Research, A*STAR, Singapore

4. College of Information Science and Engineering, Hunan University, China

Abstract

Transformer architectures have recently attracted great interest in natural language processing, machine translation, and computer vision, and the attention mechanism is their core building block. However, attention is expensive because of its intensive matrix computations and complicated data flow. Existing hardware architectures have drawbacks when mapped to the computational structure of attention, such as inflexibility and low efficiency. Most prior work accelerates attention by reducing the amount of computation through various pruning algorithms, which affects the results to a certain extent depending on the sparsity. This paper proposes a hardware accelerator for multi-head attention (MHA) on field-programmable gate arrays (FPGAs) featuring a reconfigurable architecture, an efficient systolic array, and a hardware-friendly radix-2 softmax. We propose a novel Four-input Processing Element (FPE) that doubles the computation rate of the data-aware systolic array (SA) while keeping it efficient and load-balanced. In particular, the computation framework is carefully designed to keep the SA highly utilized. Our design is evaluated on a Xilinx Alveo U250 card; the proposed architecture achieves 51.3× and 17.3× latency improvements and 54.4× and 17.9× energy savings over CPU and GPU implementations, respectively.
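As background for readers, the following NumPy sketch illustrates the two kernels the abstract names: multi-head attention and a radix-2 (base-2) softmax that rewrites e^x as a power of two so hardware can evaluate it with shifts and a small lookup table. Function names, tensor shapes, and the log2(e) rescaling are illustrative assumptions; this shows the computation the accelerator maps onto its systolic array, not the authors' implementation.

import numpy as np

def softmax_radix2(x, axis=-1):
    # Rewrite e^x as 2^(x * log2(e)); a base-2 power splits into an integer
    # shift plus a small fractional lookup table in hardware. Functional
    # sketch of the radix-2 idea only, not the paper's exact circuit.
    x = (x - np.max(x, axis=axis, keepdims=True)) * np.log2(np.e)
    p = np.exp2(x)
    return p / np.sum(p, axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, num_heads):
    # Reference multi-head attention, the workload the accelerator targets.
    # Assumed shapes for illustration: x is (seq_len, d_model); each weight
    # matrix is (d_model, d_model).
    d_model = x.shape[1]
    d_head = d_model // num_heads
    q, k, v = x @ wq, x @ wk, x @ wv
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = (q[:, s] @ k[:, s].T) / np.sqrt(d_head)  # Q·K^T / sqrt(d_head)
        attn = softmax_radix2(scores, axis=-1)            # hardware-friendly softmax
        heads.append(attn @ v[:, s])
    return np.concatenate(heads, axis=-1) @ wo

# Toy usage: 16-token sequence, model width 64, 4 heads.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))
wq, wk, wv, wo = (rng.standard_normal((64, 64)) for _ in range(4))
print(multi_head_attention(x, wq, wk, wv, wo, num_heads=4).shape)  # (16, 64)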

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Software


Cited by 9 articles.

1. ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection;ACM Transactions on Architecture and Code Optimization;2024-09-14

2. BlissCam: Boosting Eye Tracking Efficiency with Learned In-Sensor Sparse Sampling;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29

3. Enhancing Long Sequence Input Processing in FPGA-Based Transformer Accelerators through Attention Fusion;Proceedings of the Great Lakes Symposium on VLSI 2024;2024-06-12

4. A Case for Low Bitwidth Floating Point Arithmetic on FPGA for Transformer Based DNN Inference;2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW);2024-05-27

5. MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine;2023 International Conference on Field Programmable Technology (ICFPT);2023-12-12
