Accelerating attention mechanism on FPGAs based on efficient reconfigurable systolic array

Authors:

Ye Wenhua¹, Zhou Xu², Zhou Joey TianYi³, Chen Cen⁴, Li Kenli²

Affiliation:

1. College of Information Science and Engineering, Hunan University, China, and China Electronics Technology Group Corporation 36th Research Institute, China

2. College of Information Science and Engineering, Hunan University, China

3. Centre for Frontier AI Research, A*STAR, Singapore

4. College of Information Science and Engineering, Hunan University, China

Abstract

Transformer architectures have recently attracted great interest in natural language processing, machine translation, and computer vision, and the attention mechanism is their core building block. However, attention is expensive because of its intensive matrix computations and complicated data flow. Existing hardware architectures have drawbacks when mapped to the computational structure of attention, such as inflexibility and low efficiency. Most prior work accelerates attention by reducing the amount of computation through various pruning algorithms, which affects the results to a certain extent depending on the sparsity. This paper proposes a hardware accelerator for multi-head attention (MHA) on field-programmable gate arrays (FPGAs) featuring a reconfigurable architecture, an efficient systolic array, and a hardware-friendly radix-2 softmax. We propose a novel Four-input Processing Element (FPE) that doubles the computation rate of the data-aware systolic array (SA) while keeping it efficient and load-balanced. In particular, the computation framework is carefully designed to keep the SA highly utilized. Our design is evaluated on a Xilinx Alveo U250 card; the proposed architecture achieves 51.3× and 17.3× latency improvements and 54.4× and 17.9× energy savings over CPU and GPU implementations, respectively.
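As background for readers, the following NumPy sketch illustrates the two kernels the abstract names: multi-head attention and a radix-2 (base-2) softmax that rewrites e^x as a power of two so hardware can evaluate it with shifts and a small lookup table. Function names, tensor shapes, and the log2(e) rescaling are illustrative assumptions; this shows the computation the accelerator maps onto its systolic array, not the authors' implementation.

import numpy as np

def softmax_radix2(x, axis=-1):
    # Rewrite e^x as 2^(x * log2(e)); a base-2 power splits into an integer
    # shift plus a small fractional lookup table in hardware. Functional
    # sketch of the radix-2 idea only, not the paper's exact circuit.
    x = (x - np.max(x, axis=axis, keepdims=True)) * np.log2(np.e)
    p = np.exp2(x)
    return p / np.sum(p, axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, num_heads):
    # Reference multi-head attention, the workload the accelerator targets.
    # Assumed shapes for illustration: x is (seq_len, d_model); each weight
    # matrix is (d_model, d_model).
    d_model = x.shape[1]
    d_head = d_model // num_heads
    q, k, v = x @ wq, x @ wk, x @ wv
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = (q[:, s] @ k[:, s].T) / np.sqrt(d_head)  # Q·K^T / sqrt(d_head)
        attn = softmax_radix2(scores, axis=-1)            # hardware-friendly softmax
        heads.append(attn @ v[:, s])
    return np.concatenate(heads, axis=-1) @ wo

# Toy usage: 16-token sequence, model width 64, 4 heads.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))
wq, wk, wv, wo = (rng.standard_normal((64, 64)) for _ in range(4))
print(multi_head_attention(x, wq, wk, wv, wo, num_heads=4).shape)  # (16, 64)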

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Software


Cited by 9 articles.

1. ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection;ACM Transactions on Architecture and Code Optimization;2024-09-14

2. BlissCam: Boosting Eye Tracking Efficiency with Learned In-Sensor Sparse Sampling;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29

3. Enhancing Long Sequence Input Processing in FPGA-Based Transformer Accelerators through Attention Fusion;Proceedings of the Great Lakes Symposium on VLSI 2024;2024-06-12

4. A Case for Low Bitwidth Floating Point Arithmetic on FPGA for Transformer Based DNN Inference;2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW);2024-05-27

5. MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine;2023 International Conference on Field Programmable Technology (ICFPT);2023-12-12
