An Instruction-Driven Batch-Based High-Performance Resource-Efficient LSTM Accelerator on FPGA
Published: 2023-04-05
Journal: Electronics, Volume 12, Issue 7, Page 1731
ISSN: 2079-9292
Language: en
Authors: Mao Ning (1,2), Yang Haigang (3,4), Huang Zhihong (1,2)
Affiliations:
1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
2. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100094, China
3. School of Integrated Circuits, University of Chinese Academy of Sciences, Beijing 100049, China
4. Shandong Industrial Institute of Integrated Circuits Technology Ltd., Jinan 250001, China
Abstract
In recent years, long short-term memory (LSTM) networks have been used in many speech recognition tasks due to their excellent performance. However, the large computational load and complex data dependencies of LSTM make it difficult to deploy efficiently on field-programmable gate array (FPGA) platforms. This paper proposes an LSTM accelerator driven by a specific instruction set. The accelerator consists of a matrix multiplication unit and a post-processing unit. The matrix multiplication unit staggers the timing of data reads to reduce register usage. Through resource sharing, the post-processing unit completes its various calculations with only a small number of digital signal processing (DSP) slices, and a carefully designed data flow reduces its memory footprint. The accelerator is batch-based and can process data from multiple users simultaneously. Since the LSTM computation is decomposed into a sequence of instructions, the accelerator can execute multi-layer LSTM networks as well as large-scale LSTM networks. Experimental results show that the accelerator achieves a performance of 2036 GOPS at 16-bit data precision, with higher hardware utilization than previous work.
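The two-unit split described in the abstract mirrors the structure of a single LSTM time step: the gate pre-activations are one large matrix product (the matrix multiplication unit), while the activations and element-wise gate arithmetic form a separate stage (the post-processing unit). The following sketch illustrates that partition in floating point; the function name, variable shapes, and the fused 4-gate weight layout are illustrative assumptions, not the paper's actual design, which operates on 16-bit fixed-point data in hardware.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step, split into the two stages an accelerator
    like the one described could map to hardware.

    Hypothetical shapes: x (input_dim,), h_prev/c_prev (hidden_dim,),
    W (4*hidden_dim, input_dim), U (4*hidden_dim, hidden_dim),
    b (4*hidden_dim,) -- the four gates are stacked row-wise.
    """
    # Stage 1: matrix multiplication unit -- all four gate
    # pre-activations computed as one fused matrix product.
    z = W @ x + U @ h_prev + b

    # Stage 2: post-processing unit -- activations and element-wise
    # gate arithmetic (the part that shares DSP slices in hardware).
    i, f, g, o = np.split(z, 4)
    i = 1.0 / (1.0 + np.exp(-i))   # input gate
    f = 1.0 / (1.0 + np.exp(-f))   # forget gate
    o = 1.0 / (1.0 + np.exp(-o))   # output gate
    g = np.tanh(g)                 # candidate cell update

    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c
```

The data dependency the abstract mentions is visible here: stage 1 at time t needs h_prev from stage 2 at time t-1, which is why batching independent user sequences helps keep the matrix unit busy.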
Funder
National Natural Science Foundation of China
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
Cited by: 1 article.