An Instruction-Driven Batch-Based High-Performance Resource-Efficient LSTM Accelerator on FPGA
Published: 2023-04-05
Journal: Electronics, Volume 12, Issue 7, Page 1731
ISSN: 2079-9292
Language: en
Authors: Mao Ning (1,2), Yang Haigang (3,4), Huang Zhihong (1,2)
Affiliations:
1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
2. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100094, China
3. School of Integrated Circuits, University of Chinese Academy of Sciences, Beijing 100049, China
4. Shandong Industrial Institute of Integrated Circuits Technology Ltd., Jinan 250001, China
Abstract
In recent years, long short-term memory (LSTM) networks have been used in many speech recognition tasks due to their excellent performance. However, the large computational load and complex data dependencies of LSTM make it difficult to deploy efficiently on field-programmable gate array (FPGA) platforms. This paper proposes an LSTM accelerator driven by a specific instruction set. The accelerator consists of a matrix multiplication unit and a post-processing unit. The matrix multiplication unit staggers the timing of data reads to reduce register usage. Through resource sharing, the post-processing unit completes its various calculations with only a small number of digital signal processing (DSP) slices, and a carefully designed data flow reduces its memory footprint. The accelerator is batch-based and can process data from multiple users simultaneously. Since the LSTM computation is decomposed into a sequence of instructions, the accelerator can execute multi-layer LSTM networks as well as large-scale LSTM networks. Experimental results show that the accelerator achieves a performance of 2036 GOPS at 16-bit data precision, with higher hardware utilization than previous work.
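The two-unit split described in the abstract mirrors the structure of a single LSTM time step: the gate pre-activations are one large matrix product (the matrix multiplication unit), while the activations and element-wise gate arithmetic form a separate stage (the post-processing unit). The following sketch illustrates that partition in floating point; the function name, variable shapes, and the fused 4-gate weight layout are illustrative assumptions, not the paper's actual design, which operates on 16-bit fixed-point data in hardware.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step, split into the two stages an accelerator
    like the one described could map to hardware.

    Hypothetical shapes: x (input_dim,), h_prev/c_prev (hidden_dim,),
    W (4*hidden_dim, input_dim), U (4*hidden_dim, hidden_dim),
    b (4*hidden_dim,) -- the four gates are stacked row-wise.
    """
    # Stage 1: matrix multiplication unit -- all four gate
    # pre-activations computed as one fused matrix product.
    z = W @ x + U @ h_prev + b

    # Stage 2: post-processing unit -- activations and element-wise
    # gate arithmetic (the part that shares DSP slices in hardware).
    i, f, g, o = np.split(z, 4)
    i = 1.0 / (1.0 + np.exp(-i))   # input gate
    f = 1.0 / (1.0 + np.exp(-f))   # forget gate
    o = 1.0 / (1.0 + np.exp(-o))   # output gate
    g = np.tanh(g)                 # candidate cell update

    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c
```

The data dependency the abstract mentions is visible here: stage 1 at time t needs h_prev from stage 2 at time t-1, which is why batching independent user sequences helps keep the matrix unit busy.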
Funder
National Natural Science Foundation of China
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
Cited by: 1 article.