When Massive GPU Parallelism Ain’t Enough: A Novel Hardware Architecture of 2D-LSTM Neural Network-Reference-Cited by-同舟云学术

When Massive GPU Parallelism Ain’t Enough: A Novel Hardware Architecture of 2D-LSTM Neural Network

Published:2022-03-31 Issue:1 Volume:15 Page:1-35
ISSN:1936-7406
Container-title:ACM Transactions on Reconfigurable Technology and Systems
language:en
Short-container-title:ACM Trans. Reconfigurable Technol. Syst.

Author:

Rybalkin Vladimir¹,Ney Jonas¹,Tekleyohannes Menbere Kina¹,Wehn Norbert¹

Affiliation:

1. Microelectronic Systems Design Research Group, University of Kaiserslautern, Kaiserslautern, Germany

Abstract

Multidimensional Long Short-Term Memory (MD-LSTM) neural network is an extension of one-dimensional LSTM for data with more than one dimension. MD-LSTM achieves state-of-the-art results in various applications, including handwritten text recognition, medical imaging, and many more. However, its implementation suffers from the inherently sequential execution that tremendously slows down both training and inference compared to other neural networks. The main goal of the current research is to provide acceleration for inference of MD-LSTM. We advocate that Field-Programmable Gate Array (FPGA) is an alternative platform for deep learning that can offer a solution when the massive parallelism of GPUs does not provide the necessary performance required by the application. In this article, we present the first hardware architecture for MD-LSTM. We conduct a systematic exploration to analyze a tradeoff between precision and accuracy. We use a challenging dataset for semantic segmentation, namely historical document image binarization from the DIBCO 2017 contest and a well-known MNIST dataset for handwritten digit recognition. Based on our new architecture, we implement FPGA-based accelerators that outperform Nvidia Geforce RTX 2080 Ti with respect to throughput by up to 9.9

and Nvidia Jetson AGX Xavier with respect to energy efficiency by up to 48

. Our accelerators achieve higher throughput, energy efficiency, and resource efficiency than FPGA-based implementations of convolutional neural networks (CNNs) for semantic segmentation tasks. For the handwritten digit recognition task, our FPGA implementations provide higher accuracy and can be considered as a solution when accuracy is a priority. Furthermore, they outperform earlier FPGA implementations of one-dimensional LSTMs with respect to throughput, energy efficiency, and resource efficiency.

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3469661

Reference77 articles.

1. Are 2d-lstm really dead for offline text recognition?;Moysset Bastien;International Journal on Document Analysis and Recognition (IJDAR),2019

Cited by 1 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. A job scheduling algorithm based on parallel workload prediction on computational grid;Journal of Parallel and Distributed Computing;2023-01