Efficient Hardware Architectures for 1D- and MD-LSTM Networks
Published: 2020-07-02
Issue: 11
Volume: 92
Pages: 1219-1245
ISSN: 1939-8018
Container-title: Journal of Signal Processing Systems
Language: en
Short-container-title: J Sign Process Syst
Author:
Rybalkin Vladimir, Sudarshan Chirag, Weis Christian, Lappas Jan, Wehn Norbert, Cheng Li
Abstract
Recurrent Neural Networks, in particular One-dimensional and Multidimensional Long Short-Term Memory (1D-LSTM and MD-LSTM), have achieved state-of-the-art classification accuracy in many applications such as machine translation, image caption generation, handwritten text recognition, and medical imaging. However, high classification accuracy comes at the cost of high compute, storage, and memory bandwidth requirements, which makes their deployment challenging, especially on energy-constrained platforms such as portable devices. Compared to CNNs, few investigations exist on efficient hardware implementations of 1D-LSTM, especially under energy constraints, and there is no published research on hardware architectures for MD-LSTM. In this article, we present two novel architectures for LSTM inference: a hardware architecture for MD-LSTM, and a DRAM-based Processing-in-Memory (DRAM-PIM) hardware architecture for 1D-LSTM. We present, for the first time, a hardware architecture for MD-LSTM and provide a trade-off analysis of accuracy and hardware cost across various precisions. We implement the new architecture as an FPGA-based accelerator that outperforms an NVIDIA K80 GPU implementation by up to 84× in runtime and up to 1238× in energy efficiency on a challenging historical document image binarization dataset from the DIBCO 2017 contest and on the well-known MNIST dataset for handwritten digit recognition. Our accelerator demonstrates the highest accuracy and comparable throughput relative to state-of-the-art FPGA-based multilayer perceptron implementations on the MNIST dataset. Furthermore, we present a new DRAM-PIM architecture for 1D-LSTM targeting energy-efficient compute platforms such as portable devices. The DRAM-PIM architecture integrates the computation units in close proximity to the DRAM cells in order to maximize data parallelism and energy efficiency. The proposed DRAM-PIM design is 16.19× more energy efficient than the FPGA implementation, with a total chip area overhead of 18% compared to a commodity 8 Gb DRAM chip. Our experiments show that the DRAM-PIM implementation delivers a throughput of 1309.16 GOp/s for an optical character recognition application.
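The abstract assumes familiarity with the MD-LSTM cell, which extends the 1D-LSTM recurrence to two (or more) spatial dimensions by keeping one forget gate per dimension and consuming the hidden and cell states of the already-processed neighbours. The NumPy sketch below is only a generic reference formulation of a single 2D-LSTM step for orientation; the function and weight names (md_lstm_step, W, U1, U2) are illustrative assumptions and do not correspond to the paper's hardware datapath, scan order, or precision choices.

```python
# Minimal 2D-LSTM (MD-LSTM with d = 2) cell step in NumPy.
# Generic reference formulation, not the paper's hardware architecture;
# all weight names and sizes here are illustrative assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def md_lstm_step(x, h_up, h_left, c_up, c_left, W, U1, U2, b):
    """One 2D-LSTM step at pixel (i, j).

    x       : (n_in,)           input at (i, j)
    h_up    : (n_hid,)          hidden state from (i-1, j)
    h_left  : (n_hid,)          hidden state from (i, j-1)
    c_up    : (n_hid,)          cell state from (i-1, j)
    c_left  : (n_hid,)          cell state from (i, j-1)
    W       : (5*n_hid, n_in)   input weights for gates i, f1, f2, o, g
    U1, U2  : (5*n_hid, n_hid)  recurrent weights for the two directions
    b       : (5*n_hid,)        biases
    """
    # One fused matrix-vector product per operand, then split into the five gates.
    z = W @ x + U1 @ h_up + U2 @ h_left + b
    i_g, f1, f2, o_g, g = np.split(z, 5)
    i_g, f1, f2, o_g = sigmoid(i_g), sigmoid(f1), sigmoid(f2), sigmoid(o_g)
    g = np.tanh(g)
    # Two forget gates, one per recurrent dimension: the defining MD-LSTM change.
    c = i_g * g + f1 * c_up + f2 * c_left
    h = o_g * np.tanh(c)
    return h, c

if __name__ == "__main__":
    # Scan a small feature map pixel by pixel (one of the four scan directions).
    rng = np.random.default_rng(0)
    H, W_img, n_in, n_hid = 4, 4, 3, 8
    img = rng.standard_normal((H, W_img, n_in))
    Wx = rng.standard_normal((5 * n_hid, n_in)) * 0.1
    U1 = rng.standard_normal((5 * n_hid, n_hid)) * 0.1
    U2 = rng.standard_normal((5 * n_hid, n_hid)) * 0.1
    b = np.zeros(5 * n_hid)
    h = np.zeros((H, W_img, n_hid))
    c = np.zeros((H, W_img, n_hid))
    zeros = np.zeros(n_hid)
    for i in range(H):
        for j in range(W_img):
            h_up, c_up = (h[i - 1, j], c[i - 1, j]) if i > 0 else (zeros, zeros)
            h_left, c_left = (h[i, j - 1], c[i, j - 1]) if j > 0 else (zeros, zeros)
            h[i, j], c[i, j] = md_lstm_step(img[i, j], h_up, h_left,
                                            c_up, c_left, Wx, U1, U2, b)
    print(h.shape)  # (4, 4, 8)
```

Note how each pixel depends on its top and left neighbours: parallelism is limited to anti-diagonals of the image, which is exactly what makes an efficient hardware mapping of MD-LSTM non-trivial compared to feed-forward networks.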
Funder
H2020 Future and Emerging Technologies, Stiftung Rheinland-Pfalz für Innovation
Publisher
Springer Science and Business Media LLC
Subject
Hardware and Architecture, Modelling and Simulation, Information Systems, Signal Processing, Theoretical Computer Science, Control and Systems Engineering
Cited by
12 articles.