Optimizing CNN Hardware Acceleration with Configurable Vector Units and Feature Layout Strategies
Published: 2024-03-12
Volume: 13
Issue: 6
Page: 1050
ISSN: 2079-9292
Container-title: Electronics
Language: en
Author:
He Jinzhong 1,2,3, Zhang Ming 1,2,3, Xu Jian 1,2,3, Yu Lina 1,2,3, Li Weijun 1,2,3
Affiliation:
1. AnnLab, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
2. College of Materials Science and Opto-Electronic Technology & School of Integrated Circuits, University of Chinese Academy of Sciences, Beijing 100049, China
3. Beijing Key Laboratory of Semiconductor Neural Network Intelligent Sensing and Computing Technology, Beijing 100083, China
Abstract
Convolutional neural network (CNN) hardware acceleration is critical for improving performance and facilitating the deployment of CNNs in edge applications. Owing to its efficiency and simplicity, channel-group parallelism has become a popular method for CNN hardware acceleration. However, when processing data with small channel counts, a mismatch arises between the feature data and the computing units, resulting in low utilization of the computing units. When processing the middle layers of a convolutional neural network, the mismatch between the feature-usage order and the feature-loading order leads to a low input-feature cache hit rate. To address these challenges, this paper proposes an innovative method, inspired by data-reordering techniques, that achieves CNN hardware acceleration while reusing the same multiplier resources. The method decomposes the hardware acceleration process into feature organization, feature block scheduling and allocation, and feature calculation subtasks to ensure the efficient mapping of continuously loaded feature data onto the computation. Specifically, this paper introduces a convolutional-algorithm mapping strategy and a configurable vector operation unit to improve multiplier utilization across different feature map sizes and channel numbers. In addition, an off-chip address mapping and on-chip cache management mechanism is proposed to improve feature access efficiency and the on-chip feature cache hit rate. Furthermore, a configurable feature block scheduling policy is proposed to strike a balance between weight reuse and feature writeback pressure. Experimental results demonstrate the effectiveness of this method: with 512 multipliers accelerating VGG16 at 100 MHz, the actual computing performance reaches 102.3 giga operations per second (GOPS). Compared with other CNN hardware acceleration methods, the average computing-array utilization is as high as 99.88%, and the computing density is higher.
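The headline numbers admit a quick sanity check, and the small-channel mismatch the abstract describes is easy to quantify. The sketch below is a minimal illustration, not the paper's implementation: the 64-lane group width, the `lane_utilization` helper, and the per-layer channel list are assumptions made for illustration; only the multiplier count (512), the clock (100 MHz), and the achieved throughput (102.3 GOPS) come from the abstract.

```python
# Sketch (hypothetical parameters): quantifies lane utilization when a fixed
# channel-group-parallel array processes layers whose channel counts do not
# fill the group width, and checks the peak-throughput arithmetic for a
# 512-multiplier array at 100 MHz.
import math

MULTIPLIERS = 512    # total multipliers in the array (from the abstract)
CLOCK_HZ = 100e6     # 100 MHz (from the abstract)
GROUP_WIDTH = 64     # hypothetical channel-group width of one vector unit

def lane_utilization(channels: int, group_width: int = GROUP_WIDTH) -> float:
    """Fraction of active lanes when `channels` inputs are packed into
    fixed groups of `group_width` lanes (naive channel-group mapping)."""
    groups = math.ceil(channels / group_width)
    return channels / (groups * group_width)

# Input-channel counts of the first few VGG16 convolutional layers.
for c in [3, 64, 128]:
    print(f"{c:3d} input channels -> {lane_utilization(c):6.1%} lane utilization")
# The first layer (3 channels) keeps only 3/64 of the lanes busy; this is the
# small-channel mismatch the configurable vector unit is meant to address.

# Peak throughput: each multiplier performs one multiply-accumulate per cycle,
# conventionally counted as 2 operations.
peak_gops = MULTIPLIERS * 2 * CLOCK_HZ / 1e9   # 102.4 GOPS
achieved_gops = 102.3                          # reported in the abstract
print(f"peak = {peak_gops:.1f} GOPS, achieved = {achieved_gops} GOPS "
      f"({achieved_gops / peak_gops:.2%} of peak)")
```

Counting each multiply-accumulate as two operations gives a 102.4 GOPS peak, so the reported 102.3 GOPS corresponds to roughly 99.9% of peak, consistent with the 99.88% average computing-array utilization cited above.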
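The "feature organization" and "off-chip address mapping" steps suggest a channel-blocked data layout, in which the channels consumed together by a vector unit are stored at consecutive addresses. The sketch below shows one common form of such a reordering as an illustration under assumed parameters (the group size `g`, the helper `to_channel_blocked`, and the use of NumPy are not from the paper); the paper's actual mapping scheme may differ.

```python
# Sketch (hypothetical layout): rearrange a plain NCHW tensor into a
# channel-group-blocked layout, (N, C/g, H, W, g), so that the g channels
# consumed together by a g-lane vector unit sit at consecutive addresses.
import numpy as np

def to_channel_blocked(x: np.ndarray, g: int) -> np.ndarray:
    """Rearrange (N, C, H, W) into (N, ceil(C/g), H, W, g), zero-padding C
    up to a multiple of g so partial groups still map onto full lanes."""
    n, c, h, w = x.shape
    c_pad = -c % g  # zero channels appended to complete the last group
    x = np.pad(x, ((0, 0), (0, c_pad), (0, 0), (0, 0)))
    return x.reshape(n, (c + c_pad) // g, g, h, w).transpose(0, 1, 3, 4, 2)

x = np.arange(2 * 3 * 4 * 4, dtype=np.int32).reshape(2, 3, 4, 4)
blocked = to_channel_blocked(x, g=8)
print(blocked.shape)  # (2, 1, 4, 4, 8): one group of 8 lanes, 3 live + 5 padded
# After blocking, the innermost (fastest-varying) axis is the channel group, so
# one contiguous burst read feeds all vector lanes for a given spatial position,
# aligning the feature-loading order with the feature-usage order.
```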
Funder
Key-Area Research and Development Program of Guangdong Province