Hybrid Precision Floating-Point (HPFP) Selection to Optimize Hardware-Constrained Accelerator for CNN Training
Authors:
Muhammad Junaid 1, Hayotjon Aliev 1, SangBo Park 1, HyungWon Kim 1, Hoyoung Yoo 2, Sanghoon Sim 1
Affiliations:
1. Department of Electronics, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Republic of Korea
2. Department of Electronics, Chungnam National University, Daejeon 34134, Republic of Korea
Abstract
The rapid advancement of AI requires efficient accelerators for training on edge devices, which often face challenges related to the high hardware cost of floating-point arithmetic. To tackle this problem, efficient floating-point formats inspired by block floating point (BFP), such as Microsoft Floating Point (MSFP) and FlexBlock (FB), are emerging. However, the shared exponent limits their dynamic range and the precision of smaller-magnitude values within a block, which restricts BFP's ability to train deep neural networks (DNNs) on diverse datasets. This paper introduces hybrid precision floating-point (HPFP) selection algorithms, designed to systematically reduce precision and implement hybrid precision strategies, balancing layer-wise arithmetic and data-path precision to address the shortcomings of traditional floating-point formats. Reducing the data bit width with HPFP allows more read/write operations from memory per cycle, thereby decreasing off-chip data access and the size of on-chip memories. Unlike traditional reduced-precision formats that compute partial sums in BFP and accumulate them in 32-bit floating point (FP32), HPFP performs all multiply and accumulate operations in the reduced floating-point format, yielding significant hardware savings. For evaluation, two training accelerators for the YOLOv2-Tiny model were developed with distinct mixed-precision strategies, and their performance was benchmarked against an accelerator using the conventional 16-bit brain floating point (Bfloat16). The HPFP selection, using 10 bits for the data path of all layers and for the arithmetic of layers requiring low precision, along with 12 bits for layers requiring higher precision, results in a 49.4% reduction in energy consumption and a 37.5% decrease in memory access, with only a marginal mean average precision (mAP) degradation of 0.8% compared to the Bfloat16-based accelerator. This comparison demonstrates that the proposed HPFP-based accelerator is an efficient approach to designing compact, low-power accelerators without sacrificing accuracy.
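As a rough illustration of the idea (not code from the paper), the following Python/NumPy sketch emulates a reduced floating-point format by rounding the mantissa and clamping the exponent, and sketches a greedy per-layer precision selection loop. The bit splits (1 sign + 5 exponent + 4 mantissa for the 10-bit format, 1 + 5 + 6 for the 12-bit format), the `evaluate_map` callback, and the 0.8% mAP tolerance are assumptions for illustration only; subnormals and saturation on overflow are ignored for brevity.

```python
import numpy as np

def quantize_float(x, exp_bits=5, man_bits=4):
    """Simulate casting FP32 values into a reduced floating-point format
    with exp_bits of exponent and man_bits of mantissa (plus a sign bit).
    Simplified sketch: subnormals and overflow saturation are ignored."""
    x = np.asarray(x, dtype=np.float32)
    man, exp = np.frexp(x)               # man in [0.5, 1), exp is an integer
    scale = 2.0 ** man_bits
    man = np.round(man * scale) / scale  # keep man_bits fractional bits
    emax = 2 ** (exp_bits - 1)
    exp = np.clip(exp, -emax + 1, emax)  # clamp to representable exponents
    return np.ldexp(man, exp).astype(np.float32)

def select_layer_precision(layer_names, evaluate_map, fp32_map, tol=0.008):
    """Greedy per-layer format selection: try the narrower format first and
    widen only when validation mAP drops beyond tol. evaluate_map(cfg) is a
    hypothetical user-supplied callback that runs validation with the given
    per-layer (exp_bits, man_bits) assignment."""
    cfg = {}
    for name in layer_names:
        for fmt in [(5, 4), (5, 6)]:     # assumed 10-bit, then 12-bit split
            cfg[name] = fmt
            if fp32_map - evaluate_map(cfg) <= tol:
                break                    # narrowest format within tolerance
    return cfg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(1024).astype(np.float32)
    for label, (e, m) in [("10-bit", (5, 4)), ("12-bit", (5, 6))]:
        err = np.mean(np.abs(w - quantize_float(w, e, m)))
        print(f"{label} (exp={e}, man={m}): mean abs error = {err:.2e}")
```

Trying the narrower format first mirrors the abstract's stated goal: keep as many layers as possible at 10 bits and widen to 12 bits only where accuracy demands it.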
Funder
National Research Foundation of Korea; Korea government (Ministry of Science and ICT)
Cited by
1 article.