Affiliation:
1. Institute for Quantum Information & State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, China
2. School of Computer, National University of Defense Technology, Changsha, China
3. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha, China
Abstract
As Convolutional Neural Networks (CNNs) grow deeper and more complex, they become both memory-intensive and computation-intensive. To address this issue, lightweight neural networks reduce parameters and Multiply-and-Accumulate (MAC) operations by using Depthwise Separable Convolution (DSC), improving speed and efficiency. Nonetheless, the energy efficiency of classical Von Neumann architectures for CNNs is limited by the memory wall challenge. Spin-based architectures have the potential to address this challenge thanks to their integration of memory and computing with ultra-high energy efficiency. However, deploying DSC on spin-based architectures with the traditional dataflow leads to heavy activation movement and low hardware utilization. Moreover, the inter-layer data dependency of neural networks increases latency. These factors become the bottleneck to improving energy efficiency and performance.
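The parameter and MAC savings of DSC follow directly from its factorization into a depthwise and a pointwise stage: a standard K×K convolution costs K·K·C_in·C_out parameters, while DSC costs K·K·C_in + C_in·C_out, a reduction factor of 1/(1/C_out + 1/K²). The sketch below illustrates this arithmetic for one hypothetical layer shape (the shape is illustrative, not taken from the paper):

```python
def std_conv_cost(k, c_in, c_out, h, w):
    """Standard convolution: every output channel filters all input channels."""
    params = k * k * c_in * c_out
    macs = params * h * w  # one k*k*c_in dot product per output pixel per channel
    return params, macs

def dsc_cost(k, c_in, c_out, h, w):
    """Depthwise separable: per-channel k*k filters, then a 1x1 pointwise mix."""
    dw_params = k * k * c_in       # depthwise stage
    pw_params = c_in * c_out       # pointwise stage
    params = dw_params + pw_params
    macs = params * h * w
    return params, macs

# Hypothetical MobileNet-style layer shape.
k, c_in, c_out, h, w = 3, 32, 64, 112, 112
std_p, std_m = std_conv_cost(k, c_in, c_out, h, w)
dsc_p, dsc_m = dsc_cost(k, c_in, c_out, h, w)

reduction = std_p / dsc_p
print(f"param reduction: {reduction:.2f}x")
print(f"MAC reduction:   {std_m / dsc_m:.2f}x")
# Matches the analytical factor 1 / (1/c_out + 1/k**2).
```

For this shape both parameters and MACs shrink by the same factor, since the spatial term h·w multiplies both costs; the savings grow with the output channel count and kernel size.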
Inspired by these challenges, we propose a novel dataflow on Spin-based Architectures for Lightweight neural networks (SAL). Instead of unrolling convolutions, the dataflow selects activations in the crossbar according to the convolution window, and it also realizes inter-layer data reuse. Moreover, the dataflow reduces the latency caused by data dependencies between layers, achieving higher performance. To the best of our knowledge, this is the first design to use a hybrid dataflow for a PIM architecture. We also optimize the structure of the spin-based crossbar and the pipeline based on the dataflow to achieve better data reuse and computational parallelism. When deploying MobileNet V1, the novel dataflow improves hardware utilization by 23×∼105× and reduces data traffic by 1.09×∼18.6×. Compared with NEBULA, a spin-based non-Von Neumann architecture, SAL reduces energy consumption by 4× (to 0.32 mJ) and improves performance by 7.3× (to 10.43 GOP/s). Moreover, SAL improves power efficiency by more than 29× over NEBULA. Compared with Eyeriss, SAL improves energy efficiency by four orders of magnitude.
Funder
National Key R&D
NSFC
STIP of Hunan Province
Foundation of PDL
Key Laboratory of Advanced Microprocessor Chips and Systems
Hunan Postgraduate Research Innovation Project
Publisher
Association for Computing Machinery (ACM)
References (48 articles)
1. Xbar-Partitioning: A Practical Way for Parasitics and Noise Tolerance in Analog IMC Circuits
2. CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory-Based Neural Network Accelerators
3. Vanessa H-C Chen and Lawrence Pileggi. 2013. An 8.5 mW 5GS/s 6b flash ADC with dynamic offset calibration in 32nm CMOS SOI. In Proceedings of the 2013 Symposium on VLSI Circuits. IEEE, C264–C265.
4. Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. 609–622.
5. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks