Memory- and Communication-Aware Model Compression for Distributed Deep Learning Inference on IoT

Published:2019-10-19 Issue:5s Volume:18 Page:1-22
ISSN:1539-9087
Container-title:ACM Transactions on Embedded Computing Systems
language:en
Short-container-title:ACM Trans. Embed. Comput. Syst.

Author:

Bhardwaj Kartikeya¹,Lin Ching-Yi¹,Sartor Anderson¹,Marculescu Radu¹

Affiliation:

1. Carnegie Mellon University, Pittsburgh, PA, USA

Abstract

Model compression has emerged as an important area of research for deploying deep learning models on Internet-of-Things (IoT) devices. However, in extremely memory-constrained scenarios, even compressed models cannot fit within the memory of a single device and, as a result, must be distributed across multiple devices. This leads to a distributed inference paradigm in which memory and communication costs represent a major bottleneck. Yet existing model compression techniques are not communication-aware. Therefore, we propose Network of Neural Networks (NoNN), a new distributed IoT learning paradigm that compresses a large pretrained ‘teacher’ deep network into several disjoint and highly compressed ‘student’ modules, without loss of accuracy. Moreover, we propose a network science-based knowledge partitioning algorithm for the teacher model, and then train individual students on the resulting disjoint partitions. Extensive experimentation on five image classification datasets, for user-defined memory/performance budgets, shows that NoNN achieves higher accuracy than several baselines and accuracy similar to that of the teacher model, while using minimal communication among students. Finally, as a case study, we deploy the proposed model for the CIFAR-10 dataset on edge devices and demonstrate significant improvements in memory footprint (up to 24×), performance (up to 12×), and energy per node (up to 14×) compared to the large teacher model. We further show that for distributed inference on multiple edge devices, our proposed NoNN model results in up to 33× reduction in total latency w.r.t. a state-of-the-art model compression baseline.

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture,Software

Link

https://dl.acm.org/doi/pdf/10.1145/3358205

References: 28 articles.

1. Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems. 2654--266

2. Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. 2018. ChannelNets: Compact and efficient convolutional neural networks via channel-wise convolutions. In Advances in Neural Information Processing Systems. 5203--5211.

3. Song Han, Huizi Mao, and William J. Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149 (2015).

4. Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In NIPS. 1135--1143.

Cited by 35 articles.

1. ResPrune: An energy-efficient restorative filter pruning method using stochastic optimization for accelerating CNN;Pattern Recognition;2024-11

2. A comprehensive review of model compression techniques in machine learning;Applied Intelligence;2024-09-02

3. Optimizing code allocation for hybrid on-chip memory in IoT systems;Integration;2024-07

4. BPS: Batching, Pipelining, Surgeon of Continuous Deep Inference on Collaborative Edge Intelligence;IEEE Transactions on Cloud Computing;2024-07

5. A unified privacy preserving model with AI at the edge for Human-in-the-Loop Cyber-Physical Systems;Internet of Things;2024-04