Efficient Low-Memory Implementation of Sparse CNNs Using Encoded Partitioned Hybrid Sparse Format

Authors:

Barnali Basak¹, Pallab Dasgupta², Arpan Pal³

Affiliations:

1. TCS Research, Tata Consultancy Services Ltd., Kolkata, India

2. Computer Science & Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India

3. Innovation Lab, Tata Consultancy Services Ltd., Kolkata, India

Abstract

Data compression techniques such as pruning lead to unstructured sparse Convolutional Neural Network (CNN) models without directly leveraging the sparsity to optimize both the memory consumption and the inference latency of a model with low to medium sparsity. State-of-the-art storage techniques either optimize model size at the cost of execution latency or optimize inference latency at the cost of the model's memory consumption. This tradeoff is largely due to the absence of a storage selection methodology that addresses sparsity sensitivity, which arises from the varied sparsity and positions of nonzero values, called the sparsity structure, across the different sparse layers of a model. This issue remains unexplored because current deployment standards for edge devices lack support for handling sparse data. This article introduces a data compaction strategy for unstructured sparse data that not only compresses nonzero data but also encodes it, combining the memory-consumption and latency-reduction benefits of both data compression and data encoding. We propose a novel storage representation, named the Encoded Partitioned Hybrid Sparse (EPaHS) format, which addresses sparsity sensitivity by customizing data storage to the sparsity structure of the data. Our data compaction technique and storage solution optimize the tradeoff between the memory consumption and inference latency of a sparse model without altering the network architecture or affecting its accuracy. Our solution extends easily to higher-dimensional data, outperforms standard storage solutions, and is beneficial for all valid mode orientations of multi-dimensional data. For an important health and wellness application, a single-lead short-time ECG classification model, EPaHS achieves up to a 16.18% reduction in size and a 15.16% reduction in latency compared to the original model of 42 MB size and 26.35 sec latency with ≈59% sparsity. For a ResNet50 model handling higher-dimensional data, it achieves a 21.33% size reduction and a 53.9% latency gain against the original model of 3265 KB size and 1.7 sec latency with ≈67% sparsity.
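The abstract describes choosing a layer's storage based on its sparsity structure rather than storing every layer the same way. The sketch below is a minimal illustration of that general idea only: it selects between dense storage and a CSR-like compressed form per layer based on measured sparsity. It is not the EPaHS format itself; the function names and the 0.5 sparsity threshold are assumptions made for demonstration.

# Illustrative sketch (Python/NumPy): per-layer storage selection by sparsity.
# Not the paper's EPaHS implementation; threshold and names are hypothetical.
import numpy as np

def to_csr(mat):
    # Store a 2-D weight matrix as CSR-like arrays: values, column indices, row pointers.
    values, col_idx, row_ptr = [], [], [0]
    for row in mat:
        nz = np.flatnonzero(row)          # positions of nonzero entries in this row
        values.extend(row[nz].tolist())
        col_idx.extend(nz.tolist())
        row_ptr.append(len(values))
    return {"format": "csr",
            "values": np.array(values, dtype=mat.dtype),
            "col_idx": np.array(col_idx, dtype=np.int32),
            "row_ptr": np.array(row_ptr, dtype=np.int32)}

def choose_storage(mat, sparse_threshold=0.5):
    # Pick dense or compressed storage for one layer from its measured sparsity.
    sparsity = 1.0 - np.count_nonzero(mat) / mat.size
    return to_csr(mat) if sparsity >= sparse_threshold else {"format": "dense", "values": mat}

# Example: a roughly 59%-sparse layer is stored in compressed form.
rng = np.random.default_rng(0)
layer = rng.standard_normal((64, 64)).astype(np.float32)
layer[rng.random(layer.shape) < 0.59] = 0.0
print(choose_storage(layer)["format"])    # -> "csr"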

Publisher

Association for Computing Machinery (ACM)
