Efficient Low-Memory Implementation of Sparse CNNs Using Encoded Partitioned Hybrid Sparse Format

Authors:

Barnali Basak¹, Pallab Dasgupta², Arpan Pal³

Affiliations:

1. TCS Research, Tata Consultancy Services Ltd., Kolkata, India

2. Computer Science & Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India

3. Innovation Lab, Tata Consultancy Services Ltd., Kolkata, India

Abstract

Data compression techniques such as pruning lead to unstructured sparse Convolutional Neural Network (CNN) models without directly leveraging the sparsity to optimize both the memory consumption and the inference latency of a model with low to medium sparsity. State-of-the-art storage techniques either optimize model size at the cost of execution latency or optimize inference latency at the cost of the model's memory consumption. This tradeoff is largely due to the absence of a storage selection methodology that addresses sparsity sensitivity, which arises from the varied sparsity and positions of nonzero values, called the sparsity structure, across the different sparse layers of a model. This issue remains unexplored because current deployment standards for edge devices lack support for handling sparse data. This article introduces a data compaction strategy for unstructured sparse data that not only compresses nonzero data but also encodes it, combining the memory-consumption and latency-reduction benefits of both data compression and data encoding. We propose a novel storage representation, named the Encoded Partitioned Hybrid Sparse (EPaHS) format, which addresses sparsity sensitivity by customizing data storage to the sparsity structure of the data. Our data compaction technique and storage solution optimize the tradeoff between the memory consumption and inference latency of a sparse model without altering the network architecture or affecting its accuracy. Our solution extends easily to higher-dimensional data, outperforms standard storage solutions, and is beneficial for all valid mode orientations of multi-dimensional data. For an important health and wellness application, a single-lead short-time ECG classification model, EPaHS achieves up to a 16.18% reduction in size and a 15.16% reduction in latency compared to the original model of 42 MB size and 26.35 sec latency with ≈59% sparsity. For a ResNet50 model handling higher-dimensional data, it achieves a 21.33% size reduction and a 53.9% latency gain against the original model of 3265 KB size and 1.7 sec latency with ≈67% sparsity.
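The abstract describes choosing a layer's storage based on its sparsity structure rather than storing every layer the same way. The sketch below is a minimal illustration of that general idea only: it selects between dense storage and a CSR-like compressed form per layer based on measured sparsity. It is not the EPaHS format itself; the function names and the 0.5 sparsity threshold are assumptions made for demonstration.

# Illustrative sketch (Python/NumPy): per-layer storage selection by sparsity.
# Not the paper's EPaHS implementation; threshold and names are hypothetical.
import numpy as np

def to_csr(mat):
    # Store a 2-D weight matrix as CSR-like arrays: values, column indices, row pointers.
    values, col_idx, row_ptr = [], [], [0]
    for row in mat:
        nz = np.flatnonzero(row)          # positions of nonzero entries in this row
        values.extend(row[nz].tolist())
        col_idx.extend(nz.tolist())
        row_ptr.append(len(values))
    return {"format": "csr",
            "values": np.array(values, dtype=mat.dtype),
            "col_idx": np.array(col_idx, dtype=np.int32),
            "row_ptr": np.array(row_ptr, dtype=np.int32)}

def choose_storage(mat, sparse_threshold=0.5):
    # Pick dense or compressed storage for one layer from its measured sparsity.
    sparsity = 1.0 - np.count_nonzero(mat) / mat.size
    return to_csr(mat) if sparsity >= sparse_threshold else {"format": "dense", "values": mat}

# Example: a roughly 59%-sparse layer is stored in compressed form.
rng = np.random.default_rng(0)
layer = rng.standard_normal((64, 64)).astype(np.float32)
layer[rng.random(layer.shape) < 0.59] = 0.0
print(choose_storage(layer)["format"])    # -> "csr"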

Publisher

Association for Computing Machinery (ACM)
