Affiliation:
1. School of Computer Science, Hubei University of Technology, Wuhan, China
2. School of Artificial Intelligence, Hubei Business College, Wuhan, China
3. School of Information Management, Hubei University of Economics, Wuhan, China
4. Department of Mathematics and Computer Science, Northeastern State University, Tahlequah, Oklahoma, USA
Abstract
Delta compression, a complementary technique to data deduplication, has gained widespread attention in network storage systems. It eliminates redundant data between non-duplicate but similar chunks, which data deduplication cannot identify. Combining data deduplication with delta compression can greatly reduce the network transmission overhead between servers and clients. Resemblance detection is the technique that identifies similar chunks for post-deduplication delta compression in network storage systems. Existing resemblance detection approaches rely on a preset similarity threshold and therefore fail to detect similar chunks with arbitrary similarity, which can be suboptimal. In this paper, the authors propose Chunk2vec, a resemblance detection scheme for delta compression that uses deep learning and an Approximate Nearest Neighbour Search technique to detect similar chunks within any given similarity range. Chunk2vec uses a deep neural network, Sentence-BERT, to extract an approximate feature vector for each chunk while preserving its similarity to other chunks. Experimental results on five real-world datasets indicate that Chunk2vec improves the accuracy of resemblance detection for delta compression and achieves a higher compression ratio than the state-of-the-art resemblance detection technique.
Publisher
Institution of Engineering and Technology (IET)
Cited by
2 articles.
1. Deduplication-Aware Healthcare Data Distribution in IoMT;Mathematics;2024-08-11
2. Is Low Similarity Threshold A Bad Idea in Delta Compression?;Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems;2024-07-08