Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling

Authors:

Zhang Datong (1), Deng Yuhui (1), Zhou Yi (2), Zhu Yifeng (3), Qin Xiao (4)

Affiliations:

1. Department of Computer Science, Jinan University, Guangzhou, Guangdong Province, China

2. TSYS School of Computer Science, Columbus State University, GA, USA

3. School of Electrical and Computer Engineering, University of Maine, Orono, ME, USA

4. Department of Computer Science and Software Engineering, Auburn University, Alabama, USA

Abstract

Data deduplication techniques construct an index consisting of fingerprint entries to identify and eliminate duplicated copies of repeating data. The bottleneck of disk-based index lookup and the data fragmentation caused by eliminating duplicated chunks are two challenging issues in data deduplication. Deduplication-based backup systems generally alleviate both issues by storing contiguous chunks together with their fingerprints in containers to preserve data locality, but this remains inadequate. To address these two issues, we propose a container utilization based hot fingerprint entry distilling strategy to improve the performance of deduplication-based backup systems. We divide the index into three parts: hot fingerprint entries, fragmented fingerprint entries, and useless fingerprint entries. A container with utilization smaller than a given threshold is called a sparse container. Fingerprint entries that point to non-sparse containers are hot fingerprint entries. For the remaining fingerprint entries, if a fingerprint entry matches any fingerprint of forthcoming backup chunks, it is classified as a fragmented fingerprint entry. Otherwise, it is classified as a useless fingerprint entry. We observe that hot fingerprint entries account for a small part of the index, whereas the remaining fingerprint entries account for the majority of the index. This intriguing observation inspires us to develop a hot fingerprint entry distilling approach named HID. HID segregates useless fingerprint entries from the index to improve memory utilization and bypass disk accesses. In addition, HID separates fragmented fingerprint entries so that a deduplication-based backup system directly rewrites fragmented chunks, thereby alleviating adverse fragmentation. Moreover, HID introduces a feature to treat fragmented chunks as unique chunks. This feature compensates for the shortcoming that a Bloom filter cannot directly identify certain duplicated chunks (i.e., the fragmented chunks). To take full advantage of this feature, we propose an evolved HID strategy called EHID. EHID incorporates a Bloom filter, to which only hot fingerprints are mapped. In doing so, EHID exhibits two salient features: (i) EHID avoids disk accesses when identifying unique chunks and fragmented chunks; (ii) EHID slashes the false positive rate of the integrated Bloom filter. These salient features make EHID highly efficient. Our experimental results show that our approach reduces the average memory overhead of the index by 34.11% and 25.13% when using the Linux dataset and the FSL dataset, respectively. Furthermore, compared with the state-of-the-art method HAR, EHID boosts the average backup throughput by up to a factor of 2.25 with the Linux dataset and reduces the average disk I/O traffic by up to 66.21% with the FSL dataset. EHID also marginally improves the system's restore performance.
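The three-way split of the index and the hot-only Bloom filter described above can be illustrated with a minimal sketch. It is not the authors' implementation. The names `FingerprintEntry`, `classify_entries`, and `TinyBloomFilter` are hypothetical. Container utilization is assumed to be supplied as a number between 0 and 1 per container, and the fingerprints of forthcoming backup chunks are assumed to be available as a set. The abstract does not specify how either is obtained.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Set, Tuple


@dataclass(frozen=True)
class FingerprintEntry:
    fingerprint: str   # chunk fingerprint, e.g. a hex digest (illustrative)
    container_id: int  # container that the entry points to


def classify_entries(
    entries: Iterable[FingerprintEntry],
    utilization: Dict[int, float],  # assumed input: utilization per container
    upcoming: Set[str],             # assumed input: fingerprints of forthcoming chunks
    threshold: float,
) -> Tuple[List[FingerprintEntry], List[FingerprintEntry], List[FingerprintEntry]]:
    """Split index entries into hot, fragmented, and useless entries."""
    hot, fragmented, useless = [], [], []
    for entry in entries:
        # A container is sparse when its utilization is below the threshold.
        if utilization[entry.container_id] >= threshold:
            hot.append(entry)         # points to a non-sparse container
        elif entry.fingerprint in upcoming:
            fragmented.append(entry)  # sparse container, but will be referenced
        else:
            useless.append(entry)     # sparse container, never referenced
    return hot, fragmented, useless


class TinyBloomFilter:
    """Toy Bloom filter; the sizes are arbitrary illustrative defaults."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


entries = [
    FingerprintEntry("fp-a", 1),
    FingerprintEntry("fp-b", 2),
    FingerprintEntry("fp-c", 2),
]
hot, fragmented, useless = classify_entries(
    entries, utilization={1: 0.9, 2: 0.2}, upcoming={"fp-b"}, threshold=0.5
)

# Only hot fingerprints are mapped into the filter, as in EHID.
bloom = TinyBloomFilter()
for entry in hot:
    bloom.add(entry.fingerprint)
```

In this sketch a negative answer from `might_contain` covers both unique chunks and fragmented chunks, so both can be written as new data without a disk lookup. Inserting fewer fingerprints into a filter of fixed size also lowers its false positive rate, which is consistent with the second feature the abstract attributes to EHID.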

Funder

National Natural Science Foundation of China

International Cooperation Project of Guangdong Province

Science and Technology Planning Project of Guangzhou

Open Project Program of Wuhan National Laboratory for Optoelectronics

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture
