Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling

Published:2021-11-30 Issue:4 Volume:17 Page:1-23
ISSN:1553-3077
Container-title:ACM Transactions on Storage
language:en
Short-container-title:ACM Trans. Storage

Author:

Zhang Datong¹,Deng Yuhui¹,Zhou Yi²,Zhu Yifeng³,Qin Xiao⁴

Affiliation:

1. Department of Computer Science, Jinan University, Guangzhou, Guangdong Province, China

2. TSYS School of Computer Science, Columbus State University, GA, USA

3. School of Electrical and Computer Engineering, University of Maine, Orono, ME, USA

4. Department of Computer Science and Software Engineering, Auburn University, Alabama, USA

Abstract

Data deduplication techniques construct an index of fingerprint entries to identify and eliminate duplicate copies of repeating data. Two challenging issues in data deduplication are the bottleneck of disk-based index lookup and the data fragmentation caused by eliminating duplicate chunks. Deduplication-based backup systems generally employ containers, which store contiguous chunks together with their fingerprints, to preserve data locality and alleviate both issues, but this remains inadequate. To address these two issues, we propose a container utilization based hot fingerprint entry distilling strategy to improve the performance of deduplication-based backup systems. We divide the index into three parts: hot fingerprint entries, fragmented fingerprint entries, and useless fingerprint entries. A container with utilization smaller than a given threshold is called a sparse container. Fingerprint entries that point to non-sparse containers are hot fingerprint entries. For the remaining fingerprint entries, if a fingerprint entry matches any fingerprint of forthcoming backup chunks, it is classified as a fragmented fingerprint entry. Otherwise, it is classified as a useless fingerprint entry. We observe that hot fingerprint entries account for a small part of the index, whereas the remaining fingerprint entries account for the majority of the index. This intriguing observation inspires us to develop a hot fingerprint entry distilling approach named HID. HID segregates useless fingerprint entries from the index to improve memory utilization and bypass disk accesses. In addition, HID separates fragmented fingerprint entries so that a deduplication-based backup system directly rewrites fragmented chunks, thereby alleviating adverse fragmentation. Moreover, HID introduces a feature to treat fragmented chunks as unique chunks. This feature compensates for the shortcoming that a Bloom filter cannot directly identify certain duplicated chunks (i.e., the fragmented chunks). To take full advantage of this feature, we propose an evolved HID strategy called EHID. EHID incorporates a Bloom filter, to which only hot fingerprints are mapped. In doing so, EHID exhibits two salient features: (i) EHID avoids disk accesses when identifying unique chunks and fragmented chunks; (ii) EHID slashes the false positive rate of the integrated Bloom filter. These salient features push EHID into a high-efficiency mode. Our experimental results show that our approach reduces the average memory overhead of the index by 34.11% and 25.13% when using the Linux dataset and the FSL dataset, respectively. Furthermore, compared with the state-of-the-art method HAR, EHID boosts the average backup throughput by up to a factor of 2.25 with the Linux dataset, and reduces the average disk I/O traffic by up to 66.21% with the FSL dataset. EHID also marginally improves the system's restore performance.
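The three-way split of the index and the hot-only Bloom filter described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Container`, `classify_entries`, and `BloomFilter`, the definition of utilization as referenced bytes divided by capacity, the threshold value of 0.5, and the filter sizing are all assumptions made for illustration; the abstract specifies none of them.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class EntryKind(Enum):
    HOT = "hot"
    FRAGMENTED = "fragmented"
    USELESS = "useless"


@dataclass(frozen=True)
class Container:
    capacity: int          # bytes the container can hold
    referenced_bytes: int  # bytes still referenced by the backup (assumed definition)

    @property
    def utilization(self) -> float:
        return self.referenced_bytes / self.capacity


def classify_entries(index, containers, upcoming_fingerprints, threshold=0.5):
    """Split index entries (fingerprint -> container id) into the three classes.

    A container whose utilization is below `threshold` is sparse. Entries that
    point to non-sparse containers are hot; the rest are fragmented if they
    match a forthcoming backup chunk, and useless otherwise.
    """
    kinds = {}
    for fingerprint, container_id in index.items():
        if containers[container_id].utilization >= threshold:
            kinds[fingerprint] = EntryKind.HOT
        elif fingerprint in upcoming_fingerprints:
            kinds[fingerprint] = EntryKind.FRAGMENTED
        else:
            kinds[fingerprint] = EntryKind.USELESS
    return kinds


class BloomFilter:
    """Tiny Bloom filter; the sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical example data: container 1 is well utilized, container 2 is sparse.
containers = {1: Container(capacity=100, referenced_bytes=90),
              2: Container(capacity=100, referenced_bytes=10)}
index = {"fp_a": 1, "fp_b": 2, "fp_c": 2}
upcoming = {"fp_b"}  # fingerprints expected in the forthcoming backup

kinds = classify_entries(index, containers, upcoming)

# Only hot fingerprints are mapped to the filter, as in the EHID description.
hot_filter = BloomFilter()
for fingerprint, kind in kinds.items():
    if kind is EntryKind.HOT:
        hot_filter.add(fingerprint)
```

In this sketch, a chunk whose fingerprint misses `hot_filter` would be written as if it were unique, which covers both truly unique chunks and fragmented chunks without a disk-based index lookup. Mapping fewer fingerprints to a filter of fixed size also lowers its false positive rate, which is the second benefit the abstract attributes to EHID.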

Funder

National Natural Science Foundation of China

International Cooperation Project of Guangdong Province

Science and Technology Planning Project of Guangzhou

Open Project Program of Wuhan National Laboratory for Optoelectronics

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3459626


Cited by 10 articles.

1. mm-CUR: A Novel Ubiquitous, Contact-free, and Location-aware Counterfeit Currency Detection in Bundles Using Millimeter-Wave Sensor;ACM Transactions on Sensor Networks;2024-09-05

2. Hash Overhead Analysis for GOP-level Video Deduplication in Cloud Storage Environment;2024 International Conference on Smart Systems for applications in Electrical Sciences (ICSSES);2024-05-03

3. APRG: A Fair Information Granule Model Based on Adaptive Probability Replacement Resampling;2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS);2023-12-17

4. A Docker Container-Based Solution for Course Archival on Moodle: Implementation and Evaluation;2023 8th International Conference on Electrical, Electronics and Information Engineering (ICEEIE);2023-09-28

5. Research on Global BloomFilter-Based Data Routing Strategy of Deduplication in Cloud Environment;IETE Journal of Research;2023-04-10