InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization

Author:

Lin Lifang (1), Deng Yuhui (1), Zhou Yi (2), Zhu Yifeng (3)

Affiliation:

1. Department of Computer Science, Jinan University, Guangzhou, Guangdong Province, China

2. TSYS School of Computer Science, Columbus State University, GA, USA

3. Department of Electrical and Computer Engineering, University of Maine, Orono, ME, USA

Abstract

Inline deduplication removes redundant data in real time as data is sent to the storage system. However, it causes data fragmentation: after deduplication, logically consecutive chunks are physically scattered across various containers. Many rewrite algorithms aim to alleviate the performance degradation caused by fragmentation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms decide whether a chunk is fragmented based on a simple preset fixed value, ignoring the variance in data characteristics between data segments. Consequently, they often fail to select an appropriate set of old containers for rewriting, so restoring a backup retrieves containers that hold a substantial number of invalid chunks. To address this issue, we propose an inline deduplication approach for storage systems, called InDe, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicate chunks to improve restore performance while maintaining high backup performance. We define an effectiveness metric, valid container referenced counts (VCRC), to identify appropriate containers for rewriting. We design a rewrite algorithm, F-greedy, that detects valid container utilization in order to rewrite low-VCRC containers. Based on the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references so that each segment shares duplicate chunks only with high-utilization containers, thereby improving restore speed. To take full advantage of these features, we further propose a second rewrite algorithm, F-greedy+, based on adaptive interval detection of valid container utilization. F-greedy+ estimates the valid utilization of old containers more accurately by detecting the trend of VCRC changes in two directions and selecting referenced containers at a global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that, compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves restore speed by 1.3×–2.4× while achieving almost the same backup performance.
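
The contrast the abstract draws between a fixed preset value and a per-segment adaptive choice can be illustrated with a small sketch. This is not the paper's F-greedy or F-greedy+ algorithm, whose details the abstract does not give. The fixed-cap function follows the general idea of Capping (keep only the most-referenced containers in a segment). The adaptive rule, which cuts at the largest drop in the sorted counts, is purely an assumption made for illustration, and the plain per-segment reference count only stands in for VCRC. All function names and the input layout are hypothetical.

```python
from collections import Counter


def container_ref_counts(segment):
    """Count how many chunks in a segment refer to each old container.

    `segment` is a list of (chunk_id, container_id) pairs; container_id is
    None for a unique chunk that has no stored copy yet. (Assumed layout.)
    """
    return Counter(cid for _, cid in segment if cid is not None)


def select_fixed_cap(counts, cap):
    """Fixed-threshold baseline: keep the `cap` most-referenced containers."""
    return {cid for cid, _ in counts.most_common(cap)}


def select_adaptive(counts):
    """Hypothetical adaptive rule: cut where the sorted counts drop the most.

    This is an illustrative stand-in, not the rule used by F-greedy.
    """
    ranked = counts.most_common()  # sorted by count, descending
    if len(ranked) < 2:
        return {cid for cid, _ in ranked}
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    largest = max(drops)
    if largest == 0:
        # All containers are equally referenced: keep every one of them.
        return {cid for cid, _ in ranked}
    cut = drops.index(largest) + 1
    return {cid for cid, _ in ranked[:cut]}


def plan_rewrites(segment, keep):
    """Return ids of duplicate chunks whose container was not selected.

    These chunks would be rewritten as unique chunks into a new container.
    """
    return [chunk for chunk, cid in segment
            if cid is not None and cid not in keep]
```

With a segment whose duplicates refer to four old containers 5, 4, 1, and 1 times, a fixed cap of 3 still keeps one container that contributes a single chunk, whereas the adaptive rule keeps only the two heavily referenced containers and marks the two stray chunks for rewriting.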

Funder

National Natural Science Foundation of China

Guangdong Basic and Applied Basic Research Foundation

International Cooperation Project of Guangdong Province

Science and Technology Planning Project of Guangzhou

Open Project Program of Wuhan National Laboratory for Optoelectronics

Industry-University-Research Collaboration Project of Zhuhai

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture

Cited by 6 articles.

1. Fog-assisted de-duplicated data exchange in distributed edge computing networks;Scientific Reports;2024-09-04

2. The Design of Fast Delta Encoding for Delta Compression Based Storage Systems;ACM Transactions on Storage;2024-08-06

3. APRG: A Fair Information Granule Model Based on Adaptive Probability Replacement Resampling;2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS);2023-12-17

4. An exploratory analysis of methods for real-time data deduplication in streaming processes;Proceedings of the 17th ACM International Conference on Distributed and Event-based Systems;2023-06-27

5. Research on Global BloomFilter-Based Data Routing Strategy of Deduplication in Cloud Environment;IETE Journal of Research;2023-04-10
