Cluster and Single-Node Analysis of Long-Term Deduplication Patterns

Author:

Sun Zhen “Jason” (1), Kuenning Geoff (2), Mandal Sonam (3), Shilane Philip (4), Tarasov Vasily (5), Xiao Nong (1), Zadok Erez (3)

Affiliation:

1. National University of Defense Technology, Hunan, P.R. China

2. Harvey Mudd College, CA, USA

3. Stony Brook University, NY, USA

4. Dell EMC, PA, USA

5. IBM Research, California, USA

Abstract

Deduplication has become essential in disk-based backup systems, but there have been few long-term studies of backup workloads. Most past studies either were of a small static snapshot or covered only a short period that was not representative of how a backup system evolves over time. For this article, we first collected 21 months of data from a shared user file system; 33 users and over 4,000 snapshots are covered. We then analyzed the dataset, examining a variety of essential characteristics across two dimensions: single-node deduplication and cluster deduplication. For single-node deduplication analysis, our primary focus was individual-user data. Despite apparently similar roles and behavior among all of our users, we found significant differences in their deduplication ratios. Moreover, the data that some users share with others had a much higher deduplication ratio than average. For cluster deduplication analysis, we implemented seven published data-routing algorithms and created a detailed comparison of their performance with respect to deduplication ratio, load distribution, and communication overhead. We found that per-file routing achieves a higher deduplication ratio than routing by super-chunk (multiple consecutive chunks), but it also leads to high data skew (imbalance of space usage across nodes). We also found that large chunking sizes are better for cluster deduplication, as they significantly reduce data-routing overhead, while their negative impact on deduplication ratios is small and acceptable. We draw interesting conclusions from both single-node and cluster deduplication analysis and make recommendations for future deduplication systems design.
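The terms used in the abstract (deduplication ratio, per-file versus super-chunk routing, and data skew) can be made concrete with a minimal sketch. It is not one of the seven published algorithms the article evaluates. The function names, the use of SHA-1 fingerprints, the choice of the smallest fingerprint as a routing key, and the skew formula (largest node usage divided by mean node usage) are all assumptions made for illustration. Files are assumed to be already split into chunks.

```python
import hashlib


def fp(chunk: bytes) -> int:
    """Chunk fingerprint as an integer (SHA-1 is an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def route(files, num_nodes, group_size=None):
    """Assign chunks to nodes; return one {fingerprint: size} store per node.

    group_size=None routes each whole file as one unit (per-file routing).
    Otherwise every run of `group_size` consecutive chunks (a super-chunk)
    is routed independently.
    """
    nodes = [dict() for _ in range(num_nodes)]
    for chunks in files:
        step = group_size or max(len(chunks), 1)
        for i in range(0, len(chunks), step):
            unit = chunks[i:i + step]
            fps = [fp(c) for c in unit]
            # Hypothetical routing key: the smallest fingerprint in the unit.
            target = min(fps) % num_nodes
            for f, c in zip(fps, unit):
                nodes[target][f] = len(c)  # duplicates on one node collapse
    return nodes


def dedup_ratio(files, nodes):
    """Logical bytes written divided by bytes actually stored on all nodes."""
    logical = sum(len(c) for chunks in files for c in chunks)
    stored = sum(sum(store.values()) for store in nodes)
    return logical / stored


def data_skew(nodes):
    """Largest per-node usage divided by mean per-node usage (assumed metric)."""
    used = [sum(store.values()) for store in nodes]
    return max(used) / (sum(used) / len(used))
```

Calling `route(files, num_nodes)` and `route(files, num_nodes, group_size=k)` on the same input and comparing `dedup_ratio` and `data_skew` gives a toy version of the comparison the abstract describes. Duplicates are only detected when they land on the same node, which is why the routing granularity affects both metrics.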

Funder

China 863

ONR

National Natural Science Foundation of China

NSF

Dell-EMC

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture

References: 43 articles.

Cited by 9 articles.

1. ObjDedup: High-Throughput Object Storage Layer for Backup Systems With Block-Level Deduplication;IEEE Transactions on Parallel and Distributed Systems;2023-07

2. Dataset Similarity Detection for Global Deduplication in the DD File System;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04

3. Whole-File Chunk-Based Deduplication Using Reinforcement Learning for Cloud Storage;2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM);2022-11-10

4. Performance Centric Primary Storage Deduplication Systems Exploiting Caching and Block Similarity;2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM);2022-01-03

5. Enabling Secure and Space-Efficient Metadata Management in Encrypted Deduplication;IEEE Transactions on Computers;2021
