Cluster and Single-Node Analysis of Long-Term Deduplication Patterns

Published:2018-05-25 Issue:2 Volume:14 Page:1-27
ISSN:1553-3077
Container-title:ACM Transactions on Storage
language:en
Short-container-title:ACM Trans. Storage

Author:

Sun Zhen “Jason”¹, Kuenning Geoff², Mandal Sonam³, Shilane Philip⁴, Tarasov Vasily⁵, Xiao Nong¹, Zadok Erez³

Affiliation:

1. National University of Defense Technology, Hunan, P.R. China

2. Harvey Mudd College, CA, USA

3. Stony Brook University, NY, USA

4. Dell EMC, PA, USA

5. IBM Research, California, USA

Abstract

Deduplication has become essential in disk-based backup systems, but there have been few long-term studies of backup workloads. Most past studies either were of a small static snapshot or covered only a short period that was not representative of how a backup system evolves over time. For this article, we first collected 21 months of data from a shared user file system; 33 users and over 4,000 snapshots are covered. We then analyzed the dataset, examining a variety of essential characteristics across two dimensions: single-node deduplication and cluster deduplication. For single-node deduplication analysis, our primary focus was individual-user data. Despite apparently similar roles and behavior among all of our users, we found significant differences in their deduplication ratios. Moreover, the data that some users share with others had a much higher deduplication ratio than average. For cluster deduplication analysis, we implemented seven published data-routing algorithms and created a detailed comparison of their performance with respect to deduplication ratio, load distribution, and communication overhead. We found that per-file routing achieves a higher deduplication ratio than routing by super-chunk (multiple consecutive chunks), but it also leads to high data skew (imbalance of space usage across nodes). We also found that large chunking sizes are better for cluster deduplication, as they significantly reduce data-routing overhead, while their negative impact on deduplication ratios is small and acceptable. We draw interesting conclusions from both single-node and cluster deduplication analysis and make recommendations for future deduplication systems design.
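The terms used in the abstract (deduplication ratio, per-file versus super-chunk routing, and data skew) can be illustrated with a small sketch. It is not one of the seven routing algorithms evaluated in the article. Fixed-size chunking, SHA-1 fingerprints, routing on the minimum fingerprint of a group, and the names `route_and_store`, `dedup_ratio`, and `skew` are all assumptions made for illustration.

```python
import hashlib


def split_chunks(data: bytes, size: int):
    """Fixed-size chunking (an assumption; real systems often use content-defined chunking)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk: bytes) -> str:
    """Chunk fingerprint; SHA-1 is chosen here only for illustration."""
    return hashlib.sha1(chunk).hexdigest()


def route_and_store(files, num_nodes, super_chunk_len=None, chunk_size=4):
    """Route chunk groups to nodes and deduplicate locally on each node.

    super_chunk_len=None routes each whole file as one group (per-file routing);
    otherwise every `super_chunk_len` consecutive chunks form one super-chunk.
    The target node is picked from the smallest fingerprint in the group
    (a hypothetical stateless rule, not taken from the article).
    Returns (logical_bytes, per_node_stored_bytes).
    """
    nodes = [dict() for _ in range(num_nodes)]  # fingerprint -> chunk length
    logical = 0
    for data in files:
        logical += len(data)
        chunks = split_chunks(data, chunk_size)
        group_len = max(1, len(chunks) if super_chunk_len is None else super_chunk_len)
        for start in range(0, len(chunks), group_len):
            group = chunks[start:start + group_len]
            fps = [fingerprint(c) for c in group]
            target = int(min(fps), 16) % num_nodes
            for fp, chunk in zip(fps, group):
                # A chunk is stored only if this node has not seen it before.
                nodes[target].setdefault(fp, len(chunk))
    return logical, [sum(node.values()) for node in nodes]


def dedup_ratio(logical_bytes, stored_bytes):
    """Logical size divided by physically stored size."""
    return logical_bytes / stored_bytes if stored_bytes else 1.0


def skew(usage):
    """Data skew expressed as max node usage over mean node usage (one possible metric)."""
    mean = sum(usage) / len(usage)
    return max(usage) / mean if mean else 1.0


if __name__ == "__main__":
    files = [b"aaaabbbb", b"aaaacccc", b"aaaabbbb"]
    for label, sc_len in (("per-file", None), ("super-chunk", 1)):
        logical, usage = route_and_store(files, num_nodes=4, super_chunk_len=sc_len)
        print(label, dedup_ratio(logical, sum(usage)), skew(usage))
```

In this sketch each node only deduplicates against its own contents, so a chunk that lands on two nodes is stored twice. That is the trade-off the abstract describes: the routing granularity affects both the cluster-wide deduplication ratio and how evenly stored bytes spread across nodes.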

Funder

China 863

ONR

National Natural Science Foundation of China

NSF

Dell-EMC

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3183890

References: 43 articles.

Cited by 9 articles.

1. ObjDedup: High-Throughput Object Storage Layer for Backup Systems With Block-Level Deduplication;IEEE Transactions on Parallel and Distributed Systems;2023-07

2. Dataset Similarity Detection for Global Deduplication in the DD File System;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04

3. Whole-File Chunk-Based Deduplication Using Reinforcement Learning for Cloud Storage;2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM);2022-11-10

4. Performance Centric Primary Storage Deduplication Systems Exploiting Caching and Block Similarity;2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM);2022-01-03

5. Enabling Secure and Space-Efficient Metadata Management in Encrypted Deduplication;IEEE Transactions on Computers;2021