Affiliation:
1. Harbin Institute of Technology, Shenzhen, China
2. Dell Technologies, Newtown, PA, USA
3. Harbin Institute of Technology, Shenzhen, China and Wuhan National Laboratory for Optoelectronics, HUST, Wuhan, China
Abstract
Data deduplication is widely used to reduce the size of backup workloads, but it has the known disadvantage of causing poor data locality, also referred to as the fragmentation problem. Fragmentation results from the gap between the hyper-dimensional structure of deduplicated data and the sequential nature of many storage devices, and it leads to poor restore and garbage collection (GC) performance. Prior research has considered writing some duplicates to preserve locality (e.g., rewriting) or caching data in memory or SSD, yet fragmentation continues to lower restore and GC performance.
Investigating the locality issue, we design a method that flattens the hyper-dimensionally structured deduplicated data into a one-dimensional format based on the classification of each chunk's lifecycle; this forms our proposed data layout. We then present a novel management-friendly deduplication framework, called MFDedup, that applies this data layout and preserves locality as much as possible. MFDedup relies on two key techniques: Neighbor-duplicate-focus indexing (NDF) and the Across-version-aware Reorganization scheme (AVAR). NDF performs duplicate detection against the previous backup version, and AVAR then rearranges chunks with an offline, iterative algorithm into a compact, sequential layout that nearly eliminates random I/O during file restores after deduplication.
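To make the two techniques concrete, the following is a minimal, illustrative Python sketch of the NDF/AVAR workflow described above. It is not the authors' implementation: the function names (ndf_dedup, avar_reorganize, collect_garbage), the fixed-size chunking, and the in-memory dictionaries standing in for on-disk categories are simplifying assumptions made here for illustration only.

```python
import hashlib


def chunks(data, size=4096):
    """Split a backup stream into fixed-size chunks.
    (Real systems typically use content-defined chunking instead.)"""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk):
    return hashlib.sha256(chunk).hexdigest()


def ndf_dedup(backup, prev_index):
    """Neighbor-duplicate-focus indexing: look up fingerprints only in the
    previous version's index, never in a global index.

    Returns the version's recipe (fingerprint sequence), the chunks that must
    be written (not found in the neighbor), and the index for the next round."""
    recipe, to_write, new_index = [], {}, {}
    for chunk in chunks(backup):
        fp = fingerprint(chunk)
        recipe.append(fp)
        if fp not in prev_index:          # duplicate detection against the neighbor only
            to_write[fp] = chunk
        new_index[fp] = True
    return recipe, to_write, new_index


def avar_reorganize(categories, recipe, to_write, version):
    """Across-version-aware reorganization, run offline after each backup.

    categories maps (birth_version, last_referencing_version) -> {fp: chunk}.
    Only the previous version's categories (last == version - 1) are rewritten:
    chunks still referenced by the new version migrate forward, the rest stay
    behind and become a self-contained unit of future garbage."""
    alive = set(recipe)
    for key in [k for k in categories if k[1] == version - 1]:
        born = key[0]
        members = categories.pop(key)
        still_used = {fp: c for fp, c in members.items() if fp in alive}
        left_behind = {fp: c for fp, c in members.items() if fp not in alive}
        if still_used:
            categories.setdefault((born, version), {}).update(still_used)
        if left_behind:
            categories[(born, version - 1)] = left_behind
    if to_write:                           # chunks first written by this version
        categories[(version, version)] = dict(to_write)
    return categories


def collect_garbage(categories, oldest_retained):
    """GC drops whole categories whose last referencing version has expired;
    no reference counting or mark-and-sweep is needed."""
    for key in [k for k in categories if k[1] < oldest_retained]:
        del categories[key]


if __name__ == "__main__":
    v1 = b"A" * 8192 + b"B" * 4096        # three chunks: A, A, B
    v2 = b"A" * 8192 + b"C" * 4096        # shares the A chunks with v1

    recipe1, new1, index1 = ndf_dedup(v1, prev_index={})
    cats = avar_reorganize({}, recipe1, new1, version=1)

    recipe2, new2, index2 = ndf_dedup(v2, index1)
    cats = avar_reorganize(cats, recipe2, new2, version=2)

    collect_garbage(cats, oldest_retained=2)   # drops the B chunk only v1 needed
    print(sorted(cats.keys()))                 # [(1, 2), (2, 2)]
```

The (birth version, last referencing version) category keys are what make garbage collection nearly free in this sketch: retiring an old backup never requires tracing references, because every chunk needed only by expired versions sits in categories that can be removed wholesale.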
Evaluation results on five backup datasets demonstrate that, compared with state-of-the-art techniques, MFDedup achieves deduplication ratios that are 1.12× to 2.19× higher and restore throughputs that are 1.92× to 10.02× faster, owing to the improved data layout. Although the rearranging stage introduces overhead, it is more than offset by the nearly zero-overhead GC process. Moreover, the NDF index only needs to cover two backup versions, whereas a traditional index grows with the number of versions retained.
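For readers comparing the figures above, we assume the conventional definition of deduplication ratio (the paper itself may state it slightly differently):

$$\text{deduplication ratio} = \frac{\text{logical size of all backed-up versions}}{\text{physical size actually stored}}$$

Under this definition, a system whose ratio is 2.19× that of a baseline stores roughly 1/2.19 ≈ 46% of the baseline's physical data for the same logical backups.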
Funder
National Natural Science Foundation of China
Guangdong Basic and Applied Basic Research Foundation
Shenzhen Science and Technology Program
HITSZ-J&A Joint Laboratory of Digital Design and Intelligent Fabrication
Open Project Program of Wuhan National Laboratory for Optoelectronics
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture
Cited by
20 articles.