From Hyper-dimensional Structures to Linear Structures: Maintaining Deduplicated Data’s Locality

Author:

Zou Xiangyu¹, Yuan Jingsong¹, Shilane Philip², Xia Wen³, Zhang Haijun¹, Wang Xuan¹

Affiliation:

1. Harbin Institute of Technology, Shenzhen, China

2. Dell Technologies, Newtown, PA, USA

3. Harbin Institute of Technology, Shenzhen, China and Wuhan National Laboratory for Optoelectronics, HUST, Shenzhen, China

Abstract

Data deduplication is widely used to reduce the size of backup workloads, but it has the known disadvantage of causing poor data locality, also referred to as the fragmentation problem. This results from the gap between the hyper-dimensional structure of deduplicated data and the sequential nature of many storage devices, and this leads to poor restore and garbage collection (GC) performance. Current research has considered writing duplicates to maintain locality (e.g., rewriting) or caching data in memory or SSD, but fragmentation continues to lower restore and GC performance. Investigating the locality issue, we design a method to flatten the hyper-dimensional structured deduplicated data to a one-dimensional format, which is based on classification of each chunk’s lifecycle, and this creates our proposed data layout. Furthermore, we present a novel management-friendly deduplication framework, called MFDedup, that applies our data layout and maintains locality as much as possible. Specifically, we use two key techniques in MFDedup: Neighbor-duplicate-focus indexing (NDF) and Across-version-aware Reorganization scheme (AVAR). NDF performs duplicate detection against a previous backup, then AVAR rearranges chunks with an offline and iterative algorithm into a compact, sequential layout, which nearly eliminates random I/O during file restores after deduplication. Evaluation results with five backup datasets demonstrate that, compared with state-of-the-art techniques, MFDedup achieves deduplication ratios that are 1.12× to 2.19× higher and restore throughputs that are 1.92× to 10.02× faster due to the improved data layout. While the rearranging stage introduces overheads, it is more than offset by a nearly-zero overhead GC process. Moreover, the NDF index only requires indices for two backup versions, while the traditional index grows with the number of versions retained.
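The neighbor-focused duplicate detection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the use of SHA-1 fingerprints, and the in-memory dictionaries are all assumptions made for illustration. The sketch only shows the idea that each backup version is checked against the fingerprint index of the immediately preceding version, so at most two per-version indices are live at any time.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    # Hypothetical choice of fingerprint; the abstract does not name one.
    return hashlib.sha1(chunk).hexdigest()


def dedup_against_previous(chunks, prev_index):
    """Deduplicate one backup version against the previous version only.

    Returns (recipe, new_chunks, current_index):
      recipe        - ordered fingerprints needed to rebuild this version
      new_chunks    - fingerprint -> data for chunks absent from prev_index
      current_index - fingerprints of this version (the next version's prev_index)
    """
    recipe = []
    new_chunks = {}
    current_index = set()
    for chunk in chunks:
        fp = fingerprint(chunk)
        recipe.append(fp)
        if fp in current_index:
            continue  # duplicate inside the same version
        current_index.add(fp)
        if fp not in prev_index:
            new_chunks[fp] = chunk  # not in the neighboring version: store it
    return recipe, new_chunks, current_index


# Three toy versions; the index for v1 can be dropped once v3 is processed.
v1 = [b"a", b"b", b"c"]
v2 = [b"a", b"b", b"d"]
v3 = [b"c", b"d"]

_, new1, idx1 = dedup_against_previous(v1, set())
_, new2, idx2 = dedup_against_previous(v2, idx1)
_, new3, idx3 = dedup_against_previous(v3, idx2)
```

In this toy run, chunk `c` is missing from `v2`, so it is stored again for `v3`. A design that consults only the neighboring version accepts this kind of miss in exchange for an index whose size does not grow with the number of retained versions.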

Funder

National Natural Science Foundation of China

Guangdong Basic and Applied Basic Research Foundation

Shenzhen Science and Technology Program

HITSZ-J&A Joint Laboratory of Digital Design and Intelligent Fabrication

Open Project Program of Wuhan National Laboratory for Optoelectronics

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture

