An End-to-end High-performance Deduplication Scheme for Docker Registries and Docker Container Storage Systems

Author:

Zhao Nannan (1), Lin Muhui (2), Albahar Hadeel (3), Paul Arnab K. (4), Huan Zhijie (1), Abraham Subil (5), Chen Keren (6), Tarasov Vasily (7), Skourtis Dimitrios (8), Anwar Ali (9), Butt Ali (10)

Affiliation:

1. Northwestern Polytechnical University, Xi'an, China

2. Alibaba Group, Hangzhou, China

3. Kuwait University, Kuwait, Kuwait

4. BITS Pilani - KK Birla Goa Campus, Zuarinagar, India

5. Oak Ridge National Laboratory, Oak Ridge, USA

6. Virginia Tech, Blacksburg, USA

7. IBM Research - Almaden, San Jose, USA

8. IBM Research - Almaden, San Jose, USA

9. University of Minnesota, Twin Cities, Minneapolis, USA

10. Virginia Tech, Blacksburg, USA

Abstract

The wide adoption of Docker containers for supporting agile and elastic enterprise applications has led to a broad proliferation of container images. The associated storage performance and capacity requirements place high pressure on two parts of the infrastructure: container registries, which store and distribute images, and container storage systems on the Docker client side, which manage image layers and store ephemeral data generated at container runtime. The storage demand is worsened by the large amount of duplicate data in images. Moreover, container storage systems that use Copy-on-Write (CoW) file systems as storage drivers exacerbate the redundancy. Exploiting the high file redundancy in real-world images is a promising approach to drastically reduce the growing storage requirements of container registries and improve the space efficiency of container storage systems. However, existing deduplication techniques significantly degrade the performance of both registries and container storage systems because of data reconstruction overhead as well as the deduplication cost. We propose DupHunter, an end-to-end deduplication scheme that deduplicates layers for both Docker registries and container storage systems while maintaining a high image distribution speed and container I/O performance. DupHunter is divided into three tiers: a registry tier, a middle tier, and a client tier. Specifically, we first build a high-performance deduplication engine at the registry tier that not only natively deduplicates layers for space savings but also reduces layer restore overhead. Then, we use deduplication offloading at the middle tier to eliminate redundant files from the client tier and avoid imposing deduplication overhead on the clients. To further reduce the duplicate data caused by CoW and improve container I/O performance, we utilize a container-aware storage system at the client tier that reserves space for each container and arranges the placement of files and their modifications on disk to preserve locality. Under real workloads, DupHunter reduces storage space by up to 6.9× and reduces GET layer latency by up to 2.8× compared to the state of the art. Moreover, DupHunter improves container I/O performance by up to 93% for reads and 64% for writes.
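The general idea of file-level layer deduplication that the abstract refers to can be illustrated with a minimal sketch. This is not DupHunter's implementation: the `FileStore` class, the `logical_bytes` helper, and the in-memory dictionaries standing in for layer archives are all hypothetical names chosen for illustration. Each unique file is stored once under its content hash, and each layer keeps only a recipe of paths and hashes. Rebuilding a layer from its recipe is the kind of reconstruction step that adds restore overhead.

```python
import hashlib


class FileStore:
    """Hypothetical content-addressed store: one copy per unique file content."""

    def __init__(self):
        self.blobs = {}    # SHA-256 digest -> file bytes (stored once)
        self.recipes = {}  # layer id -> list of (path, digest)

    def put_layer(self, layer_id, files):
        """Deduplicate a layer given as a {path: bytes} mapping."""
        recipe = []
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            # Keep the bytes only if this content has not been seen before.
            self.blobs.setdefault(digest, data)
            recipe.append((path, digest))
        self.recipes[layer_id] = recipe

    def get_layer(self, layer_id):
        """Restore a layer by looking up every file named in its recipe."""
        return {path: self.blobs[digest] for path, digest in self.recipes[layer_id]}

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


def logical_bytes(layers):
    """Total size of all layers before deduplication."""
    return sum(len(data) for files in layers.values() for data in files.values())


# Two toy layers that share one file (assumed data, not from the paper).
layers = {
    "layer-a": {"bin/sh": b"x" * 400, "etc/os-release": b"y" * 100},
    "layer-b": {"bin/sh": b"x" * 400, "app/main.py": b"z" * 100},
}

store = FileStore()
for layer_id, files in layers.items():
    store.put_layer(layer_id, files)

# Deduplication ratio = logical size / physically stored size.
ratio = logical_bytes(layers) / store.stored_bytes()
print(f"logical={logical_bytes(layers)} stored={store.stored_bytes()} ratio={ratio:.2f}x")
```

In this toy case the shared `bin/sh` file is stored once, so 1000 logical bytes occupy 600 stored bytes. A real registry handles compressed tar archives rather than dictionaries, so decompression and repacking costs would add to the restore path.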

Funder

Guangdong Basic and Applied Basic Research Foundation

National Science Foundation for Young Scientists of China

Chinese National Key Research and Development Program

Shaanxi Key Research and Development Program

Major Research Plan of the National Natural Science Foundation of China

National Science Foundation of China for General Program

BITS Pilani-BBF/BIT

NSF

Publisher

Association for Computing Machinery (ACM)

