High-Ratio Compression for Machine-Generated Data

Author:

Zhang Jiujing (1), Shen Zhitao (2), Yang Shiyu (3), Meng Lingkai (4), Xiao Chuan (5), Jia Wei (2), Li Yue (2), Sun Qinhui (2), Zhang Wenjie (6), Lin Xuemin (7)

Affiliation:

1. Guangzhou University & University of New South Wales, Guangzhou, China

2. Ant Group, Shanghai, China

3. Guangzhou University, Guangzhou, China

4. Ant Group & Shanghai Jiao Tong University, Shanghai, China

5. Osaka University & Nagoya University, Osaka & Nagoya, Japan

6. University of New South Wales, Sydney, NSW, Australia

7. Shanghai Jiao Tong University, Shanghai, China

Abstract

Machine-generated data is growing rapidly and poses challenges for data-intensive systems, especially as the growth of data outpaces the growth of storage space. To cope with the storage issue, compression plays a critical role in storage engines, particularly for data-intensive applications, where a high compression ratio and efficient random access are essential. However, existing compression techniques tend to focus on general-purpose and data block approaches, overlooking the inherent structure of machine-generated data, which results in low compression ratios or limited lookup efficiency. To address these limitations, we introduce the Pattern-Based Compression (PBC) algorithm, which specifically targets patterns in machine-generated data to achieve Pareto-optimality in most cases. Unlike traditional data block-based methods, PBC compresses data on a per-record basis, facilitating rapid random access. Our experimental evaluation demonstrates that PBC, on average, achieves a compression ratio twice as high as the state-of-the-art techniques while maintaining competitive compression and decompression speeds. We also integrate PBC into a production database system and achieve improvements in both compression ratio and throughput.
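The per-record idea in the abstract can be illustrated with a minimal sketch. This is not the PBC algorithm itself: the abstract does not describe how patterns are extracted or encoded, so the hand-written pattern list, the `*` wildcard syntax, and the function names below are all assumptions made for illustration. The sketch only shows why per-record compression allows random access: each record is stored as a pattern ID plus its variable fields and can be decoded without touching its neighbours.

```python
import re

# Hypothetical pattern syntax: literal text with "*" marking a variable field.
def build_patterns(templates):
    """Pair each template with a regex that captures its variable fields."""
    compiled = []
    for template in templates:
        literals = [re.escape(part) for part in template.split("*")]
        # Lazy groups plus fullmatch() tile the record as literal/field/literal...
        compiled.append((template, re.compile("(.*?)".join(literals), re.DOTALL)))
    return compiled

def compress_record(record, patterns):
    """Return (pattern_id, fields); pattern_id -1 means stored uncompressed."""
    for pattern_id, (_, regex) in enumerate(patterns):
        match = regex.fullmatch(record)
        if match:
            return pattern_id, list(match.groups())
    return -1, [record]

def decompress_record(pattern_id, fields, patterns):
    """Rebuild one record on its own, without reading any other record."""
    if pattern_id == -1:
        return fields[0]
    literals = patterns[pattern_id][0].split("*")
    pieces = [literals[0]]
    for field, literal in zip(fields, literals[1:]):
        pieces.append(field)
        pieces.append(literal)
    return "".join(pieces)

# Illustrative patterns and record (made up, not taken from the paper).
PATTERNS = build_patterns(["GET /user/* status=*", "ERROR code=* at *"])
encoded = compress_record("GET /user/42 status=200", PATTERNS)
restored = decompress_record(*encoded, PATTERNS)
```

In this sketch the shared literal text lives once in the pattern table, so each stored record carries only its pattern ID and field values. Records that match no pattern fall back to being stored verbatim.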

Funder

JSPS Kakenhi

National Key R&D Program of China

ARC Future Fellowship

NSFC

China Scholarship Council scholarship

CCF-AFSG Research Fund

ARC Discovery Project

CREST

Guangdong Basic and Applied Basic Research Foundation

Publisher

Association for Computing Machinery (ACM)

