
Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion Pipelines

Published:2024-07 Issue:11 Volume:17 Page:3456-3469
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Hansert Patrick¹, Michel Sebastian¹

Affiliation:

1. RPTU Kaiserslautern-Landau, Kaiserslautern, Germany

Abstract

Data lakes deployed in the cloud are a go-to solution for enterprise data storage. While the pay-as-you-go cost model allows flexible resource allocation and billing, it demands efficient use of resources such as CPU hours, network traffic, and storage. The distributed nature of cloud environments necessitates partitioning the data and processing these partitions separately. In this work, we put forward a practical solution to improve the efficiency of compression algorithms on Dremel-encoded data by clustering similarly structured nested data at ingestion time, such that compressible partitions can be created. We propose a clustering approach inspired by decision trees that outpaces even the naive partition-then-sort approach by up to a factor of 17.44 while also boosting compression by up to a factor of 2. We further show that, when the individual buckets are sorted, a compression boost competitive with the well-established increasing-cardinality heuristic can be achieved, but with a lower ingestion time.
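The idea of grouping similarly structured nested records before compression can be illustrated with a minimal Python sketch. It is not the decision-tree-inspired clustering proposed in the paper: it simply buckets records by their exact set of key paths and compares zlib-compressed sizes of JSON text, standing in for Dremel-encoded columnar data. The helper names (`signature`, `partition_by_structure`, `compressed_size`) and the toy records are assumptions made for illustration only.

```python
import json
import zlib
from collections import defaultdict


def signature(record, prefix=""):
    """Return the sorted tuple of key paths of a nested dict (hypothetical helper)."""
    paths = []
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths.extend(signature(value, path))  # descend into nested objects
        else:
            paths.append(path)
    return tuple(sorted(paths))


def partition_by_structure(records):
    """Group records that share exactly the same set of key paths."""
    buckets = defaultdict(list)
    for record in records:
        buckets[signature(record)].append(record)
    return dict(buckets)


def compressed_size(records):
    """Size in bytes of the zlib-compressed, newline-delimited JSON payload."""
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return len(zlib.compress(payload.encode("utf-8")))


# Toy input: three record shapes interleaved, as they might arrive at ingestion.
records = []
for i in range(300):
    if i % 3 == 0:
        records.append({"user": {"id": i, "name": "a"}, "event": "click"})
    elif i % 3 == 1:
        records.append({"sensor": {"temp": 20 + i % 5}, "ts": i})
    else:
        records.append({"order": {"sku": "x", "qty": i % 4}, "paid": True})

buckets = partition_by_structure(records)
unpartitioned = compressed_size(records)
partitioned = sum(compressed_size(b) for b in buckets.values())
print(f"{len(buckets)} buckets, {unpartitioned} bytes interleaved, "
      f"{partitioned} bytes partitioned")
```

Exact-signature bucketing is the simplest possible grouping rule. A real ingestion pipeline would likely need to merge rare structures and bound the number of partitions, which is where a tree-based clustering such as the one the abstract describes would come in.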

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.14778/3681954.3682013
