Affiliation:
1. RPTU Kaiserslautern-Landau, Kaiserslautern, Germany
Abstract
Data lakes deployed in the cloud are a go-to solution for enterprise data storage. While the pay-as-you-go cost model allows flexible resource allocation and billing, it demands efficient use of resources such as CPU hours, network traffic, and storage. The distributed nature of cloud environments necessitates partitioning the data and processing these partitions separately. In this work, we put forward a practical solution that improves the efficiency of compression algorithms on Dremel-encoded data by clustering similarly structured nested data at ingestion time, so that more compressible partitions can be created. We propose a clustering approach inspired by decision trees that outpaces even the naive partition-then-sort approach by up to a factor of 17.44 while also improving compression by up to a factor of 2. We further show that sorting the individual buckets achieves a compression boost competitive with the well-established increasing-cardinality heuristic, but at a lower ingestion time.
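To illustrate why grouping similarly structured records can help, the following is a minimal sketch and not the decision-tree-inspired method proposed in the paper. It assumes that a 0/1 "field present" flag per record is an acceptable stand-in for a Dremel definition level, and it groups records by their exact set of field paths. The helper names (`field_paths`, `cluster_by_structure`, `presence_runs`) and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def field_paths(record, prefix=""):
    """Collect the leaf field paths present in a nested dict (its 'structure')."""
    paths = []
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths.extend(field_paths(value, path))
        else:
            paths.append(path)
    return tuple(sorted(paths))


def cluster_by_structure(records):
    """Group records with identical field-path sets (buckets in first-seen order)."""
    buckets = defaultdict(list)
    for record in records:
        buckets[field_paths(record)].append(record)
    return list(buckets.values())


def presence_runs(records, path):
    """Count run-length-encoding runs of the 0/1 'path is present' column.

    Assumption: the presence flag is a simplified stand-in for a Dremel
    definition level; fewer runs means a shorter run-length encoding.
    """
    flags = [int(path in field_paths(r)) for r in records]
    return sum(1 for _ in groupby(flags))


# Hypothetical ingestion stream in which an optional nested field alternates.
records = [
    {"id": 1, "user": {"name": "a"}},
    {"id": 2},
    {"id": 3, "user": {"name": "b"}},
    {"id": 4},
    {"id": 5, "user": {"name": "c"}},
    {"id": 6},
]

buckets = cluster_by_structure(records)
clustered = [r for bucket in buckets for r in bucket]

print(presence_runs(records, "user.name"))    # runs in ingestion order
print(presence_runs(clustered, "user.name"))  # runs after clustering
```

In this toy stream the presence column alternates on every record in ingestion order, whereas after clustering each bucket holds a constant value, so a run-length encoder has far fewer runs to store. Real Dremel encoding also tracks repetition levels and multi-valued definition levels, which this sketch leaves out.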
Publisher
Association for Computing Machinery (ACM)