An Empirical Evaluation of Columnar Storage Formats

Published:2023-10 Issue:2 Volume:17 Page:148-161
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Zeng Xinyu¹, Hui Yulong¹, Shen Jiahong¹, Pavlo Andrew², McKinney Wes³, Zhang Huanchen¹

Affiliation:

1. Tsinghua University

2. Carnegie Mellon University

3. Voltron Data

Abstract

Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support for open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. We also point out the inefficiencies in the format designs when handling common machine learning workloads and using GPUs for decoding. Our analysis identified important considerations that may guide future formats to better fit modern technology trends.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.14778/3626292.3626298

Cited by 1 article.

1. Performance of Null Handling in Array Databases;2023 IEEE International Conference on Big Data (BigData);2023-12-15