Abstract
The open-source FastLanes project aims to improve big data formats, such as Parquet, ORC and columnar database formats, in multiple ways. In this paper, we significantly accelerate decoding of all common Light-Weight Compression (LWC) schemes (DICT, FOR, DELTA and RLE) through better data-parallelism. We do so by re-designing the compression layout using two main ideas: (i) generalizing the value interleaving technique in the basic operation of bit-(un)packing by targeting a virtual 1024-bit SIMD register, and (ii) reordering the tuples in all columns of a table in the same Unified Transposed Layout that puts tuple chunks in a common "04261537" order (explained in the paper), allowing for maximum independent work for all possible basic SIMD lane widths: 8, 16, 32, and 64 bits.
We address the software development, maintenance and future-proofing challenges of increasing hardware diversity by defining a virtual 1024-bit instruction set that consists of simple operators supported by all SIMD dialects and, importantly, by scalar code. The interleaved and tuple-reordered layout actually makes scalar decoding faster, extracting more data-parallelism from today's wide-issue CPUs. Moreover, the scalar version can be fully auto-vectorized by modern compilers, eliminating the technical debt in software caused by platform-specific SIMD intrinsics.
Micro-benchmarks on Intel, AMD, Apple and AWS CPUs show that FastLanes accelerates decoding by factors, decoding more than 40 values per CPU cycle. FastLanes can make queries faster, as compressing the data reduces bandwidth needs, while decoding is almost free.
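To make the interleaving idea concrete, the following is a minimal sketch of bit-(un)packing against a virtual 1024-bit register, written as plain scalar code. It is an illustration under stated assumptions, not the FastLanes implementation: the function names `pack` and `unpack`, the block size of 1024 values, the fixed 32-bit lane width, and the round-robin assignment of values to lanes are all assumptions made for this example, and the "04261537" tuple reordering is not modeled.

```python
# Illustrative sketch only: names and layout details are assumptions,
# not the FastLanes API.
REGISTER_BITS = 1024                 # virtual SIMD register width
LANE_BITS = 32                       # one of the lane widths 8, 16, 32, 64
LANES = REGISTER_BITS // LANE_BITS   # 32 independent lanes
BLOCK = 1024                         # values per block (assumed)
ROWS = BLOCK // LANES                # values handled by each lane
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values, width):
    """Bit-pack BLOCK values of `width` bits into an interleaved layout.

    Value i belongs to lane i % LANES. Word j of lane l is stored at
    index j * LANES + l, so one "register" holds word j of every lane.
    """
    assert len(values) == BLOCK and 0 < width <= LANE_BITS
    packed = [0] * (width * LANES)   # each lane needs `width` words
    for row in range(ROWS):
        word, shift = divmod(row * width, LANE_BITS)
        spill = shift + width - LANE_BITS   # bits overflowing into the next word
        for lane in range(LANES):
            v = values[row * LANES + lane]
            assert 0 <= v < (1 << width)
            packed[word * LANES + lane] |= (v << shift) & LANE_MASK
            if spill > 0:   # value straddles two words of the same lane
                packed[(word + 1) * LANES + lane] |= v >> (LANE_BITS - shift)
    return packed


def unpack(packed, width):
    """Inverse of pack: every lane applies the same shift and mask per row."""
    mask = (1 << width) - 1
    out = [0] * BLOCK
    for row in range(ROWS):
        word, shift = divmod(row * width, LANE_BITS)
        spill = shift + width - LANE_BITS
        # The inner loop performs identical, independent work in every lane;
        # this is the part a SIMD instruction would execute in one step.
        for lane in range(LANES):
            v = packed[word * LANES + lane] >> shift
            if spill > 0:
                v |= packed[(word + 1) * LANES + lane] << (LANE_BITS - shift)
            out[row * LANES + lane] = v & mask
    return out
```

In this sketch, packing 1024 values at 11 bits each yields 11 × 32 = 352 words of 32 bits instead of 1024. Because no value ever crosses from one lane into another, the inner loop over lanes has no dependencies between iterations, which is the property that lets equivalent scalar code in a compiled language be auto-vectorized.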
Publisher
Association for Computing Machinery (ACM)
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development
Cited by
4 articles.
1. NULLS!: Revisiting Null Representation in Modern Columnar Formats;Proceedings of the 20th International Workshop on Data Management on New Hardware;2024-06-09
2. Accelerating GPU Data Processing using FastLanes Compression;Proceedings of the 20th International Workshop on Data Management on New Hardware;2024-06-09
3. ALP: Adaptive Lossless floating-Point Compression;Proceedings of the ACM on Management of Data;2023-12-08
4. An Empirical Evaluation of Columnar Storage Formats;Proceedings of the VLDB Endowment;2023-10