Affiliation:
1. Centrum Wiskunde & Informatica, Amsterdam, Netherlands
Abstract
IEEE 754 doubles do not exactly represent most real values, introducing rounding errors in computations and in [de]serialization to text. These rounding errors inhibit the use of existing lightweight compression schemes such as Delta and Frame Of Reference (FOR), but new schemes were recently proposed: Gorilla, Chimp128, PseudoDecimals (PDE), Elf and Patas. However, their compression ratios are no better than those of general-purpose compressors such as Zstd, while their [de]compression is much slower than Delta and FOR.
We propose and evaluate ALP, which significantly improves on these previous schemes in both speed and compression ratio (Figure 1). We created ALP after carefully studying the datasets used to evaluate the previous schemes. To obtain speed, ALP is designed to fit vectorized execution. This also turned out to be key to improving the compression ratio, because commonalities within a vector create compression opportunities. ALP is an adaptive scheme that uses a strongly enhanced version of PseudoDecimals [31] to losslessly encode doubles as integers if they originated as decimals, and otherwise uses vectorized compression of the doubles' front bits. Its high speed stems from our implementation in scalar code that auto-vectorizes, using building blocks provided by our FastLanes library [6], and from an efficient two-stage compression algorithm that first samples row-groups and then vectors.
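The decimal-to-integer idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and not the FastLanes API: the function names `encode_decimal`, `for_encode` and `decode`, the `max_exp` limit, and the single-exponent search are assumptions made for illustration, and the sketch assumes finite inputs. It looks for one power of ten under which every double in a vector round-trips exactly through an integer, then applies Frame Of Reference (FOR) to the resulting integers.

```python
import math

def encode_decimal(values, max_exp=14):
    """Return (e, ints) for the smallest exponent e such that every value
    round-trips exactly via round(v * 10**e) / 10**e, or None if no
    exponent up to max_exp works (hypothetical helper, finite inputs only)."""
    for e in range(max_exp + 1):
        scale = 10 ** e
        ints = [round(v * scale) for v in values]
        # Lossless check: decoding must reproduce the original double exactly.
        if all(i / scale == v for i, v in zip(ints, values)):
            return e, ints
    return None

def for_encode(ints):
    """Frame Of Reference: store the minimum once, plus small non-negative
    offsets and the bit width needed for the largest offset."""
    base = min(ints)
    deltas = [i - base for i in ints]
    width = max(deltas).bit_length()
    return base, width, deltas

def decode(e, base, deltas):
    """Invert for_encode and encode_decimal."""
    scale = 10 ** e
    return [(base + d) / scale for d in deltas]

values = [1.5, 2.25, 100.1, 0.07]
e, ints = encode_decimal(values)
base, width, deltas = for_encode(ints)
restored = decode(e, base, deltas)
```

In this example each value fits in 14 bits instead of 64 after encoding, and decoding restores the original doubles bit for bit. A vector containing a value such as `math.pi` has no suitable exponent in this sketch, which corresponds to the case where an adaptive scheme would fall back to a different encoding.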
Publisher
Association for Computing Machinery (ACM)
References (50 articles)
1. IEEE Standard for Floating-Point Arithmetic
2. 2019. Public BI Benchmark. https://github.com/cwida/public_bi_benchmark. Accessed on: 2023-04-13.
3. 2023. FastLanes. https://github.com/cwida/FastLanes. Accessed on: 2023-04-13.
4. Integrating compression and execution in column-oriented database systems
5. Azim Afroozeh and P Boncz. 2020. Towards a New File Format for Big Data: SIMD-Friendly Composable Compression.