Reordering rows for better compression

Published:2012-08 Issue:3 Volume:37 Page:1-29
ISSN:0362-5915
Container-title:ACM Transactions on Database Systems
language:en
Short-container-title:ACM Trans. Database Syst.

Author:

Lemire Daniel¹, Kaser Owen², Gutarra Eduardo²

Affiliation:

1. TELUQ

2. University of New Brunswick, Saint John

Abstract

Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs rather than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%: these gains are on top of the gains due to lexicographically sorting the table. We prove that the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns, in a few cases.
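The run-counting objective described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `count_runs`, `hamming`, and `nearest_neighbor_order` are hypothetical, the toy table is invented for illustration, and the greedy routine is a plain quadratic-time Nearest Neighbor pass under Hamming distance, not the Multiple Lists or Vortex heuristics from the paper.

```python
from typing import List, Sequence, Tuple

Row = Tuple[str, ...]


def count_runs(rows: Sequence[Row]) -> int:
    """Total number of runs of identical values, summed over all columns."""
    if not rows:
        return 0
    runs = len(rows[0])  # the first row opens one run in every column
    for prev, cur in zip(rows, rows[1:]):
        # each column whose value changes between adjacent rows starts a new run
        runs += sum(1 for a, b in zip(prev, cur) if a != b)
    return runs


def hamming(a: Row, b: Row) -> int:
    """Number of columns in which two rows differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def nearest_neighbor_order(rows: Sequence[Row]) -> List[Row]:
    """Greedy ordering: repeatedly append the unused row closest to the last one.

    Assumption: start from the lexicographically smallest row and break ties
    by lexicographic order. Runs in O(n^2) time, so it only suits small tables.
    """
    remaining = sorted(rows)
    if not remaining:
        return []
    order = [remaining.pop(0)]
    while remaining:
        last = order[-1]
        idx = min(range(len(remaining)), key=lambda i: hamming(last, remaining[i]))
        order.append(remaining.pop(idx))
    return order


# Invented toy table with two columns.
table = [("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")]

print(count_runs(table))                          # original order
print(count_runs(sorted(table)))                  # lexicographic order
print(count_runs(nearest_neighbor_order(table)))  # greedy reordering
```

On this toy table the three orders give 7, 6, and 5 runs respectively, so the greedy pass improves on the lexicographic sort. The quadratic cost of the greedy pass hints at the trade-off between running time and compression that the abstract mentions.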

Funder

Natural Sciences and Engineering Research Council of Canada

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/2338626.2338633

References: 81 articles.

1. Integrating compression and execution in column-oriented database systems

2. Abadi D. J., Madden S. R., and Hachem N. 2008. Column-stores vs. row-stores: How different are they really? In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 967-980. 10.1145/1376616.1376712

3. Mixed-Radix Gray Codes in Lee Metric

Cited by 20 articles.

1. Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion Pipelines;Proceedings of the VLDB Endowment;2024-07

2. REGER: Reordering Time Series Data for Regression Encoding;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13

3. Schema-based Column Reordering for Dremel-encoded Data;Proceedings of the International Workshop on Big Data in Emergent Distributed Environments;2023-06-18

4. Ameliorating data compression and query performance through cracked Parquet;Proceedings of The International Workshop on Big Data in Emergent Distributed Environments;2022-06-12

5. SortComp (Sort-and-Compress) - Towards a Universal Lossless Compression Scheme for Matrix and Tabular Data;2022 Data Compression Conference (DCC);2022-03