Affiliation:
1. AT&T Labs, Florham Park, New Jersey
2. Universitá di Palermo, Palermo, Italy
Abstract
We study the problem of compressing massive tables within the partition-training paradigm introduced by Buchsbaum et al. [2000], in which a table is partitioned by an off-line training procedure into disjoint intervals of columns, each of which is compressed separately by a standard, on-line compressor like gzip. We provide a new theory that unifies previous experimental observations on partitioning and heuristic observations on column permutation, all of which are used to improve compression rates. Based on this theory, we devise the first on-line training algorithms for table compression, which can be applied to individual files, not just continuously operating sources; and also a new, off-line training algorithm, based on a link to the asymmetric traveling salesman problem, which improves on prior work by rearranging columns prior to partitioning. We demonstrate these results experimentally. On various test files, the on-line algorithms provide 35--55% improvement over gzip with negligible slowdown; the off-line reordering provides up to 20% further improvement over partitioning alone. We also show that a variation of the table compression problem is MAX-SNP hard.
Publisher
Association for Computing Machinery (ACM)
Subject
Artificial Intelligence,Hardware and Architecture,Information Systems,Control and Systems Engineering,Software
Cited by
25 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献
1. Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion Pipelines;Proceedings of the VLDB Endowment;2024-07
2. A Learned Approach to Design Compressed Rank/Select Data Structures;ACM Transactions on Algorithms;2022-07-31
3. Improving matrix-vector multiplication via lossless grammar-compressed matrices;Proceedings of the VLDB Endowment;2022-06
4. Predictive Coding for Lossless Dataset Compression;ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2021-06-06
5. IIU;Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems;2020-03-09