Cuckoo index

Published:2020-09 Issue:13 Volume:13 Page:3559-3572
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Kipf Andreas¹, Chromejko Damian², Hall Alexander³, Boncz Peter⁴, Andersen David G.⁵

Affiliation:

1. MIT CSAIL

2. Google

3. RelationalAI

4. CWI

5. CMU

Abstract

In modern data warehousing, data skipping is essential for high query performance. While index structures such as B-trees or hash tables allow for precise pruning, their large storage requirements make them impractical for indexing secondary columns. Therefore, many systems rely on approximate indexes such as min/max sketches (ZoneMaps) or Bloom filters for cost-effective data pruning. For example, Google PowerDrill skips more than 90% of data on average using such indexes. In this paper, we introduce Cuckoo Index (CI), an approximate secondary index structure that represents the many-to-many relationship between keys and data partitions in a highly space-efficient way. At its core, CI associates variable-sized fingerprints in a Cuckoo filter with compressed bitmaps indicating qualifying partitions. With our approach, we target equality predicates in a read-only (immutable) setting and optimize for space efficiency under the premise of practical build and lookup performance. In contrast to per-partition (Bloom) filters, CI produces correct results for lookups with keys that occur in the data. CI also allows controlling the ratio of false positive partitions for lookups with non-occurring keys. Our experiments with real-world and synthetic data show that CI consumes significantly less space than per-partition filters for the same pruning power for low-to-medium cardinality columns. For high-cardinality columns, CI is on par with its baselines.
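
The core idea in the abstract, mapping a key fingerprint to a bitmap of qualifying partitions, can be illustrated with a small sketch. This is not the paper's implementation. It assumes a plain Python dict in place of a Cuckoo filter, fixed 16-bit fingerprints instead of variable-sized ones, and uncompressed integer bitmaps. The names `fingerprint`, `build_index`, and `lookup` are hypothetical. Because fingerprints can collide here, a lookup returns a superset of the partitions that contain the key; it never misses one.

```python
import hashlib


def fingerprint(key: str, bits: int = 16) -> int:
    """Derive a fixed-size fingerprint from a key (assumption: SHA-256 based)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)


def build_index(partitions, bits: int = 16) -> dict:
    """Map each fingerprint to an integer bitmap; bit i is set if partition i may hold the key."""
    index = {}
    for pid, keys in enumerate(partitions):
        for key in keys:
            fp = fingerprint(key, bits)
            index[fp] = index.get(fp, 0) | (1 << pid)
    return index


def lookup(index: dict, key: str, num_partitions: int, bits: int = 16) -> list:
    """Return the ids of partitions that must be scanned for an equality predicate on key."""
    bitmap = index.get(fingerprint(key, bits), 0)
    return [pid for pid in range(num_partitions) if (bitmap >> pid) & 1]


# Toy data: three partitions of a low-cardinality column.
partitions = [["us", "de"], ["fr"], ["de", "jp"]]
index = build_index(partitions)
candidates = lookup(index, "de", len(partitions))
print(candidates)  # partitions to scan; always includes 0 and 2
```

A single shared structure like this stores each distinct key once, whereas per-partition filters store it once per partition in which it occurs. That difference is the intuition behind the space savings the abstract reports for low-to-medium cardinality columns.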

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3424573.3424577

Cited by 11 articles.

1. Optimizing Collections of Bloom Filters within a Space Budget;Proceedings of the VLDB Endowment;2024-07

2. Predicate Caching: Query-Driven Secondary Indexing for Cloud Data Warehouses;Companion of the 2024 International Conference on Management of Data;2024-06-09

3. Perseid: A Secondary Indexing Mechanism for LSM-based Storage Systems;ACM Transactions on Storage;2023-11-17

4. Breathing New Life into an Old Tree: Resolving Logging Dilemma of B⁺-tree on Modern Computational Storage Drives;Proceedings of the VLDB Endowment;2023-10

5. Sieve: A Learned Data-Skipping Index for Data Analytics;Proceedings of the VLDB Endowment;2023-07