Affiliation:
1. École Polytechnique Fédérale de Lausanne, Lausanne, VD, Switzerland
Abstract
The increasing volume of data generated over time and the shift in storage technologies suggest that we might need to reconsider indexing. Several workloads, such as social and service monitoring, often include attributes with implicit clustering because of their time-dependent nature. In addition, solid-state disks (SSD), built on flash or other low-level technologies, are emerging as viable competitors to hard disk drives (HDD). Capacity and access times create a trade-off between the two: SSD replace the slow random accesses of HDD with efficient ones, but SSD capacity is one or more orders of magnitude more expensive than HDD capacity. Indexing, however, has been designed assuming HDD as secondary storage, and thus minimizes random accesses at the expense of capacity. Indexing data with SSD as secondary storage requires treating capacity as a scarce resource.
To this end, we introduce approximate tree indexing, which employs probabilistic data structures (Bloom filters) to trade accuracy for size and produce smaller, yet powerful, tree indexes, which we name Bloom filter trees (BF-Trees). BF-Trees exploit pre-existing data ordering or partitioning to offer competitive search performance. We demonstrate, both by an analytical study and by experimental results, that by using workload knowledge and reducing indexing accuracy to some extent, we can save substantially on capacity when indexing ordered or partitioned attributes. In particular, in experiments with a synthetic workload, approximate indexing offers a 2.22x-48x smaller index footprint with competitive response times, and in experiments with TPC-H and a real-life monitoring dataset from an energy company, it offers a 1.6x-4x smaller index footprint, again with competitive search times.
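The idea of trading accuracy for size can be illustrated with a minimal Python sketch. It is not the authors' BF-Tree implementation: the abstract does not specify the node layout, so the classes `BloomFilter` and `ApproximateIndex`, the parameters `partition_size`, `bits_per_key`, and `num_hashes`, and the one-filter-per-partition layout are all assumptions made for illustration. The sketch keeps only the smallest key and a small Bloom filter per partition of an already ordered key list, instead of one index entry per key.

```python
import bisect
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter (hypothetical helper, double hashing over SHA-256)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(repr(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class ApproximateIndex:
    """Assumed layout: one (min_key, Bloom filter) entry per partition."""

    def __init__(self, sorted_keys, partition_size=4, bits_per_key=10, num_hashes=7):
        self.keys = sorted_keys            # stands in for the ordered data file
        self.partition_size = partition_size
        self.min_keys = []                 # smallest key of each partition
        self.filters = []                  # one Bloom filter per partition
        self.partition_reads = 0           # counts simulated data accesses
        for start in range(0, len(sorted_keys), partition_size):
            part = sorted_keys[start:start + partition_size]
            bf = BloomFilter(bits_per_key * len(part), num_hashes)
            for k in part:
                bf.add(k)
            self.min_keys.append(part[0])
            self.filters.append(bf)

    def lookup(self, key):
        """Return the position of key in the ordered data, or None if absent."""
        # Data ordering lets us locate the only candidate partition.
        idx = bisect.bisect_right(self.min_keys, key) - 1
        if idx < 0 or not self.filters[idx].might_contain(key):
            return None                    # answered from the index alone
        # Possible match: read the partition; a false positive wastes this read.
        self.partition_reads += 1
        start = idx * self.partition_size
        part = self.keys[start:start + self.partition_size]
        pos = bisect.bisect_left(part, key)
        if pos < len(part) and part[pos] == key:
            return start + pos
        return None


index = ApproximateIndex(list(range(0, 100, 2)))
print(index.lookup(42))   # key is present, so its position is returned
print(index.lookup(43))   # key is absent, so None is returned
```

Because a Bloom filter never reports a false negative, every key that is present is found. The cost of the smaller index is that an absent key occasionally triggers a wasted partition read. With the assumed 10 bits per key and 7 hash functions, the standard Bloom filter analysis puts that false-positive rate at roughly 1%.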
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development
Cited by
45 articles.
1. Benchmarking Learned and LSM Indexes for Data Sortedness;Proceedings of the Tenth International Workshop on Testing Database Systems;2024-06-09
2. Spruce: a Fast yet Space-saving Structure for Dynamic Graph Storage;Proceedings of the ACM on Management of Data;2024-03-12
3. Approximate sorting and its applications in I/O model;Theoretical Computer Science;2024-03
4. GLIN: A (G)eneric (L)earned (In)dexing Mechanism for Complex Geometries;Proceedings of the 11th ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data;2023-11-13
5. Enabling Timely and Persistent Deletion in LSM-Engines;ACM Transactions on Database Systems;2023-08-09