Affiliation:
1. Penn State University, University Park, PA, USA
Abstract
The execution of analytical queries on massive datasets presents challenges due to long response times and high computational costs. As a result, the analysis of representative samples of data has emerged as an attractive alternative: it avoids the cost of processing queries against the entire dataset, while still producing statistically valid results. Unfortunately, the sampling techniques in common use sacrifice either sample quality or performance, and so are poorly suited for this task. However, it is possible to build high-quality sample sets efficiently with the assistance of indexes. This introduces a new challenge: real-world data is subject to continuous update, and so the indexes must be kept up to date. This is difficult because existing sampling indexes present a dichotomy: efficient sampling indexes are difficult to update, while easily updatable indexes have poor sampling performance. This paper seeks to address this gap by proposing a general and practical framework for extending most sampling indexes with efficient update support, based on splitting indexes into smaller shards, combined with a systematic approach to their periodic reconstruction. The framework's design space is examined, with an eye towards exploring trade-offs between update performance, sampling performance, and memory usage. Three existing static sampling indexes are extended using this framework to support updates, and the generalization of the framework to concurrent operations and larger-than-memory data is discussed. Through a comprehensive suite of benchmarks, the extended indexes are shown to match or exceed the update throughput of state-of-the-art dynamic baselines, while presenting significant improvements in sampling latency.
Publisher
Association for Computing Machinery (ACM)
Cited by
1 article.
1. Towards Systematic Index Dynamization. Proceedings of the VLDB Endowment, 2024-07.