Abstract
We study the problem of efficiently estimating counts for queries involving complex filters, such as user-defined functions, or predicates involving self-joins and correlated subqueries. For such queries, traditional sampling techniques may not be applicable due to the complexity of the filter preventing sampling over joins, and sampling after the join may not be feasible due to the cost of computing the full join. The other natural approach of training and using an inexpensive classifier to estimate the count instead of the expensive predicate suffers from the difficulties in training a good classifier and giving meaningful confidence intervals. In this paper we propose a new method of learning to sample where we combine the best of both worlds by using sampling in two phases. First, we use samples to learn a probabilistic classifier, and then use the classifier to design a stratified sampling method to obtain the final estimates. We theoretically analyze algorithms for obtaining an optimal stratification, and compare our approach with a suite of natural alternatives like quantification learning, weighted and stratified sampling, and other techniques from the literature. We also provide extensive experiments in diverse use cases using multiple real and synthetic datasets to evaluate the quality, efficiency, and robustness of our approach.
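The second phase described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes the classifier scores already exist (as if produced by a probabilistic classifier trained on an initial labelled sample), and it uses equal-sized strata with proportional allocation in place of the optimal stratification the paper analyzes. The function name `stratified_count_estimate` and its parameters are hypothetical.

```python
import random
from typing import Callable, Sequence


def stratified_count_estimate(
    items: Sequence,
    scores: Sequence[float],
    predicate: Callable[[object], bool],
    num_strata: int = 4,
    budget: int = 200,
    seed: int = 0,
) -> float:
    """Estimate how many items satisfy `predicate` using score-based strata.

    `scores` are assumed to be classifier outputs, one per item.
    `budget` is the approximate number of expensive predicate evaluations.
    """
    rng = random.Random(seed)
    n = len(items)
    # Order item indices by score so each stratum groups similar scores.
    order = sorted(range(n), key=lambda i: scores[i])
    # Equal-sized strata: a simplifying assumption, not an optimal layout.
    bounds = [round(k * n / num_strata) for k in range(num_strata + 1)]
    estimate = 0.0
    for k in range(num_strata):
        stratum = order[bounds[k]:bounds[k + 1]]
        if not stratum:
            continue
        # Proportional allocation: sample size scales with stratum size.
        m = max(1, min(len(stratum), round(budget * len(stratum) / n)))
        sample = rng.sample(stratum, m)
        # The expensive predicate is evaluated only on the sampled items.
        hits = sum(1 for i in sample if predicate(items[i]))
        # Scale the sampled positive rate up to the whole stratum.
        estimate += len(stratum) * hits / m
    return estimate


if __name__ == "__main__":
    items = list(range(1000))
    is_match = lambda x: x % 10 == 0  # stand-in for an expensive filter
    scores = [0.9 if x % 10 == 0 else 0.1 for x in items]  # idealized scores
    print(stratified_count_estimate(items, scores, is_match))
```

When the scores separate matching from non-matching items well, most strata are nearly pure, so the per-stratum variance is small and the estimate tends to be tighter than uniform sampling with the same budget.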
Subject
General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development
Cited by
9 articles.
1. CDFRS: A scalable sampling approach for efficient big data analysis;Information Processing & Management;2024-07
2. Automating localized learning for cardinality estimation based on XGBoost;Knowledge and Information Systems;2024-06-01
3. A simple and efficient point cloud sampling strategy based on cluster merging;2023 3rd International Conference on Robotics, Automation and Intelligent Control (ICRAIC);2023-11-24
4. Tuple Bubbles: Learned Tuple Representations for Tunable Approximate Query Processing;Proceedings of the Sixth International Workshop on Exploiting Artificial Intelligence Techniques for Data Management;2023-06-18
5. JanusAQP: Efficient Partition Tree Maintenance for Dynamic Approximate Query Processing;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04