Affiliation:
1. Renmin University of China, Beijing, China
2. The Chinese University of Hong Kong, HongKong, Hong Kong
3. Alibaba Group, China
Abstract
The problem of estimating data properties using sampling frequency histograms has attracted extensive interest in the area of databases. The properties include the number of distinct values (NDV), entropy, and so on. In the field of databases, property estimation is fundamental to complex applications. For example, NDV estimation is the foundation of query optimization, and entropy estimation is the foundation of data compression. Among them, methods originating from statistics exhibit desirable theoretical guarantees but rely on specific assumptions about the distribution of data, resulting in poor performance in real-world applications. Learning-based methods, which use information from training data, are adaptable in the real world but often lack theoretical guarantees or explainability. In addition, a unified framework for estimating these frequency-based estimators with machine learning is lacking. Given the aforementioned challenges, it is natural to wonder if a unified framework with theoretical guarantees can be established for property estimation. The recent literature has presented theoretical studies that propose estimation frameworks based on polynomials. These studies also prove estimation errors with respect to the sample size. Motivated by the above polynomial estimation framework, we propose a learning-based estimation framework with polynomial approximation, which aims to learn the coefficients of the polynomial, providing theoretical guarantees to the learning framework. Through comprehensive experiments on both synthetic and real-world datasets for estimating various data properties like NDV, entropy, and power sum, our results show the superiority of our algorithms over previous estimators.
Funder
Hong Kong ITC ITF
National Natural Science Foundation of China
Beijing Natural Science Foundation
Hong Kong RGC GRF
Beijing Outstanding Young Scientist Program
Alibaba Innovative Research Program
Publisher
Association for Computing Machinery (ACM)
Reference55 articles.
1. 2020. Airlines Departure Delay. https://www.openml.org/d/42728.
2. 2020. Voter Registration Statistics. https://www.ncsbe.gov/results-data/voterregistration-data.
3. 2021. Synthetic data generator. https://github.com/wurenzhi/learned_ndv_estimator.git.
4. 2022. Github. https://github.com/wurenzhi/learned_ndv_estimator.git.
5. 2023. Source Code of Spark. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala.