Affiliation:
1. BIFOLD & TU Berlin
2. Cornell University
3. IIT Delhi
4. BIFOLD, TU Berlin & DFKI
Abstract
Efficient data discovery is crucial in the era of data-driven decision-making. However, current practices face significant challenges due to the intricacies of identifying datasets with specific distributional characteristics, such as percentiles, when data repositories are decentralized. Traditional keyword-based search methods are insufficient for these complex requirements, often resulting in suboptimal dataset search results. To address these challenges, this paper presents Fainder, a fast and accurate index for "percentile predicates" on histogram-based data summaries, which streamlines the search process for datasets with specific distributional requirements. Fainder can be constructed on heterogeneous histogram collections and employs binary search in conjunction with multi-step pruning techniques to efficiently identify search results for percentile predicates. Thereby, it simplifies data provisioning and improves the effectiveness of dataset discovery. Empirical evaluation of our solution on three large-scale data repositories shows that Fainder is effective for distribution-aware dataset search and provides order-of-magnitude efficiency gains over baselines.
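To make the notion of a percentile predicate on a histogram concrete, the following is a minimal sketch of how a single histogram could be checked against a predicate of the form "at least a fraction p of the values are at most v". It is not Fainder's index; it only illustrates the per-histogram evaluation that such an index would need to avoid repeating for every dataset. The function names, the three-valued result, and the example histogram are assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def fraction_below_bounds(edges, counts, v):
    """Bound the fraction of values <= v using only a histogram.

    edges:  sorted bin edges, len(edges) == len(counts) + 1 (hypothetical layout)
    counts: number of values in each bin
    Returns (lower, upper); the true fraction lies somewhere in between.
    """
    total = sum(counts)
    prefix = [0, *accumulate(counts)]  # prefix[i] = values in the first i bins
    k = bisect_right(edges, v)         # binary search: number of edges <= v
    if k == 0:                         # v is left of the whole histogram
        return 0.0, 0.0
    if k == len(edges):                # v is at or beyond the last edge
        return 1.0, 1.0
    # Bins 0..k-2 lie entirely at or below v; bin k-1 may straddle v.
    return prefix[k - 1] / total, prefix[k] / total


def check_percentile_predicate(edges, counts, p, v):
    """Return 'yes', 'no', or 'maybe' for: fraction of values <= v is >= p."""
    lower, upper = fraction_below_bounds(edges, counts, v)
    if lower >= p:
        return "yes"    # satisfied even in the worst case
    if upper < p:
        return "no"     # cannot be satisfied even in the best case
    return "maybe"      # the histogram alone cannot decide


# Illustrative histogram: bins [0,10), [10,20), [20,30] with 5, 3, 2 values.
edges, counts = [0, 10, 20, 30], [5, 3, 2]
print(check_percentile_predicate(edges, counts, 0.5, 15))  # yes
print(check_percentile_predicate(edges, counts, 0.7, 15))  # maybe
print(check_percentile_predicate(edges, counts, 0.9, 15))  # no
```

The "maybe" case arises because a histogram does not record where values fall inside a bin, so a predicate whose threshold cuts through a bin can only be bounded, not decided, from the summary alone.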
Publisher
Association for Computing Machinery (ACM)