Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries

Authors:

Hu Pan (1), Motik Boris (2)

Affiliation:

1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China

2. Department of Computer Science, Oxford University, Oxford, United Kingdom

Abstract

Accurately estimating the cardinality (i.e., the number of answers) of complex queries plays a central role in database systems. This problem is particularly difficult in graph databases, where queries often involve a large number of joins and self-joins. Recently, Park et al. [55] surveyed seven state-of-the-art cardinality estimation approaches for graph queries. The results of their extensive empirical evaluation show that a sampling method based on the WanderJoin online aggregation algorithm [47] consistently offers superior accuracy. We extended the framework by Park et al. [55] with three additional datasets and repeated their experiments. Our results showed that WanderJoin is indeed very accurate, but it can often take a large number of samples and thus be very slow. Moreover, when queries are complex and data distributions are skewed, it often fails to find valid samples and estimates the cardinality as zero. Finally, complex graph queries often go beyond simple graph matching and involve arbitrary nesting of relational operators such as disjunction, difference, and duplicate elimination. None of the methods considered by Park et al. [55] is applicable to such queries. In this paper we present a novel approach for estimating the cardinality of complex graph queries. Our approach is inspired by WanderJoin, but, unlike all approaches known to us, it can process complex queries with arbitrary operator nesting. Our estimator is strongly consistent, meaning that the average of repeated estimates converges with probability one to the actual cardinality. We present optimisations of the basic algorithm that aim to reduce the chance of producing zero estimates and improve accuracy. We show empirically that our approach is both accurate and quick on complex queries and large datasets.
Finally, we discuss how to integrate our approach into a simple dynamic programming query planner, and we confirm empirically that our planner produces high-quality plans that can significantly reduce end-to-end query evaluation times.
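
To make the sampling idea in the abstract concrete, the following is a minimal sketch of a random-walk cardinality estimator in the general style of WanderJoin, restricted to a single two-hop path query over a directed edge list. It is not the algorithm proposed in the paper and does not handle disjunction, difference, or duplicate elimination. The graph `EDGES` and all function names are hypothetical and chosen only for illustration. Each walk picks a uniformly random first edge and then a uniformly random outgoing edge; a successful walk contributes the inverse of its sampling probability, and a walk that hits a dead end contributes zero, which is how an unlucky run can end up with a zero estimate.

```python
import random
from collections import defaultdict

# Hypothetical toy graph: directed edges (source, target).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]


def build_adjacency(edges):
    """Map each vertex to the list of its successors."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    return adjacency


def exact_two_hop_count(edges):
    """Exact number of answers to the path query (a)->(b)->(c)."""
    adjacency = build_adjacency(edges)
    return sum(len(adjacency.get(target, ())) for _, target in edges)


def sample_two_hop(edges, adjacency, rng):
    """One random walk; returns an unbiased single-sample estimate."""
    # Step 1: choose the first edge uniformly, probability 1/|E|.
    _, middle = rng.choice(edges)
    successors = adjacency.get(middle, ())
    if not successors:
        # Dead end: the walk found no valid sample and contributes zero.
        return 0.0
    # Step 2: choose the second edge uniformly, probability 1/deg(middle).
    rng.choice(successors)
    # Inverse of the walk's probability (1/|E|) * (1/deg(middle)).
    return float(len(edges) * len(successors))


def estimate_two_hop(edges, num_walks, seed=0):
    """Average the single-sample estimates over num_walks random walks."""
    rng = random.Random(seed)
    adjacency = build_adjacency(edges)
    total = sum(sample_two_hop(edges, adjacency, rng) for _ in range(num_walks))
    return total / num_walks
```

Calling `estimate_two_hop(EDGES, 20000)` and comparing the result with `exact_two_hop_count(EDGES)` shows the average of the per-walk estimates settling near the exact count as the number of walks grows, while a call with very few walks can return zero if every walk happens to reach the vertex with no outgoing edges.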

Publisher

Association for Computing Machinery (ACM)

References (77 articles)

1. D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. 2007. Scalable Semantic Web Data Management Using Vertical Partitioning. In Proc. of the 33rd Int. Conf. on Very Large Data Bases (VLDB 2007). VLDB Endowment, Vienna, Austria, 411–422.

2. A. Aboulnaga and S. Chaudhuri. 1999. Self-tuning Histograms: Building Histograms Without Looking at Data. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 1999). ACM, Philadelphia, PA, USA, 181–192.

3. S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. 1999. Join Synopses for Approximate Query Answering. In Proc. of the 1999 Int. Conf. on Management of Data (SIGMOD 1999). ACM Press, Philadelphia, PA, USA, 275–286.

4. G. Aluç, O. Hartig, M. Tamer Özsu, and K. Daudjee. 2014. Diversified Stress Testing of RDF Data Management Systems. In Proc. of the 13th Int. Semantic Web Conf. (ISWC 2014). Springer, Riva del Garda, Italy, 197–212.

5. Compressed vertical partitioning for efficient RDF management
