PoBery: Possibly-complete Big Data Queries with Probabilistic Data Placement and Scanning

Published:2021-08-31 Issue:3 Volume:2 Page:1-28
ISSN:2691-1922
Container-title:ACM/IMS Transactions on Data Science
language:en
Short-container-title:ACM/IMS Trans. Data Sci.

Author:

Song Jie¹, He Qiang², Chen Feifei³, Yuan Ye¹, Yu Ge¹

Affiliation:

1. Northeastern University, Shenyang, Liaoning Province, China

2. Swinburne University of Technology, Hawthorn, Victoria, Australia

3. Deakin University, Docklands, Victoria, Australia

Abstract

In big data query processing, there is a trade-off between query accuracy and query efficiency; for example, sampling-based query approaches trade off query completeness for efficiency. In this article, we argue that query performance can be significantly improved by slightly lowering the possibility of query completeness, that is, the chance that a query is complete. To quantify this possibility, we define a new concept, Probability of query Completeness (hereinafter PC). For example, if a query is executed 100 times, PC = 0.95 guarantees that there are no more than 5 incomplete results among the 100 results. Leveraging probabilistic data placement and scanning, we trade off PC for query performance. We propose PoBery (POssibly-complete Big data quERY), a method that supports neither complete queries nor incomplete queries, but possibly-complete queries. Experimental results on HiBench show that PoBery can significantly accelerate queries while ensuring the PC. Specifically, it is guaranteed that the percentage of complete queries is larger than the given PC confidence. Through comparison with state-of-the-art key-value stores, we show that while Drill-based PoBery performs as fast as Drill on complete queries, it is 1.7×, 1.1×, and 1.5× faster on average than Drill, Impala, and Hive, respectively, on possibly-complete queries.
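
The trade-off between PC and scanning cost can be illustrated with a toy model. The sketch below is not PoBery's actual placement or scanning algorithm, which the abstract does not specify. It assumes, purely for illustration, that data is spread over `total_blocks` equally sized blocks, that the matching records sit in `relevant_blocks` of them, and that a query scans `scanned_blocks` blocks chosen uniformly at random. All function and parameter names are hypothetical.

```python
import random
from math import comb


def exact_pc(total_blocks, relevant_blocks, scanned_blocks):
    """Toy-model PC: chance that a uniformly random set of scanned blocks
    contains every block that holds matching records (assumed model)."""
    if scanned_blocks < relevant_blocks:
        return 0.0
    # Ways to pick the remaining scanned blocks once all relevant ones are
    # included, divided by all ways to pick the scanned set.
    favourable = comb(total_blocks - relevant_blocks,
                      scanned_blocks - relevant_blocks)
    return favourable / comb(total_blocks, scanned_blocks)


def empirical_pc(total_blocks, relevant_blocks, scanned_blocks, runs, seed=0):
    """Fraction of simulated query runs that return a complete result."""
    rng = random.Random(seed)
    relevant = set(range(relevant_blocks))  # which blocks match is arbitrary
    complete = 0
    for _ in range(runs):
        scanned = set(rng.sample(range(total_blocks), scanned_blocks))
        if relevant <= scanned:
            complete += 1
    return complete / runs


def min_blocks_for_pc(total_blocks, relevant_blocks, target_pc):
    """Smallest number of blocks to scan so the toy-model PC meets the target."""
    for scanned in range(relevant_blocks, total_blocks + 1):
        if exact_pc(total_blocks, relevant_blocks, scanned) >= target_pc:
            return scanned
    return total_blocks


if __name__ == "__main__":
    print(exact_pc(20, 1, 19))
    print(empirical_pc(20, 1, 19, runs=5000))
    print(min_blocks_for_pc(20, 1, 0.94))
```

In this toy model, scanning fewer blocks lowers the cost per query but also lowers the PC, which is the kind of exchange the abstract describes. The empirical estimate is the fraction of repeated runs that come back complete.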

Funder

National Natural Science Foundation of China

Natural Science Foundation of Liaoning Province

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3465375

Cited by 1 article.

1. L/STIM: A Framework for Detecting Multi-Stage Cyber Attacks;2024 International Russian Smart Industry Conference (SmartIndustryCon);2024-03-25