Forester: Approximate Processing of an Imperative Procedure for Query-Time Exploratory Data Analysis in a Relational Database

Published:2024-02-14 Issue:4 Volume:13 Page:759
ISSN:2079-9292
Container-title:Electronics
language:en
Short-container-title:Electronics

Author:

Rahman Md Arif¹²,Lee Young-Koo¹

Affiliation:

1. Department of Computer Science and Engineering, Kyung Hee University, Global Campus, Yongin-si 17104, Republic of Korea

2. Department of Computer Science and Engineering, Jashore University of Science and Technology, Jessore 7408, Bangladesh

Abstract

Query-time Exploratory Data Analysis (qEDA) is an increasingly demanding part of the data analysis process that entails visually and quantitatively summarizing, comprehending, and interpreting the primary characteristics of a dataset. Imperative procedures are now popular for EDA in relational databases because they allow multiple dependent declarative queries to be combined with imperative logic. Because online analytical processing (OLAP) systems hold extremely large datasets, data scientists often need quick visualizations of the data, obtained through approximate processing of imperative procedures, before analyzing the data in its entirety. We identify a gap in existing techniques: they cannot sample dependent declarative statements and control logic at the same time, and they cannot perform multi-dependent, sampling-based approximate processing within the time permitted in qEDA. Traditional approximate query processing (AQP) uses tuple sampling to approximate a single query and allows queries to be executed over arbitrary random samples of tables. However, available AQP methods cannot derive a further representative sample of the data distribution for dependent statements, so they cannot estimate multiple dependent statements accurately and quickly. Meanwhile, sampling of control structures, such as loops and conditional statements, has been discussed separately, without regard to the imperative structure of the statements in a procedure. In this study, we propose Forester, a novel agile approximate processing method for imperative procedures. Forester performs imperative program-aware sampling that covers both statements and control regions (i.e., branches and loops) and processes them approximately within the time permitted in qEDA. Our method produces more targeted samples for each relation while maintaining the data and control flow of dependent queries and imperative logic. It determines all the conditions placed on a relation across all the statements, so that the sample is guaranteed to contain the data relevant to the dependent data distribution. Using a workload of multi-statement imperative procedures over the Transaction Processing Performance Council Decision Support (TPC-DS) database, our experiments demonstrate that Forester outperforms the existing system in sampling, producing minimal error and improving response time.
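The traditional single-query AQP that the abstract contrasts Forester with can be illustrated by a minimal sketch. This is not Forester's algorithm. It only shows uniform (Bernoulli) tuple sampling of one table and a scaled SUM estimate. The table `sales`, its columns, the 10% sampling rate, and the helper names `bernoulli_sample` and `approx_sum` are all assumptions made for illustration.

```python
import random


def bernoulli_sample(rows, rate, rng):
    """Keep each row independently with probability `rate`."""
    return [row for row in rows if rng.random() < rate]


def approx_sum(sample, rate, column):
    """Estimate SUM(column) over the full table by scaling the sample sum by 1/rate."""
    return sum(row[column] for row in sample) / rate


# Fixed seed so the run is reproducible.
rng = random.Random(42)

# Hypothetical table: 100,000 rows whose amounts cycle through 1..10.
sales = [{"id": i, "amount": (i % 10) + 1} for i in range(100_000)]

rate = 0.1  # assumed sampling rate
sample = bernoulli_sample(sales, rate, rng)

exact = sum(row["amount"] for row in sales)
estimate = approx_sum(sample, rate, "amount")
relative_error = abs(estimate - exact) / exact

print(f"exact={exact} estimate={estimate:.0f} relative_error={relative_error:.4f}")
```

A sample drawn this way is tuned to no particular statement. If a later statement in a procedure filters on a selective condition, few of the sampled rows may satisfy it, which is the kind of dependent-statement problem the abstract describes.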

Publisher

MDPI AG

Link

https://www.mdpi.com/2079-9292/13/4/759/pdf
