DeSQL: Interactive Debugging of SQL in Data-Intensive Scalable Computing-Reference-Cited by-同舟云学术

DeSQL: Interactive Debugging of SQL in Data-Intensive Scalable Computing

Published:2024-07-12 Issue:FSE Volume:1 Page:767-788
ISSN:2994-970X
Container-title:Proceedings of the ACM on Software Engineering
language:en
Short-container-title:Proc. ACM Softw. Eng.

Author:

Haroon Sabaat¹^ORCID,Brown Chris¹^ORCID,Gulzar Muhammad Ali¹^ORCID

Affiliation:

1. Virginia Tech, Blacksburg, USA

Abstract

SQL is the most commonly used front-end language for data-intensive scalable computing (DISC) applications due to its broad presence in new and legacy workflows and shallow learning curve. However, DISC-backed SQL introduces several layers of abstraction that significantly reduce the visibility and transparency of workflows, making it challenging for developers to find and fix errors in a query. When a query returns incorrect outputs, it takes a non-trivial effort to comprehend every stage of the query execution and find the root cause among the input data and complex SQL query. We aim to bring the benefits of step-through interactive debugging to DISC-powered SQL with DeSQL. Due to the declarative nature of SQL, there are no ordered atomic statements to place a breakpoint to monitor the flow of data. DeSQL’s automated query decomposition breaks a SQL query into its constituent sub queries, offering natural locations for setting breakpoints and monitoring intermediate data. However, due to advanced query optimization and translation in DISC systems, a user query rarely matches the physical execution, making it challenging to associate subqueries with their intermediate data. DeSQL performs fine-grained taint analysis to dynamically map the subqueries to their intermediate data, while also recognizing subqueries removed by the optimizers. For such subqueries, DeSQL efficiently regenerates the intermediate data from a nearby subquery’s data. On the popular TPC-DC benchmark, DeSQL provides a complete debugging view in 13% less time than the original job time while incurring an average overhead of 10% in addition to retaining Apache Spark’s scalability. In a user study comprising 15 participants engaged in two debugging tasks, we find that participants utilizing DeSQL identify the root cause behind a wrong query output in 74% less time than the de-facto, manual debugging.

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3643761

Reference41 articles.

1. DataPlay

2. Spark SQL

3. Debugging data flows in reactive programs

4. Making Sense of Temporal Queries with Interactive Visualization

5. DBNotes