
Auto-pipeline

Published:2021-07 Issue:11 Volume:14 Page:2563-2575
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Yang Junwen¹, He Yeye², Chaudhuri Surajit²

Affiliation:

1. University of Chicago

2. Microsoft Research

Abstract

Recent work has made significant progress in helping users automate single data preparation steps, such as string transformations and table-manipulation operators (e.g., Join, GroupBy, Pivot). In this work, we propose to automate multiple such steps end-to-end by synthesizing complex data pipelines that combine string transformations and table-manipulation operators. We propose a novel by-target paradigm that lets users easily specify the desired pipeline, a significant departure from the traditional by-example paradigm. With by-target, users provide input tables (e.g., CSV or JSON files) and point to a "target table" (e.g., an existing database table or BI dashboard) that demonstrates what the output of the desired pipeline should schematically look like. Although the problem is seemingly under-specified, our key insight is that implicit table constraints, such as functional dependencies (FDs) and keys, can be exploited to significantly constrain the search space and make the problem tractable. We develop the AUTO-PIPELINE system, which learns to synthesize pipelines using deep reinforcement learning (DRL) and search. Experiments on a benchmark of 700 real pipelines crawled from GitHub and commercial vendors suggest that AUTO-PIPELINE can successfully synthesize around 70% of complex pipelines with up to 10 steps.
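The role of implicit constraints can be made concrete with a small sketch. The Python below is not the authors' implementation; it only illustrates, under stated assumptions, how a key and an FD observed on a target table could be used to accept or reject the output of a candidate pipeline. All names (`is_key`, `holds_fd`, `satisfies_target_constraints`) and the sample column names and rows are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def is_key(rows: List[Row], cols: Sequence[str]) -> bool:
    """Return True if the combination of `cols` is unique across all rows."""
    seen = set()
    for row in rows:
        value = tuple(row[c] for c in cols)
        if value in seen:
            return False
        seen.add(value)
    return True


def holds_fd(rows: List[Row], lhs: Sequence[str], rhs: str) -> bool:
    """Return True if the functional dependency lhs -> rhs holds in `rows`."""
    mapping = {}
    for row in rows:
        left = tuple(row[c] for c in lhs)
        if left in mapping and mapping[left] != row[rhs]:
            return False
        mapping[left] = row[rhs]
    return True


def satisfies_target_constraints(
    rows: List[Row],
    keys: Sequence[Sequence[str]],
    fds: Sequence[Tuple[Sequence[str], str]],
) -> bool:
    """Check a candidate pipeline output against the target's keys and FDs."""
    return all(is_key(rows, k) for k in keys) and all(
        holds_fd(rows, lhs, rhs) for lhs, rhs in fds
    )


# Hypothetical constraints assumed to hold on the target table.
target_keys = [("order_id",)]
target_fds = [(("customer_id",), "customer_name")]

# Hypothetical output of a candidate pipeline that respects both constraints.
good_candidate = [
    {"order_id": 1, "customer_id": "c1", "customer_name": "Ada"},
    {"order_id": 2, "customer_id": "c1", "customer_name": "Ada"},
    {"order_id": 3, "customer_id": "c2", "customer_name": "Bo"},
]

# Hypothetical output where a join duplicated order_id 1 and broke the FD.
bad_candidate = [
    {"order_id": 1, "customer_id": "c1", "customer_name": "Ada"},
    {"order_id": 1, "customer_id": "c1", "customer_name": "Eve"},
]
```

A search procedure could use a check of this kind to prune candidate pipelines whose output violates the target's constraints; how AUTO-PIPELINE actually combines such checks with DRL and search is described in the paper, not here.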

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3476249.3476303

Cited by 14 articles.

1. Auto-Tables: Relationalize Tables without Using Examples;ACM SIGMOD Record;2024-05-14

2. Higher-Order SQL Lambda Functions;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13

3. DATALORE: Can a Large Language Model Find All Lost Scrolls in a Data Repository?;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13

4. Gen-T: Table Reclamation in Data Lakes;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13

5. KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13