HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation

Authors:

Sibei Chen¹, Nan Tang², Ju Fan¹, Xuemi Yan¹, Chengliang Chai³, Guoliang Li⁴, Xiaoyong Du¹

Affiliations:

1. Renmin University of China, Beijing, China

2. QCRI, Doha, Qatar

3. Beijing Institute of Technology, Beijing, China

4. Tsinghua University, Beijing, China

Abstract

Data preparation is crucial for achieving optimized results in machine learning (ML). However, building a good data preparation pipeline is highly non-trivial for ML practitioners, because it is not only domain-specific but also dataset-specific. There are two common practices. Human-generated pipelines (HI-pipelines) typically draw on a wide range of operations and libraries, but are highly experience- and heuristic-based. In contrast, machine-generated pipelines (AI-pipelines), a.k.a. AutoML, often adopt a predefined set of sophisticated operations and are search-based and optimized. These two practices are mutually complementary. In this paper, we study a new problem: given an HI-pipeline and an AI-pipeline for the same ML task, can we combine them into a new pipeline (HAI-pipeline) that is better than both the provided HI-pipeline and AI-pipeline? We propose HAIPipe, a framework that addresses this problem by adopting an enumeration-sampling strategy to carefully select the best-performing combined pipeline. We also introduce a reinforcement learning (RL) based approach to search for an optimized AI-pipeline. Extensive experiments using 1400+ real-world HI-pipelines (Jupyter notebooks from Kaggle) verify that HAIPipe significantly outperforms approaches that use either HI-pipelines or AI-pipelines alone.
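To make the combination idea concrete, the following is a minimal, hypothetical sketch of the intuition behind enumerating combined pipelines, not the paper's actual algorithm or operator search space: each data-preparation step is taken either from the human pipeline or from the machine pipeline, every combination is evaluated, and the one scoring best on held-out validation data is kept. The dataset, operators, and model (scikit-learn's SimpleImputer, scalers, and LogisticRegression) are illustrative assumptions.

# Hypothetical sketch: combine preparation steps from a human (HI) and a
# machine (AI) pipeline and keep the best-scoring combination. Illustrative
# only; HAIPipe's real operator space and sampling strategy differ.
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Stand-in data-preparation operators drawn from an HI-pipeline and an AI-pipeline.
hi_ops = {"impute": SimpleImputer(strategy="mean"), "scale": MinMaxScaler()}
ai_ops = {"impute": SimpleImputer(strategy="median"), "scale": StandardScaler()}

best_score, best_choice = -1.0, None
# Enumerate every way of picking each preparation step from either source
# (2^2 = 4 combinations here; sampling would replace full enumeration when
# the space of combined pipelines is large).
for choice in product(["hi", "ai"], repeat=2):
    steps = [
        ("impute", (hi_ops if choice[0] == "hi" else ai_ops)["impute"]),
        ("scale", (hi_ops if choice[1] == "hi" else ai_ops)["scale"]),
        ("model", LogisticRegression(max_iter=1000)),
    ]
    # Score the combined pipeline on held-out validation data.
    score = Pipeline(steps).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_score, best_choice = score, choice

print(f"best (impute, scale) sources = {best_choice}, validation accuracy = {best_score:.3f}")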

Publisher

Association for Computing Machinery (ACM)


Cited by 6 articles.

1. ChatPipe: Orchestrating Data Preparation Pipelines by Optimizing Human-ChatGPT Interactions. Companion of the 2024 International Conference on Management of Data, 2024-06-09.

2. Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations. Proceedings of the ACM on Management of Data, 2024-05-29.

3. Efficient Relaxed Functional Dependency Discovery with Minimal Set Cover. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2024-05-13.

4. Effective Entry-Wise Flow for Molecule Generation. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2024-05-13.

5. Towards automating microservices orchestration through data-driven evolutionary architectures. Service Oriented Computing and Applications, 2024-02-27.
