HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation

Authors:

Sibei Chen¹, Nan Tang², Ju Fan¹, Xuemi Yan¹, Chengliang Chai³, Guoliang Li⁴, Xiaoyong Du¹

Affiliations:

1. Renmin University of China, Beijing, China

2. QCRI, Doha, Qatar

3. Beijing Institute of Technology, Beijing, China

4. Tsinghua University, Beijing, China

Abstract

Data preparation is crucial for achieving optimized results in machine learning (ML). However, building a good data preparation pipeline is highly non-trivial for ML practitioners, because it is not only domain-specific but also dataset-specific. There are two common practices. Human-generated pipelines (HI-pipelines) typically draw on a wide range of operations and libraries, but are highly experience- and heuristic-based. In contrast, machine-generated pipelines (AI-pipelines), a.k.a. AutoML, often adopt a predefined set of sophisticated operations and are search-based and optimized. These two practices are mutually complementary. In this paper, we study a new problem: given an HI-pipeline and an AI-pipeline for the same ML task, can we combine them into a new pipeline (HAI-pipeline) that is better than both the provided HI-pipeline and AI-pipeline? We propose HAIPipe, a framework that addresses this problem by adopting an enumeration-sampling strategy to carefully select the best-performing combined pipeline. We also introduce a reinforcement learning (RL) based approach to search for an optimized AI-pipeline. Extensive experiments using 1400+ real-world HI-pipelines (Jupyter notebooks from Kaggle) verify that HAIPipe significantly outperforms approaches that use either HI-pipelines or AI-pipelines alone.
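To make the combination idea concrete, the following is a minimal, hypothetical sketch of the intuition behind enumerating combined pipelines, not the paper's actual algorithm or operator search space: each data-preparation step is taken either from the human pipeline or from the machine pipeline, every combination is evaluated, and the one scoring best on held-out validation data is kept. The dataset, operators, and model (scikit-learn's SimpleImputer, scalers, and LogisticRegression) are illustrative assumptions.

# Hypothetical sketch: combine preparation steps from a human (HI) and a
# machine (AI) pipeline and keep the best-scoring combination. Illustrative
# only; HAIPipe's real operator space and sampling strategy differ.
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Stand-in data-preparation operators drawn from an HI-pipeline and an AI-pipeline.
hi_ops = {"impute": SimpleImputer(strategy="mean"), "scale": MinMaxScaler()}
ai_ops = {"impute": SimpleImputer(strategy="median"), "scale": StandardScaler()}

best_score, best_choice = -1.0, None
# Enumerate every way of picking each preparation step from either source
# (2^2 = 4 combinations here; sampling would replace full enumeration when
# the space of combined pipelines is large).
for choice in product(["hi", "ai"], repeat=2):
    steps = [
        ("impute", (hi_ops if choice[0] == "hi" else ai_ops)["impute"]),
        ("scale", (hi_ops if choice[1] == "hi" else ai_ops)["scale"]),
        ("model", LogisticRegression(max_iter=1000)),
    ]
    # Score the combined pipeline on held-out validation data.
    score = Pipeline(steps).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_score, best_choice = score, choice

print(f"best (impute, scale) sources = {best_choice}, validation accuracy = {best_score:.3f}")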

Publisher

Association for Computing Machinery (ACM)


Cited by 6 articles.

1. ChatPipe: Orchestrating Data Preparation Pipelines by Optimizing Human-ChatGPT Interactions. Companion of the 2024 International Conference on Management of Data, 2024-06-09.

2. Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations. Proceedings of the ACM on Management of Data, 2024-05-29.

3. Efficient Relaxed Functional Dependency Discovery with Minimal Set Cover. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2024-05-13.

4. Effective Entry-Wise Flow for Molecule Generation. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2024-05-13.

5. Towards automating microservices orchestration through data-driven evolutionary architectures. Service Oriented Computing and Applications, 2024-02-27.
