SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications

Authors:

Siddiqi Shafaq 1, Kern Roman 1, Boehm Matthias 2

Affiliations:

1. Graz University of Technology, Graz, Austria

2. Technische Universität Berlin, Berlin, Germany

Abstract

In the exploratory data science lifecycle, data scientists often spend the majority of their time finding, integrating, validating, and cleaning relevant datasets. Despite recent work on data validation and numerous error detection and correction algorithms, data cleaning for ML remains in practice a largely manual, unpleasant, and labor-intensive trial-and-error process, especially in large-scale, distributed computation. The target ML application (such as a classification or regression model) can, however, serve as a valuable feedback signal for selecting effective data cleaning strategies. In this paper, we introduce SAGA, a framework for automatically generating the top-K most effective data cleaning pipelines. SAGA adopts ideas from Auto-ML, feature selection, and hyper-parameter tuning. Our framework is extensible for user-provided constraints, new data cleaning primitives, and ML applications; automatically generates hybrid runtime plans of local and distributed operations; and performs pruning by interesting properties (e.g., monotonicity). Instead of full automation, which is rather unrealistic, SAGA simplifies the mechanical aspects of data cleaning. Our experiments show that SAGA yields robust accuracy improvements over the state of the art, and good scalability with increasing data sizes and numbers of evaluated pipelines.

Publisher

Association for Computing Machinery (ACM)

