Data Validation Utilizing Expert Knowledge and Shape Constraints

Author:

Bachinger Florian (1), Ehrlinger Lisa (2), Kronberger Gabriel (3), Wöss Wolfram (4)

Affiliation:

1. Heuristic and Evolutionary Algorithms Laboratory, University of Applied Sciences Upper Austria, Hagenberg, Austria and Institute for Application-oriented Knowledge Processing (FAW), Johannes Kepler University Linz, Linz, Austria

2. Institute for Application-oriented Knowledge Processing, Johannes Kepler University Linz, Linz, Austria and Data Analysis Team, Software Competence Center Hagenberg GmbH, Hagenberg, Austria

3. Heuristic and Evolutionary Algorithms Laboratory, University of Applied Sciences Upper Austria, Hagenberg, Austria

4. Institute for Application-oriented Knowledge Processing, Johannes Kepler University Linz, Linz, Austria

Abstract

Data validation is a primary concern in any data-driven application, as undetected data errors may negatively affect machine learning models and lead to suboptimal decisions. Data quality issues are usually detected manually by experts, which becomes infeasible and uneconomical for large volumes of data. To enable automated data validation, we propose “shape constraint-based data validation,” a novel approach based on machine learning that incorporates expert knowledge in the form of shape constraints. Shape constraints can be used to describe expected (multivariate and nonlinear) patterns in valid data and enable the detection of invalid data that deviates from these expected patterns. Our approach can be divided into two steps: (1) shape-constrained prediction models are trained on data, and (2) their training error is analyzed to identify invalid data. The training error can be used as an indicator for invalid data because shape-constrained models can fit valid data better than invalid data. We evaluate the approach on a benchmark suite consisting of synthetic datasets, which we have published for benchmarking similar data validation approaches. Additionally, we demonstrate the capabilities of the proposed approach with a real-world dataset consisting of measurements from a friction test bench in an industrial setting. Our approach detects subtle data errors that are difficult to identify even for domain experts.
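
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in isotonic regression (fitted with the pool-adjacent-violators algorithm) as a simple stand-in for a shape-constrained model, using a single "non-decreasing" constraint. The function names, the example series, and the threshold of 0.1 are all hypothetical and chosen only for illustration.

```python
def isotonic_fit(y):
    """Least-squares non-decreasing fit via pool-adjacent-violators.

    Stand-in for a shape-constrained model: the only constraint is
    monotonicity (assumption made for this sketch).
    """
    blocks = []  # each block holds [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge neighbouring blocks while their means violate the constraint.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit


def training_rmse(y):
    """Step 2: training error of the constrained fit on its own data."""
    fit = isotonic_fit(y)
    return (sum((a - b) ** 2 for a, b in zip(y, fit)) / len(y)) ** 0.5


# Hypothetical measurements that experts expect to be non-decreasing.
valid = [0.0, 0.55, 0.5, 1.4, 2.0, 2.6, 3.1]    # small noise only
invalid = [0.0, 0.5, 1.1, 3.0, 0.4, 2.6, 3.1]   # injected fault

THRESHOLD = 0.1  # hypothetical; would need tuning per application

for name, series in (("valid", valid), ("invalid", invalid)):
    err = training_rmse(series)
    label = "flagged as invalid" if err > THRESHOLD else "accepted"
    print(f"{name}: RMSE={err:.3f} -> {label}")
```

The constrained model can follow the valid series almost exactly, whereas the injected fault forces a large residual, which is the effect the approach relies on. A real application would likely use richer constraints (for example, multivariate monotonicity or convexity) and a data-driven threshold.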

Funder

BMK, BMAW, and the State of Upper Austria in the frame of the SCCH competence center INTEGRATE

FFG COMET Competence Centers for Excellent Technologies Programme

Josef Ressel Center for Symbolic Regression by the Christian Doppler Research Association

Publisher

Association for Computing Machinery (ACM)
