Affiliation:
1. Multiscale Networked Systems (MNS), Universiteit van Amsterdam, Amsterdam, Netherlands
2. Multiscale Networked Systems (MNS), Universiteit van Amsterdam, Amsterdam, Netherlands and LifeWatch ERIC Virtual Lab & Innovation Center (VLIC), Amsterdam, Netherlands
Abstract
Data quality plays a vital role in scientific research and decision-making across industries. It is therefore crucial to incorporate a data quality control (DQC) process, which comprises various actions and operations to detect and correct data errors. The increasing adoption of machine learning (ML) techniques in different domains has raised concerns about data quality in the ML field. Conversely, ML’s capability to uncover complex patterns makes it suitable for addressing challenges in the DQC process. However, supervised learning methods demand abundant labeled data, while unsupervised learning methods rely heavily on the underlying distribution of the data. Active learning (AL) provides a promising solution by proactively selecting data points for inspection, thus reducing the labeling burden on domain experts. This survey therefore focuses on applying AL to DQC. We start with a review of common data quality issues and solutions in the ML field to enhance the understanding of current quality assessment methods. We then present two scenarios, pool-based and stream-based, to illustrate the adoption of AL in DQC systems for the anomaly detection task. Finally, we discuss the remaining challenges and research opportunities in this field.
Funder
European Union’s Horizon research and innovation program via the CLARIFY
BLUECLOUD 2026
ENVRI-FAIR
ENVRI-Hub Next
EVERSE
BioDT
Dutch research council via the LTER-LIFE project
Publisher
Association for Computing Machinery (ACM)