MisDetect: Iterative Mislabel Detection using Early Loss

Published:2024-02 Issue:6 Volume:17 Page:1159-1172
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Deng Yuhao¹, Chai Chengliang¹, Cao Lei², Tang Nan³, Wang Jiayi⁴, Fan Ju⁵, Yuan Ye¹, Wang Guoren¹

Affiliation:

1. Beijing Institute of Technology

2. University of Arizona/MIT

3. HKUST (GZ)

4. Tsinghua University

5. Renmin University of China

Abstract

Supervised machine learning (ML) models trained on data with mislabeled instances often produce inaccurate results due to label errors. Traditional methods of detecting mislabeled instances rely on data proximity, where an instance is considered mislabeled if its label is inconsistent with those of its neighbors. However, these methods often perform poorly, because an instance does not always share the same label as its neighbors. ML-based methods instead use trained models to differentiate between mislabeled and clean instances. However, these methods struggle to achieve high accuracy, since the models may have already overfitted to the mislabeled instances. In this paper, we propose a novel framework, MisDetect, that detects mislabeled instances during model training. MisDetect leverages the early loss observation to iteratively identify and remove mislabeled instances. In this process, influence-based verification is applied to enhance the detection accuracy. Moreover, MisDetect automatically determines when the early loss is no longer effective in detecting mislabels, at which point the iterative detection process should terminate. Finally, for the training instances whose status MisDetect is still uncertain about, it automatically produces pseudo labels to train a binary classification model and leverages the generalization ability of that model to determine whether they are mislabeled. Our experiments on 15 datasets show that MisDetect outperforms 10 baseline methods, demonstrating its effectiveness in detecting mislabeled instances.
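
The early-loss idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain NumPy logistic regression, replaces MisDetect's automatic termination with a fixed number of rounds, and omits influence-based verification and the final pseudo-label classifier. All function names, parameters, and the synthetic data are hypothetical.

```python
import numpy as np

def train_few_epochs(X, y, epochs=5, lr=0.5):
    """Briefly train a logistic regression (an assumed stand-in model)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y                      # dLoss/dz for binary cross-entropy
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def per_instance_loss(w, b, X, y):
    """Numerically stable binary cross-entropy for each instance."""
    z = X @ w + b
    return np.logaddexp(0.0, z) - y * z

def detect_mislabels(X, y, rounds=2, drop_frac=0.05, epochs=5):
    """Each round: train briefly, flag the highest-loss instances, remove them."""
    keep = np.arange(len(y))              # indices still treated as clean
    flagged = []
    for _ in range(rounds):               # fixed round count is an assumption
        w, b = train_few_epochs(X[keep], y[keep], epochs)
        loss = per_instance_loss(w, b, X[keep], y[keep])
        k = max(1, int(drop_frac * len(keep)))
        worst = np.argsort(loss)[-k:]     # positions of the k largest early losses
        flagged.extend(keep[worst].tolist())
        keep = np.delete(keep, worst)
    return np.array(flagged)

def make_noisy_blobs(n_per_class=200, n_flips=40, seed=0):
    """Two well-separated Gaussian blobs with some labels flipped (toy data)."""
    rng = np.random.default_rng(seed)
    X = np.vstack([rng.normal(-3.0, 1.0, (n_per_class, 2)),
                   rng.normal(3.0, 1.0, (n_per_class, 2))])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    flipped = rng.choice(len(y), size=n_flips, replace=False)
    y[flipped] = 1.0 - y[flipped]
    return X, y, flipped

X, y, flipped = make_noisy_blobs()
flagged = detect_mislabels(X, y)
precision = np.isin(flagged, flipped).mean()
print(f"flagged {len(flagged)} instances, precision {precision:.2f}")
```

On this toy data the flipped labels sit far from the decision boundary, so their loss after a few epochs is much larger than that of clean instances. Real datasets are less clean-cut, which is presumably why the framework adds verification and a stopping criterion.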

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.14778/3648160.3648161

Cited by 3 articles.

1. IDE: A System for Iterative Mislabel Detection;Companion of the 2024 International Conference on Management of Data;2024-06-09

2. LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes;Proceedings of the VLDB Endowment;2024-04

3. Outlier Summarization via Human Interpretable Rules;Proceedings of the VLDB Endowment;2024-03