Identifying Mislabeled Training Data

Published: 1999-08-01 Volume: 11 Pages: 131-167
ISSN: 1076-9757
Container-title: Journal of Artificial Intelligence Research
Short-container-title: jair

Author:

Brodley C. E., Friedl M. A.

Abstract

This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning. The goal of this approach is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate single algorithm, majority vote and consensus filters on five datasets that are prone to labeling errors. Our experiments illustrate that filtering significantly improves classification accuracy for noise levels up to 30 percent. An analytical and empirical evaluation of the precision of our approach shows that consensus filters are conservative at throwing away good data at the expense of retaining bad data and that majority filters are better at detecting bad data at the expense of throwing away good data. This suggests that for situations in which there is a paucity of data, consensus filters are preferable, whereas majority vote filters are preferable for situations with an abundance of data.
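
The filtering idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each instance is judged by classifiers trained on folds that exclude it, that a majority filter flags an instance when more than half of the classifiers disagree with its label, and that a consensus filter flags it only when all of them disagree. The base learners (two k-nearest-neighbour variants and a nearest-centroid classifier), the function names, and the toy data are hypothetical stand-ins chosen to keep the example dependency-free.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def knn_learner(k):
    """Return a trainer that builds a k-nearest-neighbour classifier."""
    def train(X, y):
        def predict(x):
            nearest = sorted(range(len(X)), key=lambda i: sq_dist(X[i], x))[:k]
            return Counter(y[i] for i in nearest).most_common(1)[0][0]
        return predict
    return train

def centroid_learner(X, y):
    """Train a nearest-centroid classifier."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                 for c, rows in groups.items()}
    return lambda x: min(centroids, key=lambda c: sq_dist(centroids[c], x))

def filter_mislabeled(X, y, learners, n_folds=4, scheme="majority"):
    """Return indices of instances flagged as mislabeled (assumed CV scheme)."""
    n = len(X)
    errors = [0] * n  # number of classifiers that disagree with each label
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        for learn in learners:
            predict = learn(X_tr, y_tr)
            for i in held_out:
                if predict(X[i]) != y[i]:
                    errors[i] += 1
    m = len(learners)
    if scheme == "consensus":
        # Conservative: flag only when every classifier disagrees.
        return [i for i in range(n) if errors[i] == m]
    # Majority: flag when more than half of the classifiers disagree.
    return [i for i in range(n) if errors[i] > m / 2]

# Toy data: two well-separated clusters; index 4 carries a flipped label.
X = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 1.2), (3, 1),
     (10, 10), (10, 12), (12, 10), (12, 12), (11, 11), (13, 11)]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
learners = [knn_learner(1), knn_learner(3), centroid_learner]

flagged_majority = filter_mislabeled(X, y, learners, scheme="majority")
flagged_consensus = filter_mislabeled(X, y, learners, scheme="consensus")
print(flagged_majority, flagged_consensus)
```

On this toy data the 1-nearest-neighbour learner alone also misjudges two clean neighbours of the flipped point, which hints at why the abstract treats a single-algorithm filter separately from the voting schemes: requiring agreement among several learners discards fewer good instances.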

Publisher

AI Access Foundation

Subject

Artificial Intelligence

Cited by 568 articles.

1. A histogram SMOTE-based sampling algorithm with incremental learning for imbalanced data classification;Information Sciences;2025-01

2. Advanced EOR screening methodology based on LightGBM and random forest: A classification problem with imbalanced data;The Canadian Journal of Chemical Engineering;2024-08-08

3. Label noise correction for crowdsourcing using dynamic resampling;Engineering Applications of Artificial Intelligence;2024-07

4. On the influence of metric learning loss functions for robust self-supervised speaker verification to label noise;2024 IEEE Conference on Artificial Intelligence (CAI);2024-06-25

5. Application of Natural Language Processing and Genetic Algorithm to Fine-Tune Hyperparameters of Classifiers for Economic Activities Analysis;Big Data and Cognitive Computing;2024-06-13