Abstract
AbstractMachine learning (ML) is an approach driven by data, and as research in machine learning progresses, the issue of noisy labels has garnered widespread attention. Noisy labels can significantly reduce the accuracy of supervised classification models, making it important to address this problem. Therefore, it is a very meaningful task to detect as many noisy labels as possible from the big data. In this study, a new method is proposed for detecting noisy labels in datasets. This method leverages a deep pre-trained network to extract a feature set from the image data first which can extract more accurate data features. Then, a membership degree based on tightness into the support vector data description (SVDD) model named TF-SVDD is introduced to detect noisy data in the dataset. In order to simulate different types of label noise more accurately, we first assumed that the labels of the datasets used were all correct, and in addition constructed the noise set using two method: the density peak noise set and the random noise set. Experimental results demonstrate that the TF-SVDD can effectively detect noisy label data, surpassing traditional support vector data description algorithms and other methods in terms of outlier detection accuracy, with the average accuracy mostly exceeding 50$$\%$$
%
, and even reaching 80$$\%$$
%
. Furthermore, one novel measure called ‘confidence’ is employed to rectify noisy labels in the data. Following the correction of noisy labels, the accuracy of image classification experiences a significant improvement, with the average promotion ratio mostly exceeding 10$$\%$$
%
, and reaching 30$$\%$$
%
.
Funder
National Natural Science Foundation of China
Natural Science Basic Research Program of Shaanxi Province
Fundamental Research Funds for the Central Universities
Publisher
Springer Science and Business Media LLC