Affiliation:
1. The United Institute of Informatics Problems of the National Academy of Sciences of Belarus
Abstract
When classifiers are applied in real applications, data imbalance often occurs: the number of elements in one class is greater than in another. The article examines estimates of classification results for this type of data. The paper answers three questions: which term is the more accurate translation of the phrase "confusion matrix", how data is best represented in this matrix, and which functions are better suited to evaluating classification results from such a matrix. The paper demonstrates on real data that the popular accuracy function cannot correctly estimate classification errors for imbalanced data. Nor can values of this function be compared when one is calculated from a matrix of absolute classification counts and the other from a matrix normalized by class. If the data is imbalanced, the accuracy calculated from the confusion matrix with normalized values will usually be lower, since it is calculated by a different formula. The same conclusion holds for most of the classification accuracy functions used in the literature to estimate classification results. It is shown that confusion matrices are better represented with absolute counts of objects per class rather than relative values, since the counts give an idea of the amount of data tested for each class and of its imbalance. When constructing classifiers, it is recommended to evaluate errors with functions that do not depend on data imbalance, which gives grounds to expect more correct classification results on real data.
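The gap between the two accuracy values can be illustrated with a minimal sketch. It is not taken from the paper: the counts are hypothetical, rows are assumed to be true classes and columns predicted classes, and every class is assumed to have at least one sample. On a matrix normalized by class, accuracy reduces to the mean of the per-class recalls, which is a different formula from the share of correct predictions computed on absolute counts.

```python
def accuracy_from_counts(cm):
    """Accuracy on absolute counts: correct predictions / all predictions."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def normalize_by_class(cm):
    """Divide each row (true class) by its total so that every row sums to 1."""
    return [[value / sum(row) for value in row] for row in cm]


def accuracy_from_normalized(cm_norm):
    """Accuracy on a class-normalized matrix: the mean of the diagonal."""
    return sum(cm_norm[i][i] for i in range(len(cm_norm))) / len(cm_norm)


# Hypothetical imbalanced two-class result: 1000 objects of class 0, 50 of class 1.
cm = [
    [950, 50],  # true class 0: 950 correct, 50 misclassified
    [30, 20],   # true class 1: 30 misclassified, 20 correct
]

acc_abs = accuracy_from_counts(cm)
acc_norm = accuracy_from_normalized(normalize_by_class(cm))

print(f"accuracy from absolute counts:   {acc_abs:.3f}")
print(f"accuracy from normalized matrix: {acc_norm:.3f}")
```

In this made-up case the majority class dominates the count-based value (970 correct out of 1050), while the normalized value weights both classes equally (the mean of 0.95 and 0.40), so the two numbers are not comparable.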
Publisher
United Institute of Informatics Problems of the National Academy of Sciences of Belarus
Subject
General Earth and Planetary Sciences,General Environmental Science
Cited by
2 articles.