Affiliation:
1. School of Mathematical Science, Heilongjiang University, Harbin 150080, China
Abstract
Imbalanced data classification is gaining importance in data mining and machine learning. In fields such as medical diagnosis, information security, industry, and computer vision, the recall rate of the minority class requires special treatment. Because misclassifying even a few minority samples can cause serious losses in some real-world problems, this paper proposes a new strategy and algorithm, based on a cost-sensitive support vector machine, that raises the minority class recall rate to 1. The proposed modification applies a margin compensation that makes the margin lopsided, which lets the decision boundary drift. Once the boundary reaches a certain position, the classifier generalizes better to minority class samples and meets the requirement of a recall rate of 1. The experiments analyze the effects of different parameters on the performance of the algorithm and determine the optimal parameters for a recall rate of 1. The results show that, for imbalanced data classification, the traditional definite cost classification scheme and models selected by the area under the receiver operating characteristic curve (AUC) criterion rarely achieve a recall rate of 1. The new strategy yields a minority recall of 1 on imbalanced data, provided that some loss on the majority class is acceptable; moreover, it improves the G-means index. The proposed algorithm outperforms conventional methods in minority recall and has practical significance for credit card fraud detection, medical diagnosis, and other areas.
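The idea of letting the decision boundary drift until minority recall reaches 1 can be illustrated with a minimal sketch. This is not the paper's margin-compensation algorithm: it swaps in a simpler post-hoc offset added to already-computed decision scores, and the function names, the `eps` parameter, and the toy scores and labels are all hypothetical. The G-mean is taken here in its common form, the square root of minority recall times majority recall.

```python
import math

def recall_and_gmean(y_true, y_pred):
    """Return (minority recall, majority recall, G-mean); minority label is +1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return recall, specificity, math.sqrt(recall * specificity)

def shift_for_full_recall(scores, y_true, eps=1e-6):
    """Smallest non-negative offset that makes every minority score positive.

    Assumption: this offset stands in for the boundary drift described in
    the abstract; the paper obtains the drift by modifying the SVM margin.
    """
    min_pos = min(s for s, t in zip(scores, y_true) if t == 1)
    return max(0.0, -min_pos + eps)

def predict(scores, offset=0.0):
    """Label a sample +1 when its shifted decision score is positive."""
    return [1 if s + offset > 0 else -1 for s in scores]

# Hypothetical decision scores for 3 minority (+1) and 7 majority (-1) samples.
y_true = [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]
scores = [1.2, 0.4, -0.3, -0.1, -0.5, -0.8, -1.5, -2.0, 0.2, -1.1]

baseline = recall_and_gmean(y_true, predict(scores))
offset = shift_for_full_recall(scores, y_true)
shifted = recall_and_gmean(y_true, predict(scores, offset))
print("baseline (recall, specificity, G-mean):", baseline)
print("shifted  (recall, specificity, G-mean):", shifted)
```

In this toy case the shift trades some majority-class accuracy for full minority recall, which is the trade-off the abstract describes as acceptable; whether the G-mean rises or falls depends on the data.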
Funder
Heilongjiang Province Statistical Science Project