Abstract
The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our algorithm on 14 publicly available imbalanced datasets using three different Machine Learning (ML) algorithms and compared the performance of LoRAS with that of SMOTE and several SMOTE extensions that share with LoRAS the concept of using convex combinations of minority class data points for oversampling. We observed that LoRAS, on average, generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the SMOTE extensions we tested improve the F1-Score with respect to SMOTE on average, they compromise on the Balanced accuracy of a classification model. LoRAS, on the contrary, improves both the F1-Score and the Balanced accuracy, and thus produces better classification models. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that the LoRAS oversampling technique provides a better estimate for the mean of the underlying local data distribution of the minority class data space.
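The contrast between the two oversampling ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no algorithmic details, so the function names, the neighbourhood size `k`, the Gaussian noise scale `sigma`, the shadowsample counts, and the use of Dirichlet-distributed weights are all assumptions made for illustration. The first function draws a SMOTE-style point on the segment between a minority point and one of its nearest minority neighbours. The second draws a point as a random combination (positive weights summing to one) of several noisy copies, here called shadowsamples, taken from one local neighbourhood.

```python
import numpy as np

def smote_like_sample(minority, k, rng):
    """Illustrative SMOTE-style point: a convex combination of a minority
    point and one of its k nearest minority neighbours."""
    i = rng.integers(len(minority))
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

def shadowsample_like_sample(minority, k, n_shadow, n_affine, sigma, rng):
    """Illustrative shadowsampling-style point (all parameters are assumed):
    perturb the points of one local neighbourhood with Gaussian noise, then
    combine n_affine of the noisy copies with random weights that sum to one."""
    i = rng.integers(len(minority))
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbourhood = minority[np.argsort(dists)[:k + 1]]  # point plus k neighbours
    # n_shadow noisy copies ("shadowsamples") of every neighbourhood point
    shadows = np.repeat(neighbourhood, n_shadow, axis=0)
    shadows = shadows + rng.normal(0.0, sigma, size=shadows.shape)
    chosen = shadows[rng.choice(len(shadows), size=n_affine, replace=False)]
    weights = rng.dirichlet(np.ones(n_affine))  # positive, sums to one
    return weights @ chosen

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 3))  # toy minority class, 20 points in 3-D
print(smote_like_sample(minority, 5, rng))
print(shadowsample_like_sample(minority, 5, 4, 3, 0.005, rng))
```

Averaging over several perturbed points rather than interpolating between two original points is one intuitive reading of why the abstract describes the method as estimating the local mean of the minority class distribution; the formal argument is in the article itself.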
Funder
Bundesministerium für Bildung und Forschung
Universität Rostock
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence, Software
Cited by
82 articles.