Data Balancing Techniques for Predicting Student Dropout Using Machine Learning

Published:2023-02-27 Issue:3 Volume:8 Page:49
ISSN:2306-5729
Container-title:Data
language:en
Short-container-title:Data

Author:

Mduma Neema¹

Affiliation:

1. Department of Information and Communication Sciences and Engineering, The Nelson Mandela African Institution of Science and Technology, Arusha P.O. Box 447, Tanzania

Abstract

Predicting student dropout is a challenging problem in the education sector. The difficulty stems from an imbalance in student dropout data, mainly because the number of registered students is always higher than the number of students who drop out. Developing a model without taking this imbalance into account may yield a model that generalizes poorly. In this study, different data balancing techniques were applied to improve prediction accuracy in the minority class while maintaining a satisfactory overall classification performance. Random Over Sampling, Random Under Sampling, the Synthetic Minority Over-sampling Technique (SMOTE), SMOTE with Edited Nearest Neighbor, and SMOTE with Tomek links were tested, along with three popular classification models: Logistic Regression, Random Forest, and Multi-Layer Perceptron. Publicly accessible datasets from Tanzania and India were used to evaluate the effectiveness of the balancing techniques and prediction models. The results indicate that SMOTE with Edited Nearest Neighbor achieved the best classification performance on the 10-fold holdout sample. Furthermore, Logistic Regression correctly classified the largest number of dropout students (57,348 for the Uwezo dataset and 13,430 for the India dataset) when the confusion matrix was used as the evaluation metric. Applying these models allows at-risk students to be predicted precisely and supports the reduction of dropout rates.
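The two simplest balancing techniques named in the abstract, Random Over Sampling and Random Under Sampling, can be illustrated with a minimal standard-library Python sketch. This is not the paper's implementation: the function names, the toy data, and the label convention (1 = dropout, 0 = enrolled) are assumptions made for illustration, and the confusion-matrix helper only shows how correctly classified dropout students (true positives) would be counted.

```python
import random
from collections import Counter


def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_under_sample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def confusion_counts(y_true, y_pred, positive=1):
    """Return the four confusion-matrix cells for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(t == positive and p == positive for t, p in pairs),
        "fn": sum(t == positive and p != positive for t, p in pairs),
        "fp": sum(t != positive and p == positive for t, p in pairs),
        "tn": sum(t != positive and p != positive for t, p in pairs),
    }


# Toy data (made up): 8 enrolled students (0) and 2 dropouts (1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_over_sample(X, y)
X_under, y_under = random_under_sample(X, y)
print(Counter(y_over), Counter(y_under))
```

In practice, resampling would be applied to the training split only, so that the held-out data keep their original class distribution. SMOTE and its ENN and Tomek-link variants generate or remove samples based on nearest neighbors and are not covered by this sketch.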

Funder

Canada’s International Development Research Centre, Ottawa, Canada and the Swedish International Development Cooperation Agency

Publisher

MDPI AG

Subject

Information Systems and Management, Computer Science Applications, Information Systems

Link

https://www.mdpi.com/2306-5729/8/3/49/pdf

References (91 articles)

1. Class-imbalanced classifiers for high-dimensional data;Lin;Brief. Bioinform.,2013

2. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics;Palade;Inf. Sci.,2013

3. Krawczyk, B. (2015). Proceedings of the 9th International Conference on Computer Recognition Systems CORES 2015, Springer International Publishing.

4. Galar, M., Fernández, A., Barrenechea, E., Bustince, H., and Herrera, F. (2016). Proceedings of the 9th International Conference on Computer Recognition Systems CORES 2015, Springer International Publishing.

5. Learning from imbalanced data: Open challenges and future directions;Krawczyk;Prog. Artif. Intell.,2016

Cited by 11 articles.

1. Thermal and visual comforts of occupants for a naturally ventilated educational building in low-income economies: A machine learning approach;Journal of Building Engineering;2024-10

2. Optimised SMOTE-based Imbalanced Learning for Student Dropout Prediction;Arabian Journal for Science and Engineering;2024-07-09

3. A novel approach to mitigate academic underachievement in higher education: Feature selection, classifier performance, and interpretability in predicting student performance;International Journal of Advanced and Applied Sciences;2024-05

4. Predicting College Dropout Rates using Machine Learning: A Student Success Initiative;2024 International Conference on Computing and Data Science (ICCDS);2024-04-26

5. Early prediction models and crucial factor extraction for first-year undergraduate student dropouts;Journal of Applied Research in Higher Education;2024-03-19