Author:
Chowdhury Mohammad Mihrab, Ayon Ragib Shahariar, Hossain Md Sakhawat
Abstract
Diabetes is a prevalent chronic condition that poses significant challenges for early diagnosis and for identifying at-risk individuals. Machine learning plays a crucial role in diabetes detection because it can process large volumes of data and identify complex patterns. However, imbalanced data, where the number of diabetic cases is substantially smaller than the number of non-diabetic cases, complicates the identification of individuals with diabetes using machine learning algorithms. Our study focuses on predicting whether a person is at risk of diabetes, considering the individual’s health and socio-economic conditions while mitigating the challenges posed by imbalanced data. To minimize the impact of imbalanced data, we applied several data augmentation techniques to the training data before applying machine learning algorithms: oversampling (SMOTE-N), undersampling (ENN), and hybrid sampling (SMOTE-Tomek and SMOTE-ENN). Our study sheds light on the importance of applying data augmentation techniques carefully, without any data leakage, to enhance the effectiveness of machine learning algorithms. Moreover, it offers a complete machine learning framework for healthcare practitioners, from data acquisition to ML prediction, enabling them to develop data-informed strategies.
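The point about resampling the training data only, without data leakage, can be illustrated with a minimal sketch. It is not the authors' pipeline: plain random oversampling stands in for SMOTE-N so the example needs only the standard library, and the toy records, the 90/10 class ratio, and the helper names `split_train_test` and `random_oversample` are all assumptions made for illustration. The key step is the ordering: the split happens first, and only the training portion is rebalanced.

```python
import random
from collections import Counter


def split_train_test(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle indices and split once, BEFORE any resampling (hypothetical helper)."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest.

    This is simple random oversampling, used here in place of SMOTE-N.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Made-up categorical records: 90 non-diabetic (0) and 10 diabetic (1).
rows = [("age_group_%d" % (i % 5), "income_%d" % (i % 3)) for i in range(100)]
labels = [0] * 90 + [1] * 10

# 1) Split first, so the held-out set keeps the original class distribution.
(train_X, train_y), (test_X, test_y) = split_train_test(rows, labels)

# 2) Rebalance the training portion only; the test set is never resampled.
bal_X, bal_y = random_oversample(train_X, train_y)

print("train before:", Counter(train_y))
print("train after: ", Counter(bal_y))
print("test (untouched):", Counter(test_y))
```

Resampling before the split would let copies of the same minority record appear on both sides of the split, which inflates test scores; keeping the test set untouched avoids that. In cross-validation the same ordering applies within each fold.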
Publisher
Cold Spring Harbor Laboratory