Author:
Pias Tanmoy Sarkar, Su Yiqi, Tang Xuxin, Wang Haohui, Faghani Shahriar, Yao Danfeng (Daphne)
Abstract
Machine learning (ML) methods have gained significant traction in healthcare because of their capacity to improve diagnosis, treatment, and patient outcomes. Nevertheless, mitigating bias in these models is imperative to ensure equitable healthcare regardless of demographic factors such as age, gender, and ethnicity. This study explores the effectiveness of various sampling strategies for balancing imbalanced datasets, with the aim of improving the accuracy of type 2 diabetes prediction. The investigation applies multiple ML classifiers to the inherently imbalanced Behavioral Risk Factor Surveillance System (BRFSS) datasets. Three ML algorithms, namely Logistic Regression, Random Forest, and Multilayer Perceptron, are assessed on both the original and resampled datasets. The study reveals that balancing the dataset through undersampling and oversampling techniques improves the models' sensitivity and balanced accuracy by at least 52% and 15%, respectively. However, certain methods, such as SMOTE, ADASYN, Tomek Links, Edited Nearest Distance, and Near Miss, do not notably improve model sensitivity. This pattern of performance improvement holds consistently across multiple years of datasets (2021, 2019, 2017, and 2015). The analysis further shows that models trained on raw, imbalanced datasets exhibit poor sensitivity across various subgroups, particularly among the White population (sensitivity 0.17). Subgroup-based resampling techniques improve sensitivity and balanced accuracy by at least 45% and 10%, respectively. Notably, the study identifies blood pressure, kidney disease, cholesterol levels, and BMI as the most important indicators of type 2 diabetes.
This research underscores the potential of resampling as a promising approach to developing more equitable, balanced, and accurate ML models, especially when addressing disparities in healthcare outcomes.
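To make the terms in the abstract concrete, the following is a minimal sketch of random undersampling, random oversampling, a per-subgroup variant, and the two reported metrics (sensitivity and balanced accuracy). It is an illustration under stated assumptions, not the authors' pipeline: the function names, the toy data, and the choice to balance classes independently within each subgroup are all hypothetical, and the study itself may have used library implementations and different settings.

```python
import random
from collections import Counter


def _split_by_class(rows, labels):
    """Group rows by their class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class has the minority-class count."""
    rng = random.Random(seed)
    by_class = _split_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class has the majority-class count."""
    rng = random.Random(seed)
    by_class = _split_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def resample_by_subgroup(rows, labels, groups, sampler, seed=0):
    """Apply a sampler separately inside each subgroup (assumed design)."""
    by_group = {}
    for row, label, group in zip(rows, labels, groups):
        group_rows, group_labels = by_group.setdefault(group, ([], []))
        group_rows.append(row)
        group_labels.append(label)
    out_rows, out_labels, out_groups = [], [], []
    for group, (group_rows, group_labels) in by_group.items():
        new_rows, new_labels = sampler(group_rows, group_labels, seed=seed)
        out_rows.extend(new_rows)
        out_labels.extend(new_labels)
        out_groups.extend([group] * len(new_rows))
    return out_rows, out_labels, out_groups


def sensitivity(y_true, y_pred, positive=1):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(y_true, y_pred, positive=1):
    """True negative rate: TN / (TN + FP)."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    return tn / (tn + fp) if (tn + fp) else 0.0


def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity and specificity for a binary task."""
    return (sensitivity(y_true, y_pred, positive) + specificity(y_true, y_pred, positive)) / 2


# Toy data (hypothetical): two subgroups, each with 45 negatives and 5 positives.
toy_rows = list(range(100))
toy_labels = ([0] * 45 + [1] * 5) * 2
toy_groups = ["A"] * 50 + ["B"] * 50
```

On the toy data, a classifier that always predicts the majority class has a sensitivity of 0 and a balanced accuracy of 0.5 even though its plain accuracy is 90%, which is why the abstract reports sensitivity and balanced accuracy rather than accuracy alone. Calling `resample_by_subgroup(toy_rows, toy_labels, toy_groups, random_undersample)` balances the classes within each subgroup before training.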
Publisher
Cold Spring Harbor Laboratory