BACKGROUND
Preeclampsia is a major challenge in obstetrics. Early prediction is crucial for timely intervention, but developing predictive models is complicated by the class imbalance inherent in clinical data.
OBJECTIVE
This study aims to develop a robust pipeline that enhances the predictive performance of ensemble machine learning models for the early prediction of preeclampsia in an imbalanced dataset.
METHODS
We evaluated combinations of six ensemble machine learning algorithms and eight resampling techniques across a range of minority-to-majority ratios. Using statistical methods, we systematically identified and optimized the best-performing configurations, focusing on key performance metrics such as the Geometric Mean.
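For readers unfamiliar with the metric, the following is a minimal sketch of the Geometric Mean as it is commonly defined for binary classification: the square root of sensitivity times specificity. The abstract does not state the exact formula used, so this definition, the function name `geometric_mean`, and the toy labels are assumptions for illustration only and are not study data.

```python
import math

def geometric_mean(y_true, y_pred, positive=1):
    """G-mean = sqrt(sensitivity * specificity) for binary labels (assumed definition)."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1  # minority case correctly flagged
            else:
                fn += 1  # minority case missed
        else:
            if p == positive:
                fp += 1  # majority case wrongly flagged
            else:
                tn += 1  # majority case correctly cleared
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Toy labels (illustrative only): 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]

g_mixed = geometric_mean(y_true, y_pred)            # sensitivity 0.5, specificity 0.75
g_all_negative = geometric_mean(y_true, [0] * 10)   # 80% accuracy, sensitivity 0
```

Because the two rates are multiplied, a model that predicts only the majority class scores zero regardless of its accuracy, which is why the metric is often preferred on imbalanced data.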
RESULTS
Strategic optimization of variable selection and settings proved crucial. The most effective configuration combined Inverse Weighted Gaussian Mixture Model resampling with the Gradient Boosting Decision Trees algorithm at an optimized minority-to-majority ratio of 0.09, achieving a Geometric Mean of 0.6694. This configuration significantly outperformed the baseline across all evaluated metrics.
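To make the minority-to-majority ratio concrete, the sketch below computes how many synthetic minority samples would be needed to reach a target ratio. It assumes the resampler adds minority samples and leaves the majority class unchanged, which the abstract does not confirm. The function name `synthetic_samples_needed` and the class counts are hypothetical and are not taken from the study.

```python
def synthetic_samples_needed(n_minority, n_majority, target_ratio):
    """Minority samples to add so that minority / majority reaches target_ratio.

    Assumes oversampling only: the majority class is left untouched.
    Returns 0 when the data already meet or exceed the target ratio.
    """
    target_minority = round(target_ratio * n_majority)
    return max(0, target_minority - n_minority)

# Hypothetical class counts (not study data).
n_majority, n_minority = 1000, 30

to_add = synthetic_samples_needed(n_minority, n_majority, target_ratio=0.09)
ratio_after = (n_minority + to_add) / n_majority
```

In this toy case the original ratio is 0.03, so reaching 0.09 means tripling the minority class while it still remains far smaller than the majority class.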
CONCLUSIONS
This study establishes a robust pipeline that significantly enhances the predictive performance of models for preeclampsia on imbalanced datasets. Our findings underscore the importance of a strategic approach to variable optimization in medical diagnostics and suggest potential for application in other medical contexts where class imbalance is a concern.