Abstract
ABSTRACTBackgroundDeep learning models may combat widening racial disparities in heart failure outcomes through early identification of individuals at high risk. However, demographic biases in the performance of these models have not been well studied.MethodsThis retrospective analysis used 12-lead ECGs taken between 2008 - 2018 from 290,252 patients referred for standard clinical indications to Stanford Hospital. The primary model was a convolutional neural network model trained to predict incident heart failure within 5 years. Biases were evaluated on the testing set (160,312 ECGs) using area under the receiver operating curve (AUC), stratified across the protected attributes of race, ethnicity, age, and sex.Results50,956 incident cases of heart failure were observed within 5 years of ECG collection. The performance of the primary model declined with age. There were no significant differences observed between racial groups overall. However, the primary model performed significantly worse in Black patients aged 0 - 40 compared to all other racial groups in this age group, with differences most pronounced among young Black women. Disparities in model performance did not improve with integration of race, ethnicity, gender, and/or age into model architecture, by training separate models for each racial group, nor by providing the model with a dataset of equal racial representation. Using probability thresholds individualized for race, age, and gender offered substantial improvements in F1-scores.ConclusionThe biases found in this study warrant caution against perpetuating disparities through the development of machine learning tools for the prognosis and management of heart failure. Customizing the application of these models by using probability thresholds individualized by race/ethnicity, age, and sex may offer an avenue to mitigate existing algorithmic disparities.
Publisher
Cold Spring Harbor Laboratory