Performance drift is a major barrier to the safe use of machine learning in cardiac surgery

Author:

Dong Tim, Sinha Shubhra, Zhai Ben, Fudulu Daniel P, Chan Jeremy, Narayan Pradeep, Judge Andy, Caputo Massimo, Dimagli Arnaldo, Benedetto Umberto, Angelini Gianni D.

Abstract

Objectives: The Society of Thoracic Surgeons (STS) and EuroSCORE II (ES II) risk scores are the most commonly used risk prediction models for in-hospital mortality after adult cardiac surgery. However, they are prone to miscalibration over time and generalise poorly across datasets, and their use remains controversial. It has been suggested that Machine Learning (ML) techniques, a branch of Artificial Intelligence (AI), may improve the accuracy of risk prediction. Despite increased interest, a gap in understanding the effect of dataset drift on the performance of ML over time remains a barrier to its wider use in clinical practice. Dataset drift occurs when a machine learning system underperforms because of a mismatch between the dataset on which it was developed and the data on which it is deployed. Here we analyse this potential concern in a large United Kingdom (UK) database.

Methods: A retrospective analysis of prospectively and routinely gathered data on adult patients undergoing cardiac surgery in the UK between 2012 and 2019. We temporally split the data 70:30 into a training and a validation subset. ES II and five ML mortality prediction models were assessed for relationships between and within variable importance drift, performance drift and actual dataset drift using temporal and non-temporal invariant consensus scoring, combining geometric average results of all metrics as the Clinical Effective Metric (CEM).

Results: A total of 227,087 adults underwent cardiac surgery during the study period, with a mortality rate of 2.76%. There was strong evidence of a decrease in overall performance across all models (p < 0.0001). XGBoost (CEM 0.728, 95% CI 0.728-0.729) and Random Forest (CEM 0.727, 95% CI 0.727-0.728) were the best overall performing models both temporally and non-temporally. ES II performed worst across all comparisons. Sharp changes in variable importance and dataset drift from 2017-10 to 2017-12, 2018-06 to 2018-07 and 2018-12 to 2019-02 mirrored the decreases in performance across models.

Conclusions: Combining the metrics covering all four aspects of discrimination, calibration, clinical usefulness and overall accuracy into a single consensus metric improved the efficiency of cognitive decision-making. All models showed a decrease in at least 3 of the 5 individual metrics. CEM and variable importance drift detection demonstrate the limitations of the logistic regression methods used for cardiac surgery risk prediction and the effects of dataset drift. Future work will be required to determine the interplay between ML models and whether ensemble models could take advantage of their respective performance advantages.

Central message: ML performance decreases over time due to dataset drift, but remains superior to ES II. Therefore, regular assessment and modification of ML models may be preferable.

Prospective message: A gap in understanding the effect of dataset drift on the performance of ML models over time presents a major barrier to their clinical application. XGBoost and Random Forest have shown superior performance both temporally and non-temporally against ES II. However, a decrease in the performance of all models due to dataset drift suggests the need for regular drift monitoring.
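
The two mechanical steps named in the Methods, a temporal 70:30 split and a consensus score formed as a geometric average of several metrics, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not name the five individual metrics or say how they are scaled, so the function names, the record layout and the assumption that every metric is already rescaled to the range (0, 1] with higher meaning better are all hypothetical.

```python
from math import prod


def temporal_split(records, train_fraction=0.7, key=lambda r: r["date"]):
    """Sort records by date and cut once, so every training record
    precedes (or ties with) every validation record in time."""
    ordered = sorted(records, key=key)
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def consensus_score(metrics):
    """Geometric mean of a dict of metric values.

    Assumption (not from the abstract): each value lies in (0, 1] and
    higher is better, so a lower-is-better metric would have to be
    inverted before being passed in.
    """
    values = list(metrics.values())
    if not values or any(v <= 0 for v in values):
        raise ValueError("metrics must be non-empty and strictly positive")
    return prod(values) ** (1.0 / len(values))
```

A geometric mean penalises a model that scores poorly on any single metric more heavily than an arithmetic mean would, which is one plausible reason for choosing it when combining discrimination, calibration, clinical usefulness and overall accuracy into one number.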

Publisher

Cold Spring Harbor Laboratory
