Abstract
Objectives
The Society of Thoracic Surgeons (STS) and EuroSCORE II (ES II) risk scores are the most commonly used models for predicting post-operative in-hospital mortality after adult cardiac surgery. However, they are prone to miscalibration over time and poor generalisation across datasets, and their use remains controversial. It has been suggested that machine learning (ML) techniques, a branch of artificial intelligence (AI), may improve the accuracy of risk prediction. Despite increased interest, a gap in understanding the effect of dataset drift on the performance of ML models over time remains a barrier to their wider use in clinical practice. Dataset drift occurs when a machine learning system underperforms because of a mismatch between the data on which it was developed and the data on which it is deployed. Here we analyse this potential concern in a large United Kingdom (UK) database.

Methods
We performed a retrospective analysis of prospectively and routinely gathered data on adult patients undergoing cardiac surgery in the UK between 2012 and 2019. The data were split temporally 70:30 into training and validation subsets. ES II and five ML mortality prediction models were assessed for relationships between and within variable importance drift, performance drift and actual dataset drift, using temporal and non-temporal invariant consensus scoring that combines the geometric average of all metrics into a single Clinical Effectiveness Metric (CEM).

Results
A total of 227,087 adults underwent cardiac surgery during the study period, with a mortality rate of 2.76%. There was strong evidence of a decrease in overall performance across all models (p < 0.0001). XGBoost (CEM 0.728, 95% CI 0.728-0.729) and Random Forest (CEM 0.727, 95% CI 0.727-0.728) were the best overall performing models both temporally and non-temporally. ES II performed worst across all comparisons. Sharp changes in variable importance and dataset drift from 2017-10 to 2017-12, 2018-06 to 2018-07 and 2018-12 to 2019-02 mirrored the decrease in performance across models.

Conclusions
Combining metrics covering all four aspects of discrimination, calibration, clinical usefulness and overall accuracy into a single consensus metric improved the efficiency of cognitive decision-making. All models showed a decrease in at least three of the five individual metrics. CEM and variable importance drift detection demonstrate the limitations of the logistic regression methods used for cardiac surgery risk prediction and the effects of dataset drift. Future work will be required to determine the interplay between ML models and whether ensemble models could take advantage of their respective performance advantages.

Central message
ML performance decreases over time due to dataset drift but remains superior to ES II. Therefore, regular assessment and modification of ML models may be preferable.

Perspective message
A gap in understanding the effect of dataset drift on the performance of ML models over time presents a major barrier to their clinical application. XGBoost and Random Forest showed superior performance both temporally and non-temporally against ES II. However, the decrease in performance of all models due to dataset drift suggests the need for regular drift monitoring.
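As an illustration of the consensus-scoring idea described in the Methods, the sketch below combines a temporal 70:30 split with a geometric-mean consensus score in the spirit of the CEM. The component metrics chosen here (ROC AUC, 1 - Brier score, F1 and average precision), the classification threshold, the Random Forest configuration and the synthetic data are all assumptions made for illustration; they are not the authors' implementation.

```python
"""Minimal sketch, assuming hypothetical component metrics and synthetic data:
a temporal 70:30 train/validation split followed by a geometric-mean consensus
score covering discrimination, calibration, accuracy and a clinical-usefulness
proxy, in the spirit of the CEM described above."""
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             f1_score, roc_auc_score)

rng = np.random.default_rng(0)

# Synthetic stand-in for a surgical registry: dates, features, ~3% mortality.
n = 5_000
dates = np.sort(rng.integers(0, 2920, size=n))   # days across a 2012-2019 window
X = rng.normal(size=(n, 10))
y = (rng.random(n) < 0.03 + 0.02 * (X[:, 0] > 1)).astype(int)

# Temporal 70:30 split: earlier operations train the model, later ones validate it.
cut = int(0.7 * n)
X_tr, X_va, y_tr, y_va = X[:cut], X[cut:], y[:cut], y[cut:]

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
p_va = model.predict_proba(X_va)[:, 1]

def consensus_metric(y_true, y_prob, threshold=0.5):
    """Geometric mean of [0, 1]-scaled metrics; higher is better.
    The four components are assumed proxies for discrimination, calibration,
    overall accuracy and clinical usefulness respectively."""
    parts = np.array([
        roc_auc_score(y_true, y_prob),               # discrimination
        1.0 - brier_score_loss(y_true, y_prob),      # calibration (1 - Brier)
        f1_score(y_true, y_prob >= threshold),       # accuracy at a fixed threshold
        average_precision_score(y_true, y_prob),     # usefulness under class imbalance
    ])
    return float(np.exp(np.log(np.clip(parts, 1e-12, 1.0)).mean()))

print(f"validation consensus score: {consensus_metric(y_va, p_va):.3f}")
```

Recomputing such a score on successive calendar windows of the validation period, rather than on the whole 30% at once, is one simple way to expose the kind of performance drift the abstract reports.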
Publisher
Cold Spring Harbor Laboratory
Cited by
3 articles.