Abstract
In supervised learning model development, domain experts are often relied upon to provide the class labels (annotations). Annotation inconsistencies commonly arise even when highly experienced clinical experts annotate the same phenomenon (e.g., a medical image, a diagnosis, or prognostic status), owing to inherent expert bias, differences in judgment, and slips, among other factors. While their existence is relatively well known, the implications of such inconsistencies remain largely understudied in real-world settings where supervised learning is applied to such ‘noisy’ labelled data. To shed light on these issues, we conducted extensive experiments and analyses on three real-world Intensive Care Unit (ICU) datasets. Specifically, individual models were built from a common dataset annotated independently by 11 ICU consultants at the Queen Elizabeth University Hospital, Glasgow, and model performance estimates were compared through internal validation (Fleiss’ κ = 0.383, i.e., fair agreement). Broad external validation (on both static and time-series datasets) of these 11 classifiers was then carried out on the HiRID external dataset, where the models’ classifications were found to have low pairwise agreement (average Cohen’s κ = 0.255, i.e., minimal agreement). Moreover, the models tended to disagree more on discharge decisions (Fleiss’ κ = 0.174) than on mortality prediction (Fleiss’ κ = 0.267). Given these inconsistencies, further analyses were conducted to evaluate current best practices for obtaining gold-standard models and determining consensus. The results suggest that: (a) there may not always be a “super expert” in acute clinical settings (using internal and external validation model performance as a proxy); and (b) standard consensus seeking (such as majority vote) consistently leads to suboptimal models. Further analysis, however, suggests that assessing annotation learnability and using only ‘learnable’ annotated datasets for determining consensus yields optimal models in most cases.
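As a rough illustration of the agreement statistics reported above, the sketch below shows one way the pairwise Cohen’s κ, the Fleiss’ κ, and a majority-vote consensus could be computed over the predictions of 11 per-annotator classifiers. The variable names and the synthetic prediction array are illustrative assumptions, not the study’s actual code or data.

# Minimal sketch (Python): agreement between per-annotator classifiers on an
# external dataset, plus a simple majority-vote consensus label.
# `predictions` is a stand-in array, not the paper's data.
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# predictions[i, j] = binary label (e.g., mortality) assigned to sample i by
# the model trained on annotator j's labels; 11 annotators assumed.
predictions = rng.integers(0, 2, size=(1000, 11))

# Average pairwise Cohen's kappa across all pairs of the 11 classifiers.
pairwise = [cohen_kappa_score(predictions[:, a], predictions[:, b])
            for a, b in combinations(range(predictions.shape[1]), 2)]
print("mean pairwise Cohen's kappa:", np.mean(pairwise))

# Fleiss' kappa treats each classifier as a rater of every sample.
table, _ = aggregate_raters(predictions)
print("Fleiss' kappa:", fleiss_kappa(table, method='fleiss'))

# Majority-vote consensus: the label chosen by most of the 11 classifiers.
consensus = (predictions.sum(axis=1) > predictions.shape[1] / 2).astype(int)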
Funder
DH | National Institute for Health Research
British Council
University of Edinburgh
RCUK | Medical Research Council
Alan Turing Institute
Publisher
Springer Science and Business Media LLC
Subject
Health Information Management, Health Informatics, Computer Science Applications, Medicine (miscellaneous)
Cited by
22 articles.