The impact of inconsistent human annotations on AI-driven clinical decision making

Authors

Aneeta Sylolypavan, Derek Sleeman, Honghan Wu, Malcolm Sim

Abstract

In supervised learning model development, domain experts are often used to provide the class labels (annotations). Annotation inconsistencies commonly occur when even highly experienced clinical experts annotate the same phenomenon (e.g., a medical image, diagnosis, or prognostic status), due to inherent expert bias, differing judgments, and slips, among other factors. While their existence is relatively well known, the implications of such inconsistencies are largely understudied in real-world settings when supervised learning is applied to such ‘noisy’ labelled data. To shed light on these issues, we conducted extensive experiments and analyses on three real-world Intensive Care Unit (ICU) datasets. Specifically, individual models were built from a common dataset, annotated independently by 11 ICU consultants at the Queen Elizabeth University Hospital, Glasgow, and model performance estimates were compared through internal validation (Fleiss’ κ = 0.383, i.e., fair agreement). Further, broad external validation (on both static and time-series datasets) of these 11 classifiers was carried out on the HiRID external dataset, where the models’ classifications were found to have low pairwise agreement (average Cohen’s κ = 0.255, i.e., minimal agreement). Moreover, the models disagreed more on making discharge decisions (Fleiss’ κ = 0.174) than on predicting mortality (Fleiss’ κ = 0.267). Given these inconsistencies, further analyses were conducted to evaluate current best practices for obtaining gold-standard models and determining consensus. The results suggest that: (a) there may not always be a “super expert” in acute clinical settings (using internal and external validation model performances as a proxy); and (b) standard consensus seeking (such as majority vote) consistently leads to suboptimal models. Further analysis, however, suggests that assessing annotation learnability and using only ‘learnable’ annotated datasets for determining consensus achieves optimal models in most cases.
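The agreement statistics quoted above are standard chance-corrected measures. As a minimal sketch (not the authors' code), the Python snippet below shows how such figures can be computed with statsmodels and scikit-learn on a synthetic annotation matrix; the 200 patients, the binary label scheme, and the random data are illustrative assumptions, with only the count of 11 annotators taken from the abstract.

import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
n_patients, n_raters = 200, 11  # 11 mirrors the ICU consultants in the paper

# Synthetic annotation matrix: rows = patients, columns = annotators,
# values = binary labels (e.g., 1 = ready for discharge, 0 = not ready).
labels = rng.integers(0, 2, size=(n_patients, n_raters))

# Fleiss' kappa: chance-corrected agreement across all annotators at once.
counts, _ = aggregate_raters(labels)  # patients x categories count table
print(f"Fleiss' kappa: {fleiss_kappa(counts):.3f}")

# Average pairwise Cohen's kappa: agreement between every pair of annotators,
# the style of comparison applied to the 11 classifiers' external predictions.
pairs = [cohen_kappa_score(labels[:, i], labels[:, j])
         for i, j in combinations(range(n_raters), 2)]
print(f"Mean pairwise Cohen's kappa: {np.mean(pairs):.3f}")

# Majority-vote consensus label per patient: the baseline consensus
# strategy that the paper reports as consistently suboptimal.
consensus = (labels.sum(axis=1) > n_raters / 2).astype(int)
print(f"Consensus positives: {consensus.sum()} / {n_patients}")

Random labels yield κ near zero, which is the chance-corrected baseline; genuinely consistent annotations push κ toward 1, which is why the low values reported in the study indicate substantive annotator disagreement rather than noise in the metric.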

Funder

DH | National Institute for Health Research

British Council

University of Edinburgh

RCUK | Medical Research Council

Alan Turing Institute

Publisher

Springer Science and Business Media LLC

Subject

Health Information Management, Health Informatics, Computer Science Applications, Medicine (miscellaneous)

Cited by 22 articles.
