An Extensive Study on Cross-Dataset Bias and Evaluation Metrics Interpretation for Machine Learning Applied to Gastrointestinal Tract Abnormality Classification

Published: 2020-07-31
Issue: 3
Volume: 1
Page: 1-29
ISSN: 2691-1957
Container-title: ACM Transactions on Computing for Healthcare
Language: en
Short-container-title: ACM Trans. Comput. Healthcare

Author:

Thambawita Vajira¹, Jha Debesh², Hammer Hugo Lewi³, Johansen Håvard D.⁴, Johansen Dag⁴, Halvorsen Pål¹, Riegler Michael A.⁵

Affiliation:

1. SimulaMet and Oslo Metropolitan University, Oslo, Norway

2. SimulaMet and UiT—The Arctic University of Norway, Tromsø, Norway

3. Oslo Metropolitan University and SimulaMet, Oslo, Norway

4. UiT—The Arctic University of Norway, Tromsø, Norway

5. SimulaMet, Oslo, Norway

Abstract

Precise and efficient automated identification of gastrointestinal (GI) tract diseases can help doctors treat more patients and improve the rate of disease detection and identification. Currently, automatic analysis of diseases in the GI tract is a hot topic in both computer science and medical-related journals. Nevertheless, the evaluation of such an automatic analysis is often incomplete or simply wrong. Algorithms are often only tested on small and biased datasets, and cross-dataset evaluations are rarely performed. A clear understanding of evaluation metrics and machine learning models with cross datasets is crucial to bring research in the field to a new quality level. Toward this goal, we present comprehensive evaluations of five distinct machine learning models using global features and deep neural networks that can classify 16 different key types of GI tract conditions, including pathological findings, anatomical landmarks, polyp removal conditions, and normal findings from images captured by common GI tract examination instruments. In our evaluation, we introduce performance hexagons using six performance metrics, such as recall, precision, specificity, accuracy, F1-score, and the Matthews correlation coefficient to demonstrate how to determine the real capabilities of models rather than evaluating them shallowly. Furthermore, we perform cross-dataset evaluations using different datasets for training and testing. With these cross-dataset evaluations, we demonstrate the challenge of actually building a generalizable model that could be used across different hospitals. Our experiments clearly show that more sophisticated performance metrics and evaluation methods need to be applied to get reliable models rather than depending on evaluations of the splits of the same dataset—that is, the performance metrics should always be interpreted together rather than relying on a single metric.
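To illustrate why the abstract argues for reading the six metrics together, here is a minimal Python sketch that computes them from a single binary confusion matrix using their standard textbook definitions. It is not the authors' code: the function name `six_metrics`, the zero-denominator convention (return 0.0), and the imbalanced example counts are all assumptions made for illustration.

```python
import math


def six_metrics(tp, fp, fn, tn):
    """Compute recall, precision, specificity, accuracy, F1, and MCC.

    Inputs are the four cells of a binary confusion matrix.
    Assumption: any ratio with a zero denominator is reported as 0.0.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    specificity = safe_div(tn, tn + fp)
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical imbalanced case: 10 positive images, 990 negative images,
# and a model that finds only one of the positives.
scores = six_metrics(tp=1, fp=0, fn=9, tn=990)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case, accuracy, precision, and specificity are all near or at 1.0 while recall is 0.1, which is the kind of discrepancy a single headline metric would hide.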

Funder

Norges Forskningsråd

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3386295

References: 83 articles.

Cited by 40 articles.

1. Deep feature analysis, classification with AI-driven gastrointestinal diagnostics; MATEC Web of Conferences; 2024

2. Research on Short-Term Prediction Method of Photovoltaic Power Generation Based on Improved Snake Optimization Algorithm for Optimizing Gate Recurrent Unit; 2023 5th International Conference on Smart Power & Internet Energy Systems (SPIES); 2023-12-01

3. A Deep Diagnostic Framework Using Explainable Artificial Intelligence and Clustering; Diagnostics; 2023-11-09

4. A systematic review on intracranial aneurysm and hemorrhage detection using machine learning and deep learning techniques; Progress in Biophysics and Molecular Biology; 2023-10

5. Leveraging physiology and artificial intelligence to deliver advancements in health care; Physiological Reviews; 2023-10-01