Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results

Published: 2021-08-12 Issue: 8 Volume: 16 Page: e0256152
ISSN: 1932-6203
Container-title: PLOS ONE
Language: en
Short-container-title: PLoS ONE

Authors:

An Chansik, Park Yae Won, Ahn Sung Soo, Han Kyunghwa, Kim Hwiyoung, Lee Seung-Koo

Abstract

This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model, and the gap between that estimate and the test performance, under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) a “simple” task, glioblastoma [n = 109] vs. brain metastasis [n = 58], and (2) a “difficult” task, low-grade [n = 163] vs. high-grade [n = 95] meningioma. Additionally, two undersampled datasets were created by randomly sampling 50% of each of these datasets. For each of the four datasets, we repeated random training-test set splitting to create 1,000 different training-test set pairs. For each pair, a least absolute shrinkage and selection operator (LASSO) model was trained and evaluated using various validation methods in the training set, then tested in the test set, with the area under the curve (AUC) as the evaluation metric. The AUCs in training and testing varied among training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between training and testing was 0.039 (±0.032) for the simple task without undersampling and 0.092 (±0.071) for the difficult task with undersampling. For example, in one training-test set pair for the difficult task without undersampling, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another pair for the same task, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and testing (the generalization gap) was large, none of the validation methods reduced it sufficiently. Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies, especially with small sample sizes.
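The repeated-splitting experiment described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic Gaussian features (only the class sizes 109 and 58 are taken from the abstract), a simple difference-of-class-means scorer stands in for the LASSO model, the split is stratified with a 30% test fraction, the training AUC comes from pooled 5-fold cross-validation, and 200 repeats are used instead of 1,000. All function names and parameters below are hypothetical choices made for illustration.

```python
import random
import statistics


def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_dataset(n_pos, n_neg, n_features, shift, rng):
    """Synthetic stand-in for radiomics data: Gaussian noise features, with the
    first three shifted by `shift` in the positive class (an assumption)."""
    rows = []
    for label, n in ((1, n_pos), (0, n_neg)):
        for _ in range(n):
            x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                x[:3] = [v + shift for v in x[:3]]
            rows.append((x, label))
    return rows


def fit(rows):
    """Stand-in classifier (NOT LASSO): weights = difference of class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return [statistics.fmean(p) - statistics.fmean(n)
            for p, n in zip(zip(*pos), zip(*neg))]


def predict(weights, rows):
    """Linear score for each row."""
    return [sum(w * v for w, v in zip(weights, x)) for x, _ in rows]


def cross_validated_auc(rows, k, rng):
    """Pooled out-of-fold AUC from k-fold cross-validation in the training set."""
    order = list(range(len(rows)))
    rng.shuffle(order)
    scores = [0.0] * len(rows)
    for fold in range(k):
        held_out = set(order[fold::k])
        weights = fit([rows[i] for i in order if i not in held_out])
        for i in held_out:
            scores[i] = predict(weights, [rows[i]])[0]
    return auc(scores, [y for _, y in rows])


def stratified_split(rows, test_fraction, rng):
    """Random split that keeps the class ratio roughly equal in both sets."""
    train, test = [], []
    for label in (0, 1):
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        n_test = round(test_fraction * len(group))
        test += group[:n_test]
        train += group[n_test:]
    return train, test


def generalization_gaps(rows, n_repeats, rng, test_fraction=0.3, k=5):
    """Cross-validated training AUC minus test AUC, one value per random split."""
    gaps = []
    for _ in range(n_repeats):
        train, test = stratified_split(rows, test_fraction, rng)
        weights = fit(train)
        test_auc = auc(predict(weights, test), [y for _, y in test])
        gaps.append(cross_validated_auc(train, k, rng) - test_auc)
    return gaps


rng = random.Random(0)
data = make_dataset(n_pos=109, n_neg=58, n_features=20, shift=0.4, rng=rng)
half = stratified_split(data, 0.5, rng)[1]  # 50% undersampled copy

for name, rows in (("full", data), ("undersampled", half)):
    abs_gaps = [abs(g) for g in generalization_gaps(rows, n_repeats=200, rng=rng)]
    print(f"{name}: mean |gap| = {statistics.fmean(abs_gaps):.3f}, "
          f"SD = {statistics.stdev(abs_gaps):.3f}, max = {max(abs_gaps):.3f}")
```

Comparing the two printed lines gives a rough sense of how much the gap between training and test AUC moves from one random split to the next, and whether it widens when the dataset is halved. To get closer to the study's setup, the stand-in scorer could be replaced with an L1-penalized logistic regression from a library such as scikit-learn.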

Funder

Korea Basic Science Institute

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

Cited by 40 articles.

1. Towards discovery and implementation of neurophysiologic biomarkers of Alzheimer’s disease using entropy methods;Neuroscience;2024-10

2. Houston, We Have AI Problem! Quality Issues with Neuroimaging‐Based Artificial Intelligence in Parkinson's Disease: A Systematic Review;Movement Disorders;2024-09-05

3. Radiomics of multi-parametric MRI for the prediction of lung metastasis in soft-tissue sarcoma: a feasibility study;Cancer Imaging;2024-09-05

4. Radiomics features outperform standard radiological measurements in detecting femoroacetabular impingement on three‐dimensional magnetic resonance imaging;Journal of Orthopaedic Research;2024-08-11

5. Using machine learning models to estimate Escherichia coli concentration in an irrigation pond from water quality and drone-based RGB imagery data;Water Research;2024-08