Authors:
Constantin Aliferis, Gyorgy Simon
Abstract
Avoiding overfitted and underfitted (OF, UF) analyses and models is critical for achieving the highest possible generalization performance and is of profound importance for the success of ML/AI modeling. In modern ML/AI practice, OF/UF typically interact with error-estimation procedures and model selection, as well as with sampling and reporting biases, and thus need to be considered together in context. The more general situations of overconfidence (OC) about models and/or under-performing (UP) models can arise in many subtle and not-so-subtle ways, especially in the presence of high-dimensional data, modest or small sample sizes, powerful learners, and imperfect data designs. Because over/underconfidence about models is closely related to model complexity, model selection, error estimation, and sampling (as part of data design), we connect these concepts with the material of the chapters "An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science," "Data Design," and "Evaluation." These concepts are also closely related to statistical significance and scientific reproducibility. We examine several common scenarios in which overconfidence in model performance and/or model under-performance occurs, as well as detailed practices for preventing, detecting, and correcting these problems.
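The overconfidence the abstract describes can be made concrete with a minimal sketch (not from the chapter itself): a 1-nearest-neighbor classifier memorizes its training data, so its resubstitution (training-set) accuracy on purely random labels is perfect, while its accuracy on a held-out sample stays near chance. All data and names here are hypothetical, and the example uses only the Python standard library.

```python
import random

random.seed(0)

def make_data(n, d):
    # Synthetic data: features carry no signal about the labels,
    # so no classifier can truly do better than chance (0.5).
    X = [[random.random() for _ in range(d)] for _ in range(n)]
    y = [random.randint(0, 1) for _ in range(n)]
    return X, y

def dist2(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def one_nn_predict(X_train, y_train, x):
    # 1-NN: return the label of the closest training point
    # (pure memorization of the training sample).
    i = min(range(len(X_train)), key=lambda j: dist2(X_train[j], x))
    return y_train[i]

def accuracy(X_train, y_train, X_eval, y_eval):
    hits = sum(one_nn_predict(X_train, y_train, x) == y
               for x, y in zip(X_eval, y_eval))
    return hits / len(y_eval)

X_tr, y_tr = make_data(200, 5)
X_te, y_te = make_data(200, 5)

# Resubstitution estimate: evaluated on the very data that was memorized.
train_acc = accuracy(X_tr, y_tr, X_tr, y_tr)
# Held-out estimate: evaluated on an independent sample.
test_acc = accuracy(X_tr, y_tr, X_te, y_te)

print(f"resubstitution accuracy: {train_acc:.2f}")  # perfect, by memorization
print(f"held-out accuracy:       {test_acc:.2f}")   # near chance (0.5)
```

The gap between the two estimates is exactly the kind of overconfidence the chapter warns about: reporting the resubstitution number would suggest a flawless model where in fact no learnable signal exists.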
Publisher
Springer International Publishing
Cited by
5 articles.