Comparing penalization methods for linear models on large observational health data

Published: 2024-05-20 Issue: 7 Volume: 31 Page: 1514-1521
ISSN: 1067-5027
Container-title: Journal of the American Medical Informatics Association
Language: en

Author:

Fridgeirsson Egill A¹, Williams Ross¹, Rijnbeek Peter¹, Suchard Marc A²³, Reps Jenna M¹⁴

Affiliation:

1. Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands

2. Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095-1772, United States

3. VA Informatics and Computing Infrastructure, United States Department of Veterans Affairs, Salt Lake City, UT 84148, United States

4. Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ 08560, United States

Abstract

Objective: This study evaluates regularization variants in logistic regression (L1, L2, ElasticNet, Adaptive L1, Adaptive ElasticNet, Broken adaptive ridge [BAR], and Iterative hard thresholding [IHT]) for discrimination and calibration performance, focusing on both internal and external validation.

Materials and Methods: We use data from 5 US claims and electronic health record databases and develop models for various outcomes in a major depressive disorder patient population. We externally validate all models in the other databases. We use a train-test split of 75%/25% and evaluate performance with discrimination and calibration. Statistical analysis for difference in performance uses Friedman’s test and critical difference diagrams.

Results: Of the 840 models we develop, L1 and ElasticNet emerge as superior in both internal and external discrimination, with a notable AUC difference. BAR and IHT show the best internal calibration, without a clear external calibration leader. ElasticNet typically has larger model sizes than L1. Methods like IHT and BAR, while slightly less discriminative, significantly reduce model complexity.

Conclusion: L1 and ElasticNet offer the best discriminative performance in logistic regression for healthcare predictions, maintaining robustness across validations. For simpler, more interpretable models, L0-based methods (IHT and BAR) are advantageous, providing greater parsimony and calibration with fewer features. This study aids in selecting suitable regularization techniques for healthcare prediction models, balancing performance, complexity, and interpretability.
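
As a rough illustration of why the L0-based methods tend to yield smaller models than L1, the sketch below contrasts the two thresholding operators that underlie these penalties in proximal-gradient style solvers. It is not the study's implementation, which the abstract does not describe. The function names, the toy coefficient vector, and the values of `lam` and `k` are assumptions made up for illustration only.

```python
def soft_threshold(coefs, lam):
    """L1 proximal step: shrink every coefficient toward zero by lam.

    Coefficients with magnitude <= lam become exactly zero; the rest
    survive but are all shrunk by the same amount.
    """
    out = []
    for c in coefs:
        if abs(c) <= lam:
            out.append(0.0)
        elif c > 0:
            out.append(c - lam)
        else:
            out.append(c + lam)
    return out


def hard_threshold(coefs, k):
    """IHT-style projection: keep the k largest-magnitude coefficients
    unchanged and set all others to zero."""
    order = sorted(range(len(coefs)), key=lambda i: abs(coefs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coefs)]


# Hypothetical coefficients, not taken from the study.
toy = [2.0, -0.25, 0.75, -1.5, 0.125]

print(soft_threshold(toy, 0.5))  # [1.5, 0.0, 0.25, -1.0, 0.0]
print(hard_threshold(toy, 2))    # [2.0, 0.0, 0.0, -1.5, 0.0]
```

In this toy case soft thresholding keeps three features and shrinks all of them, while hard thresholding fixes the model size at `k` and leaves the surviving coefficients unshrunk. That is one plausible intuition for the parsimony reported for IHT and BAR in the Results.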

Funder

Innovative Medicines Initiative 2 Joint Undertaking

European Union’s Horizon 2020

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/jamia/article-pdf/31/7/1514/58243644/ocae109.pdf
