Optimizer’s dilemma: optimization strongly influences model selection in transcriptomic prediction

Published: 2024-01-01 Issue: 1 Volume: 4
ISSN: 2635-0041
Container-title: Bioinformatics Advances
Language: en

Author:

Crawford Jake¹, Chikina Maria², Greene Casey S³⁴

Affiliation:

1. Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States

2. Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, United States

3. Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, United States

4. Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, United States

Abstract

Motivation: Most models can be fit to data using various optimization approaches. While model choice is frequently reported in machine-learning-based research, optimizers are not often noted. We applied two implementations of LASSO logistic regression from Python’s scikit-learn package, each using a different optimization approach (coordinate descent, implemented in the liblinear library, and stochastic gradient descent, or SGD), to predict mutation status and gene essentiality from gene expression across a variety of pan-cancer driver genes. For varying levels of regularization, we compared performance and model sparsity between optimizers.

Results: After model selection and tuning, we found that liblinear and SGD tended to perform comparably. liblinear models required more extensive tuning of regularization strength, performing best for high model sparsities (fewer nonzero coefficients), but did not require selection of a learning rate parameter. SGD models required tuning of the learning rate to perform well, but generally performed more robustly across different model sparsities as regularization strength decreased. Given these tradeoffs, we believe that the choice of optimizers should be clearly reported as a part of the model selection and validation process, to allow readers and reviewers to better understand the context in which results have been generated.

Availability and implementation: The code used to carry out the analyses in this study is available at https://github.com/greenelab/pancancer-evaluation/tree/master/01_stratified_classification. Performance/regularization strength curves for all genes in the Vogelstein et al. (2013) dataset are available at https://doi.org/10.6084/m9.figshare.22728644.

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/bioinformaticsadvances/advance-article-pdf/doi/10.1093/bioadv/vbae004/56411343/vbae004.pdf

Cited by 1 article.

1. Reconstruction of Eriocheir sinensis Protein–Protein Interaction Network Based on DGO-SVM Method;Current Issues in Molecular Biology;2024-07-12