The ability to classify patients based on gene-expression data varies by algorithm and performance metric

Published:2022-03-11 Issue:3 Volume:18 Page:e1009926
ISSN:1553-7358
Container-title:PLOS Computational Biology
language:en
Short-container-title:PLoS Comput Biol

Author:

Piccolo Stephen R., Mecham Avery, Golightly Nathan P., Johnson Jérémie L., Miller Dustin B.

Abstract

By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist—and most support diverse hyperparameters—so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 52 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection using nested cross validation. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms typically outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
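
The nested cross-validation design mentioned in the abstract can be sketched as follows. This is a minimal, standard-library-only illustration, not the authors' pipeline: the synthetic data, the nearest-centroid classifier, the standardized mean-difference gene ranking, and all function names (`make_data`, `top_genes`, `nested_cv`, and so on) are assumptions chosen for brevity. What it shows is that the number of selected genes is tuned only on inner folds of each outer training set, so the outer test fold never influences feature selection or hyperparameter choice.

```python
import random
from statistics import mean, pstdev


def make_data(n=60, n_genes=200, n_informative=5, seed=0):
    """Synthetic stand-in for expression data (hypothetical, not from the paper).
    The first n_informative genes are shifted upward in class 1."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(n_genes)]
        for g in range(n_informative):
            row[g] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y


def folds(n, k, seed):
    """Split indices 0..n-1 into k shuffled, roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def top_genes(X, y, k):
    """Univariate feature selection: rank genes by absolute
    standardized difference in class means and keep the top k."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        spread = pstdev(a) + pstdev(b) + 1e-9
        scores.append((abs(mean(a) - mean(b)) / spread, g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]


def fit_predict(Xtr, ytr, Xte, k):
    """Select k genes on the training data only, then classify
    test rows by the nearest class centroid."""
    genes = top_genes(Xtr, ytr, k)
    centroids = {}
    for lab in (0, 1):
        rows = [r for r, l in zip(Xtr, ytr) if l == lab]
        centroids[lab] = [mean(r[g] for r in rows) for g in genes]
    preds = []
    for r in Xte:
        dist = {lab: sum((r[g] - c) ** 2 for g, c in zip(genes, cen))
                for lab, cen in centroids.items()}
        preds.append(min(dist, key=dist.get))
    return preds


def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def cv_accuracy(X, y, k_genes, n_folds, seed):
    """Plain k-fold cross-validated accuracy for one hyperparameter value."""
    accs = []
    for test_idx in folds(len(X), n_folds, seed):
        held_out = set(test_idx)
        tr = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in tr], [y[i] for i in tr],
                            [X[i] for i in test_idx], k_genes)
        accs.append(accuracy(preds, [y[i] for i in test_idx]))
    return mean(accs)


def nested_cv(X, y, grid=(2, 5, 20), outer=5, inner=3):
    """Outer loop estimates performance; inner loop picks the number
    of genes using the outer training portion only."""
    outer_scores = []
    for f, test_idx in enumerate(folds(len(X), outer, seed=1)):
        held_out = set(test_idx)
        tr = [i for i in range(len(X)) if i not in held_out]
        Xtr, ytr = [X[i] for i in tr], [y[i] for i in tr]
        best_k = max(grid, key=lambda k: cv_accuracy(Xtr, ytr, k, inner, 100 + f))
        preds = fit_predict(Xtr, ytr, [X[i] for i in test_idx], best_k)
        outer_scores.append(accuracy(preds, [y[i] for i in test_idx]))
    return outer_scores


X, y = make_data()
scores = nested_cv(X, y)
print("outer-fold accuracies:", [round(s, 2) for s in scores])
```

Swapping in a different classifier or scoring function only requires changing `fit_predict` or `accuracy`; the fold structure stays the same, which is how a benchmark can hold other factors constant while varying the algorithm or the metric.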

Funder

Simmons Center for Cancer Research, Brigham Young University

Publisher

Public Library of Science (PLoS)

Subject

Computational Theory and Mathematics; Cellular and Molecular Neuroscience; Genetics; Molecular Biology; Ecology; Modeling and Simulation; Ecology, Evolution, Behavior and Systematics

References: 137 articles.

1. A New Initiative on Precision Medicine; FS Collins; N Engl J Med, 2015

2. Big Data And New Knowledge In Medicine: The Thinking, Training, And Tools Needed For A Learning Health System; HM Krumholz; Health Aff (Millwood), 2014

3. Predicting the Future—Big Data, Machine Learning, and Clinical Medicine; Z Obermeyer; N Engl J Med, 2016

4. The use and analysis of microarray data; A Butte; Nat Rev Drug Discov, 2002

Cited by 9 articles.

1. A Comprehensive Meta-Analysis of Breast Cancer Gene Expression; 2024-09-02

2. Secondary Analysis of Human Bulk RNA-Seq Dataset Suggests Potential Mechanisms for Letrozole Resistance in Estrogen-Positive (ER+) Breast Cancer; Current Issues in Molecular Biology; 2024-07-06

3. Development of a multigenomic liquid biopsy (PROSTest) for prostate cancer in whole blood; The Prostate; 2024-04-03

4. Optimizer’s dilemma: optimization strongly influences model selection in transcriptomic prediction; Bioinformatics Advances; 2024-01-01

5. Optimizer’s dilemma: optimization strongly influences model selection in transcriptomic prediction; 2023-06-26