Baseline Acute Myeloid Leukemia Prognosis Models using Transcriptomic and Clinical Profiles by Studying the Impacts of Dimensionality Reductions and Gene Signatures on Cox-Proportional Hazard
Author:
Sauvé Léonard, Hébert Josée, Sauvageau Guy, Lemieux Sébastien
Abstract
Extracting gene markers to evaluate risk in cancer can refine diagnosis and lead to adapted therapies and better survival. Such survival analyses can be carried out with Machine Learning (ML) algorithms such as the Cox proportional hazards (CPH) model applied to gene expression (GE) RNA-Seq data. However, optimal tuning of CPH on genome-wide GE data is challenging and has been poorly assessed so far. In this work, we interrogate an Acute Myeloid Leukemia (AML) dataset (Leucegene) to identify the key factors that degrade CPH performance and to characterize its sensitivity to those factors, with the aim of improving the system. We compare projection and selection data reduction techniques, mainly PCA and the LSC17 gene signature, in combination with CPH in an ML framework. Results reveal that CPH performs better with a combination of clinical and gene expression features. We determine that projections perform better than selections without clinical information. We ascertain that CPH is affected by overfitting and that this overfitting is linked to the number and the content of input covariables. We show that PCA links clinical features via its ability to learn directly from the input data and generalizes better than LSC17 on Leucegene. We postulate that projections are preferable to selections on harder tasks such as assessing risk in the intermediate subset of Leucegene. We extrapolate that these findings apply in the more general context of risk detection via machine learning in cancer. We see that higher-capacity models such as CPH-DNN systems can be improved via survival-derived projections and can combat overfitting through heavy regularization.
Author summary
This study investigates the feasibility of using gene expression to evaluate risk in cancer and compares projection and selection data reduction techniques. It uses the Leucegene dataset to compare PCA and a previously published 17-gene signature (LSC17), each in combination with the Cox proportional hazards model in a machine learning framework. Results showed that CPH was affected by overfitting and that this overfitting was linked to the number and the content of input covariables. The study found that PCA links clinical features via its ability to learn directly from the input data and generalizes better than LSC17 on Leucegene. The study concluded that projections are preferable to selections on harder tasks such as assessing risk in the intermediate subset of Leucegene, and that they can be tuned to improve their performance.
Data availability statement
Source code for pipelines and algorithms, as well as gene expression matrices, are available at https://github.com/lemieux-lab/dimensions_coxph. Access to the Leucegene cohort’s survival times can be granted upon request and following ethical review.
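The projection-versus-selection comparison described in the abstract can be illustrated with a minimal sketch. It is written independently of the authors' repository and is not their pipeline. Everything in it is an assumption made for illustration: the data are simulated (not Leucegene), the three "signature" gene indices are hypothetical (not LSC17), the L2 penalty stands in for the regularization the abstract mentions, and Harrell's concordance index is used as the evaluation metric. The Cox model is fitted with plain gradient descent on the penalized negative log partial likelihood, ignoring ties.

```python
import numpy as np

def pca_project(X_train, X_test, n_components):
    """Projection: fit PCA on training samples only, then project both sets."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal axes of the centered training matrix.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    W = Vt[:n_components].T
    return (X_train - mean) @ W, (X_test - mean) @ W

def select_genes(X_train, X_test, gene_idx):
    """Selection: keep a fixed subset of gene columns (a signature)."""
    return X_train[:, gene_idx], X_test[:, gene_idx]

def standardize(A_train, A_test):
    """Scale features with training-set statistics only."""
    mu, sd = A_train.mean(axis=0), A_train.std(axis=0)
    return (A_train - mu) / sd, (A_test - mu) / sd

def fit_cox(X, time, event, l2=1.0, lr=0.1, n_iter=500):
    """Gradient descent on the L2-penalized negative log partial likelihood."""
    order = np.argsort(-time)  # descending time: risk set of row i is rows 0..i
    X, event = X[order], event[order]
    beta = np.zeros(X.shape[1])
    n_events = max(event.sum(), 1)
    for _ in range(n_iter):
        w = np.exp(X @ beta)
        cum_w = np.cumsum(w)                        # sum of weights over each risk set
        cum_wx = np.cumsum(w[:, None] * X, axis=0)  # weighted covariate sums
        grad = -(event[:, None] * (X - cum_wx / cum_w[:, None])).sum(axis=0)
        beta -= lr * (grad / n_events + l2 * beta)
    return beta

def concordance_index(time, event, risk):
    """Harrell's C: share of comparable pairs where the higher risk fails first."""
    num, den = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue
        for j in range(len(time)):
            if time[i] < time[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1.0
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

# Simulated cohort (assumption): one latent factor drives 20 of 50 genes and the hazard.
rng = np.random.default_rng(0)
n, p = 300, 50
z = rng.normal(size=n)
loadings = np.zeros(p)
loadings[:20] = 1.0
X = np.outer(z, loadings) + rng.normal(size=(n, p))
t_event = rng.exponential(1.0 / np.exp(z))   # hazard proportional to exp(z)
t_censor = rng.exponential(2.0, size=n)      # independent censoring
time = np.minimum(t_event, t_censor)
event = t_event <= t_censor
train, test = np.arange(200), np.arange(200, 300)

reductions = {
    "projection (PCA, 3 components)": pca_project(X[train], X[test], 3),
    "selection (3 hypothetical genes)": select_genes(X[train], X[test], [0, 1, 2]),
}
results = {}
for name, (A_tr, A_te) in reductions.items():
    A_tr, A_te = standardize(A_tr, A_te)
    beta = fit_cox(A_tr, time[train], event[train])
    results[name] = concordance_index(time[test], event[test], A_te @ beta)
    print(f"{name}: test C-index = {results[name]:.3f}")
```

Both reductions are fitted on the training split only and scored on held-out samples, which is what makes any overfitting visible. Varying `n_components`, the size of the selected gene set, or `l2` is one way to probe the sensitivity to the number of input covariables that the abstract reports. The relative ranking on simulated data says nothing about the ranking on Leucegene.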
Publisher
Cold Spring Harbor Laboratory