Author:
Sedighi-Maman Zahra, Heath Jonathan J.
Abstract
Although lung cancer survival status and survival length predictions have primarily been studied individually, a scheme that leverages both fields in an interpretable way for physicians remains elusive. We propose a two-phase data analytic framework that is capable of classifying survival status for 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year time-points (phase I) and predicting the number of survival months within 3 years (phase II) using recent Surveillance, Epidemiology, and End Results (SEER) data from 2010 to 2017. In this study, we employ three analytical models (general linear model (GLM), extreme gradient boosting, and artificial neural networks), five data balancing techniques (synthetic minority oversampling technique (SMOTE), relocating safe level SMOTE, borderline SMOTE, adaptive synthetic sampling, and majority weighted minority oversampling technique), two feature selection methods (least absolute shrinkage and selection operator (LASSO) and random forest), and the one-hot encoding approach. By implementing a comprehensive data preparation phase, we demonstrate that a computationally efficient and interpretable method such as GLM performs comparably to more complex models. Moreover, we quantify the effects of individual features in phases I and II by exploiting GLM coefficients. To the best of our knowledge, this study is the first to (a) implement a comprehensive data processing approach to develop performant, computationally efficient, and interpretable methods in comparison to black-box models, (b) visualize top factors impacting survival odds by utilizing the change in odds ratio, and (c) comprehensively explore short-term lung cancer survival using a two-phase approach.
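As an illustration of how GLM coefficients can be read as changes in survival odds, the following is a minimal sketch, assuming the phase I classifier is a logistic (binomial) GLM; the abstract does not state the link function. Under that assumption, a coefficient beta corresponds to an odds ratio of exp(beta) for a one-unit increase in the feature. The feature names and coefficient values are hypothetical and are not results from the study.

```python
import math


def odds_ratio(beta: float) -> float:
    """Odds ratio implied by a logistic-regression coefficient."""
    return math.exp(beta)


def percent_change_in_odds(beta: float) -> float:
    """Percent change in the odds for a one-unit increase in the feature."""
    return (math.exp(beta) - 1.0) * 100.0


# Hypothetical coefficients for one-hot encoded features
# (illustrative only, not values reported by the authors).
coefficients = {
    "feature_a": 0.40,
    "feature_b": -0.25,
    "feature_c": 0.00,
}

# Rank features by the magnitude of their effect on the log-odds.
ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, beta in ranked:
    print(
        f"{name}: OR={odds_ratio(beta):.3f}, "
        f"change in odds={percent_change_in_odds(beta):+.1f}%"
    )
```

For a one-hot encoded feature, the one-unit increase is the switch from the reference category to the encoded category, so the odds ratio compares those two groups with all other features held fixed.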
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry
Cited by
1 article.