Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers

Published:2023-01 Issue: Volume:22 Page:117693512311679
ISSN:1176-9351
Container-title:Cancer Informatics
language:en
Short-container-title:Cancer Inform

Author:

C Lavanya¹, S Pooja¹, Kashyap Abhay H², Rahaman Abdur², Niranjan Swarna³, Niranjan Vidya¹

Affiliation:

1. Department of Biotechnology, RV College of Engineering, Bengaluru, Karnataka, India

2. Department of Computer Science and Engineering, RV College of Engineering, Bengaluru, Karnataka, India

3. Department of AIML, RV College of Engineering, Bengaluru, Karnataka, India

Abstract

Lung cancer is considered the most common and the deadliest cancer type. It is mainly of 2 types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for about 85% of cases, while SCLC accounts for only about 14%. Over the last decade, functional genomics has emerged as a revolutionary tool for studying genetics and uncovering changes in gene expression. RNA-Seq has been applied to investigate the rare and novel transcripts that aid in discovering genetic changes that occur in tumours due to different lung cancers. Although RNA-Seq helps to understand and characterise the gene expression involved in lung cancer diagnostics, discovering the biomarkers remains a challenge. Classification models help uncover and classify the biomarkers based on gene expression levels across the different lung cancers. The current research concentrates on computing transcript statistics from gene transcript files with a normalised fold change of genes and identifying quantifiable differences in gene expression levels between the reference genome and lung cancer samples. The collected data were analysed, and machine learning models were developed to classify genes as causing NSCLC, causing SCLC, causing both or causing neither. An exploratory data analysis was performed to identify the probability distribution and principal features. Because only a limited number of features were available, all of them were used in predicting the class. To address the imbalance in the dataset, the under-sampling algorithm Near Miss was applied. For classification, the research primarily focused on 4 supervised machine learning algorithms (Logistic Regression, KNN classifier, SVM classifier and Random Forest classifier); additionally, 2 ensemble algorithms were considered (XGBoost and AdaBoost). Of these, based on the weighted metrics considered, the Random Forest classifier, with 87% accuracy, was the best-performing algorithm and was therefore used to predict the biomarkers causing NSCLC and SCLC. After fine-tuning, it gave a precision of 91.3% and a recall of 91%. The imbalance and limited features in the dataset restrict any further improvement in the model's accuracy or precision. In the present study, using the gene expression values (LogFC, P Value) as the feature set in the Random Forest classifier, BRAF, KRAS, NRAS and EGFR are predicted to be possible biomarkers causing NSCLC, and ATF6, ATF3, PGDFA, PGDFD, PGDFC and PIP5K1C are predicted to be possible biomarkers causing SCLC, from the transcriptome analysis. Some of the common biomarkers predicted for NSCLC and SCLC were CDK4, CDK6, BAK1, CDKN1A and DDB2.
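
The two evaluation-related steps named in the abstract, Near Miss under-sampling and weighted metrics, can be illustrated with a small standard-library sketch. This is not the authors' code: the function names, the choice of the NearMiss-1 variant, the multi-class handling (every larger class is shrunk to the size of the smallest one) and the toy (LogFC, P value) rows are all assumptions made for illustration.

```python
from collections import Counter
from math import dist


def near_miss_1(X, y, k=3):
    """NearMiss-1 style under-sampling (illustrative, not the paper's code).

    Every class larger than the smallest one is reduced to the smallest
    class's size, keeping the samples whose mean distance to their k
    nearest smallest-class samples is lowest.
    """
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_keep = counts[minority]
    minority_pts = [x for x, label in zip(X, y) if label == minority]

    kept = [i for i, label in enumerate(y) if label == minority]
    for cls in counts:
        if cls == minority:
            continue
        scored = []
        for i, (x, label) in enumerate(zip(X, y)):
            if label != cls:
                continue
            nearest = sorted(dist(x, m) for m in minority_pts)[:k]
            scored.append((sum(nearest) / len(nearest), i))
        kept.extend(i for _, i in sorted(scored)[:n_keep])
    kept.sort()  # preserve the original row order
    return [X[i] for i in kept], [y[i] for i in kept]


def weighted_precision_recall(y_true, y_pred):
    """Per-class precision and recall, averaged with class support as weights."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        cls_precision = tp / predicted if predicted else 0.0
        precision += cls_precision * n / total
        recall += (tp / n) * n / total
    return precision, recall


# Toy rows of (LogFC, P value); the values and labels are invented.
X_toy = [(2.1, 0.01), (2.3, 0.02), (0.2, 0.40), (0.1, 0.55), (1.9, 0.03), (0.3, 0.60)]
y_toy = ["NSCLC", "NSCLC", "neither", "neither", "neither", "neither"]
X_bal, y_bal = near_miss_1(X_toy, y_toy, k=2)
print(Counter(y_bal))
```

With support weighting, frequent classes dominate the averaged score, which is why under-sampling and weighted metrics are usually reported together on imbalanced data. Weighted recall computed this way equals overall accuracy.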

Publisher

SAGE Publications

Subject

Cancer Research, Oncology

Link

http://journals.sagepub.com/doi/pdf/10.1177/11769351231167992

Reference 63 articles.

1. The biology and management of non-small cell lung cancer

2. Varenicline in the treatment of tobacco dependence

3. Non–Small Cell Lung Cancer Radiogenomics Map Identifies Relationships between Molecular and Imaging Phenotypes with Prognostic Implications

4. RNA sequencing: new technologies and applications in cancer research

5. A comprehensive in-silico computational analysis of twenty cancer exome datasets and identification of associated somatic variants reveals potential molecular markers for detection of varied cancer types

Cited by 10 articles.

1. Exosome-Machine Learning Integration in Biomedicine: Advancing Diagnosis and Biomarker Discovery;Current Medicinal Chemistry;2024-08-20

2. Specific association of MTHFD1 expressions with small cell lung cancer development and chemoradiotherapy outcome;Saudi Medical Journal;2024-07-28

3. Cancer Diagnosis by Gene-Environment Interactions via Combination of SMOTE-Tomek and Overlapped Group Screening Approaches with Application to Imbalanced TCGA Clinical and Genomic Data;Mathematics;2024-07-15

4. Groundwater quality assessment using machine learning models: a comprehensive study on the industrial corridor of a semi-arid region;Environmental Science and Pollution Research;2024-07-04

5. Development and evaluation of a chronic kidney disease risk prediction model using random forest;Frontiers in Genetics;2024-06-27