Abstract
Cancer, in any of its forms, remains a significant public health concern worldwide. Advances in early detection and treatment have contributed to a decline in the overall cancer death rate in recent decades. Tumor prediction and classification therefore play an important role in fighting cancer. This study built computational models for a joint analysis of RNA-seq, copy number variation (CNV), and DNA methylation data to classify normal and tumor samples across liver cancer, breast cancer, and colon adenocarcinoma from The Cancer Genome Atlas (TCGA) dataset. A total of 18 machine learning methods were evaluated on AUC, precision, recall, and F-measure. In addition, five techniques were compared for mitigating class imbalance in the cancer datasets; the Synthetic Minority Oversampling Technique (SMOTE) demonstrated the best performance. The results indicate that the model applying Stochastic Gradient Descent (SGD) to learn a binary-class SVM with hinge loss achieves the best classification results on the liver cancer and breast cancer datasets, with accuracy over 99% and AUC greater than or equal to 0.999. For the colon adenocarcinoma dataset, both SGD and Sequential Minimal Optimization (SMO), which implements John Platt’s sequential minimal optimization algorithm for training a support vector machine, show outstanding classification performance, with accuracy of 100% and AUC, precision, recall, and F-measure all at 1.000.
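As an illustration of the pipeline the abstract describes (oversampling the minority class, training a linear SVM with hinge loss by SGD, then scoring with precision, recall, and F-measure), the following is a minimal sketch in pure Python. It is not the study's implementation: the function names (`smote_like`, `train_sgd_hinge_svm`, `precision_recall_f1`), the hyperparameters, and the two-feature synthetic data are all assumptions made for illustration, and the oversampler is a simplified SMOTE-style interpolation rather than a full SMOTE implementation.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Simplified SMOTE-style oversampling (illustrative, hypothetical helper).

    Each synthetic point is a random interpolation between a minority
    sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - c) ** 2 for a, c in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()
        synthetic.append([a + t * (c - a) for a, c in zip(base, other)])
    return synthetic


def train_sgd_hinge_svm(X, y, epochs=50, lr=0.01, lam=1e-4, seed=0):
    """Linear SVM trained by SGD on the L2-regularized hinge loss.

    Labels must be in {-1, +1}. Hyperparameters are assumed values.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Gradient step on the L2 penalty (shrinks the weights).
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Hinge loss has a non-zero subgradient only when margin < 1.
            if margin < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def predict(w, b, X):
    """Predict +1 or -1 from the sign of the decision function."""
    return [1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1 for x in X]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F-measure for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Synthetic, imbalanced toy data (NOT TCGA): 40 "tumor" (+1) vs 8 "normal" (-1).
rng = random.Random(42)
tumor = [[rng.gauss(2.0, 0.4), rng.gauss(2.0, 0.4)] for _ in range(40)]
normal = [[rng.gauss(-2.0, 0.4), rng.gauss(-2.0, 0.4)] for _ in range(8)]

# Balance the classes by adding 32 synthetic minority samples.
normal_balanced = normal + smote_like(normal, n_new=32)

X = tumor + normal_balanced
y = [1] * len(tumor) + [-1] * len(normal_balanced)

w, b = train_sgd_hinge_svm(X, y)
precision, recall, f1 = precision_recall_f1(y, predict(w, b, X))
```

The sketch scores the model on its own training data only to keep the example short; a real evaluation would use held-out data or cross-validation, and would oversample only the training folds so that synthetic samples never leak into the test set.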
Publisher
Public Library of Science (PLoS)