Author:
Dong Yunyun,Yang Wenkai,Wang Jiawen,Zhao Juanjuan,Qiang Yan
Abstract
Effective cancer treatment requires a clear subtype. Due to the small sample size, high dimensionality, and class imbalances of cancer gene data, classifying cancer subtypes by traditional machine learning methods remains challenging. The gcForest algorithm is a combination of machine learning methods and a deep neural network and has been indicated to achieve better classification of small samples of data. However, the gcForest algorithm still faces many challenges when this method is applied to the classification of cancer subtypes. In this paper, we propose an improved gcForest algorithm (MLW-gcForest) to study the applicability of this method to the small sample sizes, high dimensionality, and class imbalances of genetic data. The main contributions of this algorithm are as follows: (1) Different weights are assigned to different random forests according to the classification ability of the forests. (2) We propose a sorting optimization algorithm that assigns different weights to the feature vectors generated under different sliding windows. The MLW-gcForest model is trained on the methylation data of five data sets from the cancer genome atlas (TCGA). The experimental results show that the MLW-gcForest algorithm achieves high accuracy and area under curve (AUC) values for the classification of cancer subtypes compared with those of traditional machine learning methods and state of the art methods. The results also show that methylation data can be effectively used to diagnose cancer.
Funder
National Natural Science Foundation of China
Subject
Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science
Cited by
13 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献