Affiliation:
1. Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan,
Shandong 250061, China
Abstract
Background:
In recent years, the identification of biomarkers or signatures based on gene expression profiling data has attracted much attention in bioinformatics. The successful discovery of breast cancer (BRCA) biomarkers would support early detection and thereby help reduce the risk that BRCA poses to patients.
Methods:
This paper proposes an Ensemble Feature Selection method to screen biomarkers (abbreviated as EFSmarker) for BRCA from publicly available gene expression data. Firstly, we employ twelve filter feature selection methods, namely median, variance, Chi-square, Relief, Pearson and Spearman correlation, mutual information, minimal-redundancy-maximal-relevance criterion, ridge regression, decision tree and random forest with Gini index and accuracy index, to calculate the importance (weights or coefficients) of all features on the training dataset. Secondly, we apply the logistic regression classifier on the test dataset to calculate the classification AUC value of the feature subset selected by each of the twelve methods. Thirdly, we provide an ensemble feature selection method by aggregating feature importance with classification AUC value. In particular, we establish a feature importance score (FIS) to evaluate the importance of each feature across all feature selection methods. Finally, the features with higher FIS are taken as identified biomarkers.
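The aggregation step can be illustrated with a minimal sketch. The abstract does not give the exact FIS formula, so the form used here is an assumption: each method's importance values are taken in absolute value, min-max normalized to [0, 1], and then averaged with weights proportional to that method's AUC. The function names, method labels, gene labels and all numbers are hypothetical and are not taken from the EFSmarker source code.

```python
def min_max_normalize(scores):
    """Rescale importance scores to [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def feature_importance_score(importances, aucs):
    """AUC-weighted average of normalized importances (assumed form of FIS).

    importances: dict mapping method name -> raw importance per feature
    aucs: dict mapping method name -> AUC of that method's selected subset
    """
    methods = list(importances)
    n_features = len(importances[methods[0]])
    total_auc = sum(aucs[m] for m in methods)
    fis = [0.0] * n_features
    for m in methods:
        # Assumption: the magnitude of a weight or coefficient is what matters.
        normalized = min_max_normalize([abs(v) for v in importances[m]])
        weight = aucs[m] / total_auc
        for j, value in enumerate(normalized):
            fis[j] += weight * value
    return fis


# Toy, made-up values: three hypothetical methods scoring four features.
genes = ["g1", "g2", "g3", "g4"]
importances = {
    "variance": [4.0, 1.0, 2.0, 0.0],
    "pearson": [0.9, -0.3, 0.5, 0.1],
    "random_forest_gini": [0.5, 0.1, 0.3, 0.1],
}
aucs = {"variance": 0.8, "pearson": 0.9, "random_forest_gini": 0.7}

fis = feature_importance_score(importances, aucs)
# Features with higher FIS come first, mirroring the final selection step.
ranking = sorted(zip(genes, fis), key=lambda pair: pair[1], reverse=True)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

Under this assumed weighting, a method whose selected subset classifies better contributes more to the final score, which is one plausible way to combine feature contribution with prediction accuracy.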
Results:
Guided by the FIS index produced by the EFSmarker method, 12 genes (COL10A1,
COL11A1, MMP11, LOC728264, FIGF, GJB2, INHBA, CD300LG, IGFBP6, PAMR1, CXCL2 and
FXYD1) are regarded as diagnostic biomarkers for BRCA. In particular, COL10A1, ranked first with an
FIS value of 0.663, is identified as the most credible biomarker. Gene and protein
expression validation, functional enrichment analysis, literature checking and independent dataset
validation support the effectiveness and efficiency of these selected biomarkers.
Conclusion:
Our proposed biomarker discovery strategy takes both feature contribution and prediction accuracy into account, and may also serve as a model for identifying unknown biomarkers for other diseases from high-throughput gene expression data. The source code and data are available at https://github.com/zpliulab/EFSmarker.
Funder
National Key Research and Development Program of China
National Natural Science Foundation of China
Shandong Provincial Key Research and Development Program
Natural Science Foundation of Shandong Province of China
Fundamental Research Funds for the Central Universities
Publisher
Bentham Science Publishers Ltd.
Subject
Computational Mathematics, Genetics, Molecular Biology, Biochemistry
Cited by
3 articles.