Identifying Diagnostic Biomarkers of Breast Cancer Based on Gene Expression Data and Ensemble Feature Selection-Reference-Cited by-同舟云学术

Identifying Diagnostic Biomarkers of Breast Cancer Based on Gene Expression Data and Ensemble Feature Selection

Published:2023-03 Issue:3 Volume:18 Page:232-246
ISSN:1574-8936
Container-title:Current Bioinformatics
language:en
Short-container-title:CBIO

Author:

Li Lingyu¹^ORCID,Algabri Yousif A.¹^ORCID,Liu Zhi-Ping¹^ORCID

Affiliation:

1. Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China

Abstract

Background: In recent years, the identification of biomarkers or signatures based on gene expression profiling data has attracted much attention in bioinformatics. The successful discovery of breast cancer (BRCA) biomarkers will be beneficial in reducing the risk of BRCA among patients for early detection. Methods: This paper proposes an Ensemble Feature Selection method to screen biomarkers (abbreviat-ed as EFSmarker) for BRCA from publically available gene expression data. Firstly, we employ twelve filter feature selection methods, namely median, variance, Chi-square, Relief, Pearson and Spearman correlation, mutual information, minimal-redundancy-maximal-relevance criterion, ridge regression, decision tree and random forest with Gini index and accuracy index, to calculate the importance (weights or coefficients) of all features on the training dataset. Secondly, we apply the logistic regres-sion classifier on the test dataset to calculate the classification AUC value of each feature subset indi-vidually selected by twelve methods. Thirdly, we provide an ensemble feature selection method by ag-gregating feature importance with classification AUC value. In particular, we establish a feature im-portance score (FIS) to evaluate the importance of each feature underlying all feature selection methods. Finally, the features with higher FIS are taken as identified biomarkers. Results: With the direction of the FIS index induced by the EFSmarker method, 12 genes (COL10A1, COL11A1, MMP11, LOC728264, FIGF, GJB2, INHBA, CD300LG, IGFBP6, PAMR1, CXCL2 and FXYD1) are regarded as diagnostic biomarkers for BRCA. Especially, COL10A1, ranked first with a FIS value of 0.663, is identified as the most credible biomarker. The findings justified via gene and protein expression validation, functional enrichment analysis, literature checking and independent dataset validation verify the effectiveness and efficiency of these selected biomarkers. Conclusion: Our proposed biomarker discovery strategy not only utilizes the feature contribution but also considers the prediction accuracy simultaneously, which may also serve as a model for identifying unknown biomarkers for other diseases from high-throughput gene expression data. The source code and data are available at https://github.com/zpliulab/EFSmarker.

Funder

National Key Research and Development Program of China

National Natural Science Foundation of China

Shandong Provincial Key Research and Development Program

Natural Science Foundation of Shandong Province of China

Fundamental Research Funds for the Central Universities

Publisher

Bentham Science Publishers Ltd.

Subject

Computational Mathematics,Genetics,Molecular Biology,Biochemistry

Reference43 articles.

1. Huang H.; Hu J.; Maryam A.; Defining super-enhancer landscape in triple-negative breast cancer by multiomic profiling. Nat Commun 2021,12(1),2242