Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering

Author:

Seth Soumita (1, 2), Mallik Saurav (3, 4), Islam Atikul (5), Bhadra Tapas (2), Roy Arup (6), Singh Pawan Kumar (7), Li Aimin (8), Zhao Zhongming (4, 9)

Affiliation:

1. Department of Computer Science and Engineering, Future Institute of Engineering and Management, Narendrapur, Kolkata 700150, West Bengal, India

2. Department of Computer Science and Engineering, Aliah University, Kolkata 700160, West Bengal, India

3. Department of Environmental Health, Harvard T H Chan School of Public Health, Boston, MA 02115, USA

4. Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA

5. Department of Computer Science and Engineering, University of Kalyani, Kalyani 741235, West Bengal, India

6. Department of Computer Science and Engineering, Budge Budge Institute of Technology, Kolkata 700137, West Bengal, India

7. Department of Information Technology, Jadavpur University, Jadavpur University Second Campus, Plot No. 8, Salt Lake Bypass, LB Block, Sector III, Kolkata 700106, West Bengal, India

8. Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China

9. Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA

Abstract

Key areas of interest in the analysis of single-cell RNA sequencing (scRNA-seq) data include identifying known and novel cell types, representing cells, predicting cell fates, classifying tumor types, and studying cellular heterogeneity. Because the data are high-dimensional, cluster identification in single-cell sequencing data presents several difficulties. In this paper, we introduce a new framework that combines matrix imputation, minimum redundancy maximum relevance (MRMR) feature selection, and shrinkage clustering to discover gene signatures from scRNA-seq data. First, we pre-filtered the data for “drop-out” values and imputed only the values identified as drop-outs. Next, we applied MRMR feature selection to the imputed data and retained the top 100 features, ranked by their MRMR optimization scores, for downstream analysis. We then applied shrinkage clustering to the selected feature matrix to identify cell clusters using a global optimization approach. Finally, we used the Limma-Voom R tool, with voom normalization and an empirical Bayes test, to detect differentially expressed features at a false discovery rate (FDR) < 0.001. We also performed KEGG pathway and gene ontology enrichment analysis of the identified biomarkers using DAVID 6.8, conducted miRNA target detection for the top gene markers, and performed miRNA target gene interaction network analysis using the Cytoscape online tool. We then compared our 100 detected markers with the top 100 cluster-specific markers, ranked by FDR, that we previously detected in our latest published article, and found three markers in common, namely Cyp2b10, Mt1, and Alpi, along with 97 novel markers. Gene Set Enrichment Analysis (GSEA) of both marker sets also yielded similar outcomes.
In addition, a comparative study with another published method showed that our model detects more significant markers than that method. To assess the efficiency of our framework, we applied it to another dataset and identified 20 strongly significant up-regulated markers. We also compared different imputation methods and included an ablation study to show that every key phase of our framework is essential. In summary, our proposed integrated framework efficiently discovers strong, differentially expressed gene signatures as well as up-regulated markers in scRNA-seq data.
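
To make the feature selection step of the abstract more concrete, the following is a minimal Python sketch of greedy MRMR selection. It is an illustration under stated assumptions, not the authors' implementation: the function name `mrmr_select`, its parameters, and the synthetic data are hypothetical, and absolute Pearson correlation is used as a simple stand-in for the relevance and redundancy measure (MRMR is usually formulated with mutual information). The abstract does not say which variant or scoring form the paper uses.

```python
import numpy as np


def mrmr_select(X, y, k):
    """Greedy MRMR (difference form) over the columns of X.

    Assumption: |Pearson r| replaces mutual information as the
    relevance/redundancy measure. Returns the selected column indices
    in the order they were picked.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape

    # Standardize so that correlations reduce to scaled dot products.
    Xc = X - X.mean(axis=0)
    std = Xc.std(axis=0)
    std[std == 0] = 1.0  # constant columns end up with zero correlation
    Xs = Xc / std
    ys = (y - y.mean()) / (y.std() or 1.0)

    # Relevance: |corr(feature, target)| for every feature.
    relevance = np.abs(Xs.T @ ys) / n_samples

    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(n_features)
    while len(selected) < min(k, n_features):
        # Add |corr(feature, most recently selected feature)| to the running sum.
        last = Xs[:, selected[-1]]
        redundancy_sum += np.abs(Xs.T @ last) / n_samples
        # Score = relevance minus mean redundancy with the selected set.
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf  # never re-select a feature
        selected.append(int(np.argmax(score)))
    return selected


# Usage on synthetic data (hypothetical): rows are cells, columns are genes.
rng = np.random.default_rng(0)
cells, genes = 300, 50
expr = rng.normal(size=(cells, genes))
labels = (expr[:, 3] + expr[:, 7] > 0).astype(float)
print(mrmr_select(expr, labels, k=5))
```

In a pipeline like the one described, `k` would be 100 and `X` would be the imputed expression matrix. The choice of target `y` here is only for illustration.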

Funder

Cancer Prevention and Research Institute of Texas

Publisher

MDPI AG

Subject

General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)
