Abstract
The success of large-scale pre-trained language models in the Natural Language Processing (NLP) domain has encouraged their adoption in genomics and single-cell biology. Developing pre-trained models on the rapidly growing single-cell transcriptomic data helps to unravel the intricate language of cells. However, current single-cell pre-trained models primarily focus on learning gene and cell representations from extensive gene expression data and fail to fully comprehend the biological significance of the gene expression patterns and cell types they identify, which limits their interpretability and transferability. We propose scKEPLM, a knowledge-enhanced single-cell pre-training language model that integrates a biology knowledge graph into the single-cell transcriptome pre-training process. scKEPLM covers over 41 million single-cell RNA sequences and 8.9 million gene relations. Through parallel pre-training of single-cell transcriptome sequences and genetic knowledge, combined with a Gaussian cross-attention mechanism, scKEPLM precisely aligns cell semantics with genetic information to learn more accurate and comprehensive representations of single-cell transcriptomes. Knowledge enhancement improves scKEPLM's identification of important genes in cells and greatly enriches the understanding of cell function and disease mechanisms. scKEPLM achieves state-of-the-art performance on more than 12 downstream tasks, including gene annotation, cell annotation, and drug response prediction, demonstrating strong generalization and transferability. Further exploration of the model's interpretability demonstrates its adaptability to variations in gene expression patterns within cells under various physiological or pathological conditions.
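The abstract does not give the exact formulation of the Gaussian cross-attention, so the following is only a minimal PyTorch sketch of one plausible reading: cross-attention from transcriptome token states (queries) to knowledge-graph gene embeddings (keys/values), with an additive Gaussian penalty on the attention logits. The class name, the single learnable width parameter, and the use of squared embedding distance are illustrative assumptions, not the paper's specification.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianCrossAttention(nn.Module):
    # Hypothetical sketch: aligns cell-sequence tokens with gene
    # knowledge embeddings; not the authors' exact formulation.
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Assumed learnable Gaussian width (log-parameterized for positivity).
        self.log_sigma = nn.Parameter(torch.zeros(1))
        self.scale = dim ** -0.5

    def forward(self, cell_tokens: torch.Tensor, gene_knowledge: torch.Tensor) -> torch.Tensor:
        # cell_tokens:    (B, Lc, D) transcriptome token states
        # gene_knowledge: (B, Lg, D) gene embeddings from the knowledge graph
        q = self.q_proj(cell_tokens)
        k = self.k_proj(gene_knowledge)
        v = self.v_proj(gene_knowledge)
        logits = q @ k.transpose(-2, -1) * self.scale      # (B, Lc, Lg)
        # Gaussian bias: down-weight query/key pairs by squared embedding
        # distance, concentrating attention on semantically nearby genes.
        dist2 = torch.cdist(q, k).pow(2)                   # (B, Lc, Lg)
        sigma2 = self.log_sigma.exp().pow(2)
        attn = F.softmax(logits - dist2 / (2 * sigma2), dim=-1)
        return attn @ v                                    # (B, Lc, D)

Under these assumptions, the Gaussian term acts as a soft alignment prior: as sigma grows the layer reduces to standard cross-attention, while a small sigma forces each cell token to attend only to the genes closest to it in embedding space.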
Publisher
Cold Spring Harbor Laboratory