scMulan: a multitask generative pre-trained language model for single-cell analysis

Published: 2024-01-29

Author:

Bian Haiyang, Chen Yixin, Dong Xiaomin, Li Chen, Hao Minsheng, Chen Sijie, Hu Jinyi, Sun Maosong, Wei Lei, Zhang Xuegong

Abstract

Gene expression can be viewed as a form of cell language, with the underlying regulatory mechanisms acting as its biological grammar. Decoding this “language” is critical to understanding cellular functions and behaviors, but it presents significant challenges. Inspired by the success of large language models in natural language processing, several works have attempted to learn this biological language by pre-training large foundation models on single-cell transcriptomic data. In this study, we further enrich the pre-training paradigm by integrating abundant metadata and multiple pre-training tasks, and obtain scMulan, a multitask generative pre-trained language model tailored for single-cell analysis. We represent a cell as a structured cell sentence (c-sentence) by encoding its gene expression, metadata terms, and target tasks as words in the form of tuples, each consisting of entities and their corresponding values. We construct a unified generative framework to model the cell language on c-sentences and design three pre-training tasks to bridge the microscopic and macroscopic information within the c-sentences. scMulan has 368 million parameters and is pre-trained on 10 million single-cell transcriptomic profiles and their corresponding metadata. As a single model, scMulan can perform cell type annotation, batch integration, and conditional cell generation in a zero-shot manner, guided by different task prompts. scMulan can also be extended to novel tasks through fine-tuning. We have evaluated the effectiveness of scMulan on multiple downstream tasks. As a foundation model, scMulan is pre-trained to capture both the microscopic regulations and macroscopic patterns of gene expression, positioning it as a multifunctional and easily expandable tool for comprehensive single-cell analysis.
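The c-sentence idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Word` class, the `build_c_sentence` function, the word ordering (metadata, then expressed genes, then the task), the filtering of zero-expression genes, and the `<cell_generation>` task token are all hypothetical choices made for illustration. The sketch assumes each word pairs one entity with one value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

Value = Union[float, str]


@dataclass(frozen=True)
class Word:
    """One word of a c-sentence: an entity and its (optional) value."""
    entity: str
    value: Optional[Value] = None


def build_c_sentence(
    expression: Dict[str, float],
    metadata: Dict[str, str],
    task: str,
) -> List[Word]:
    """Encode one cell as a list of (entity, value) words.

    Assumed layout (hypothetical): metadata words first, then one word
    per expressed gene, then a task word with no value.
    """
    words = [Word(term, value) for term, value in metadata.items()]
    # Assumption: genes with zero expression are left out of the sentence.
    words += [Word(gene, level) for gene, level in expression.items() if level > 0]
    words.append(Word(task))
    return words


# Toy cell with made-up values, for illustration only.
expression = {"CD3E": 2.0, "MS4A1": 0.0, "CD8A": 1.5}
metadata = {"organ": "blood", "cell_type": "T cell"}
sentence = build_c_sentence(expression, metadata, "<cell_generation>")

for word in sentence:
    print(word.entity, word.value)
```

Keeping metadata, genes, and tasks in one word type is what lets a single sequence model treat them uniformly; a real system would additionally map entities and values to token and value embeddings.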

Publisher

Cold Spring Harbor Laboratory

References (27 articles)

1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.

2. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, Lacroix T, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023.

3. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023.

4. Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. 2021.

5. Language models are unsupervised multitask learners. OpenAI Blog. 2019.

Cited by 1 article.

1. Profiling cell identity and tissue architecture with single-cell and spatial transcriptomics. Nature Reviews Molecular Cell Biology. 2024-08-21.