Abstract
The rapid evolution of single-cell sequencing technologies has enabled precise transcriptomic profiling at the single-cell level, shedding light on the intricate heterogeneity within cellular populations. Despite these advances, the inherent diversity of cells and data challenges such as noise, batch effects, and sparsity underscore the pressing need for a unified model that learns and represents cellular states effectively. Single-cell Large Language Models (LLMs) have been developed to bridge this gap, yet they exhibit limited performance on human cells. This shortfall may stem from the confounding effects of training on data from diverse species, a choice driven partly by the limited number of cells available for any single species. Here, we compiled a dataset of approximately 100 million human cells, sequenced with multiple technologies, from human single-cell datasets of various file types deposited in public databases and websites. Leveraging these data, we developed CellFM, a robust single-cell foundation model with 800 million parameters, an eight-fold increase over the current largest single-species model. To enable training of CellFM on Huawei's MindSpore AI framework, we adopted RetNet, a Transformer architecture variant with linear complexity that balances efficiency and performance, as the backbone of our model. Our comprehensive experiments show that CellFM outperforms existing models across diverse applications, such as cell annotation, perturbation prediction, and gene function prediction.
Publisher
Cold Spring Harbor Laboratory