Abstract
Large language models like GPT have shown impressive performance on natural language tasks. Here, we present a novel method to directly adapt these pretrained models to a biological context, specifically single-cell transcriptomics, by representing gene expression data as text. Our Cell2Sentence approach converts each cell’s gene expression profile into a sequence of gene names ordered by expression level. We show that these gene sequences, which we term “cell sentences”, can be used to fine-tune causal language models like GPT-2. Critically, we find that natural language pretraining boosts model performance on cell sentence tasks. When fine-tuned on cell sentences, GPT-2 generates biologically valid cells when prompted with a cell type. Conversely, it can also accurately predict cell type labels when prompted with cell sentences. This demonstrates that language models fine-tuned using Cell2Sentence can gain a biological understanding of single-cell data, while retaining their ability to generate text. Our approach provides a simple, adaptable framework to combine natural language and transcriptomics using existing models and libraries. Our code is available at: https://github.com/vandijklab/cell2sentence-ft.
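As a rough illustration of the conversion described in the abstract, the sketch below turns a single cell's expression vector into a "cell sentence" by listing gene names in order of decreasing expression. The function name, the dropping of zero-count genes, and the `max_genes` cap are illustrative assumptions rather than the paper's exact preprocessing; see the linked repository for the authors' implementation.

```python
import numpy as np

def cell_to_sentence(expression, gene_names, max_genes=100):
    """Convert one cell's expression vector into a 'cell sentence':
    gene names ordered by decreasing expression (assumed sketch)."""
    expression = np.asarray(expression, dtype=float)
    # Indices sorted from highest to lowest expression
    order = np.argsort(-expression)
    # Drop genes with zero counts (assumption) and keep the top-ranked genes
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    return " ".join(ranked[:max_genes])

# Toy example: a cell measured over four genes
genes = ["CD3D", "MS4A1", "NKG7", "LYZ"]
counts = [5.0, 0.0, 2.0, 9.0]
print(cell_to_sentence(counts, genes))  # -> "LYZ CD3D NKG7"
```

The resulting space-separated gene sequences can then be tokenized and used to fine-tune a causal language model such as GPT-2, as the abstract describes.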
Publisher
Cold Spring Harbor Laboratory