Abstract
Large language models like GPT have shown impressive performance on natural language tasks. Here, we present a novel method to directly adapt these pretrained models to a biological context, specifically single-cell transcriptomics, by representing gene expression data as text. Our Cell2Sentence approach converts each cell’s gene expression profile into a sequence of gene names ordered by expression level. We show that these gene sequences, which we term “cell sentences”, can be used to fine-tune causal language models like GPT-2. Critically, we find that natural language pretraining boosts model performance on cell sentence tasks. When fine-tuned on cell sentences, GPT-2 generates biologically valid cells when prompted with a cell type. Conversely, it can also accurately predict cell type labels when prompted with cell sentences. This demonstrates that language models fine-tuned using Cell2Sentence can gain a biological understanding of single-cell data, while retaining their ability to generate text. Our approach provides a simple, adaptable framework to combine natural language and transcriptomics using existing models and libraries. Our code is available at: https://github.com/vandijklab/cell2sentence-ft.
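As a rough illustration of the conversion described in the abstract, the sketch below turns a single cell's expression vector into a "cell sentence" by listing gene names in order of decreasing expression. The function name, the dropping of zero-count genes, and the `max_genes` cap are illustrative assumptions rather than the paper's exact preprocessing; see the linked repository for the authors' implementation.

```python
import numpy as np

def cell_to_sentence(expression, gene_names, max_genes=100):
    """Convert one cell's expression vector into a 'cell sentence':
    gene names ordered by decreasing expression (assumed sketch)."""
    expression = np.asarray(expression, dtype=float)
    # Indices sorted from highest to lowest expression
    order = np.argsort(-expression)
    # Drop genes with zero counts (assumption) and keep the top-ranked genes
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    return " ".join(ranked[:max_genes])

# Toy example: a cell measured over four genes
genes = ["CD3D", "MS4A1", "NKG7", "LYZ"]
counts = [5.0, 0.0, 2.0, 9.0]
print(cell_to_sentence(counts, genes))  # -> "LYZ CD3D NKG7"
```

The resulting space-separated gene sequences can then be tokenized and used to fine-tune a causal language model such as GPT-2, as the abstract describes.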
Publisher
Cold Spring Harbor Laboratory