LucaOne: Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language

Published: 2024-05-14

Authors:

He Yong, Fang Pan, Shan Yongtao, Pan Yuanfei, Wei Yanhong, Chen Yichang, Chen Yihao, Liu Yi, Zeng Zhenyu, Zhou Zhan, Zhu Feng, Holmes Edward C., Ye Jieping, Li Jun, Shu Yuelong, Shi Mang, Li Zhaorong

Abstract

In recent years, significant advancements have been observed in the domain of Natural Language Processing (NLP) with the introduction of pre-trained foundational models, paving the way for utilizing similar AI technologies to interpret the language of biology. In this research, we introduce “LucaOne”, a novel pre-trained foundational model designed to integratively learn from the genetic and proteomic languages, encapsulating data from 169,861 species encompassing DNA, RNA, and proteins. This work illuminates the potential for creating a biological language model aimed at universal bioinformatics application. Remarkably, through few-shot learning, this model efficiently learns the central dogma of molecular biology and demonstrably outperforms competing models. Furthermore, in tasks requiring inputs of DNA, RNA, proteins, or a combination thereof, LucaOne exceeds the state-of-the-art performance using a streamlined downstream architecture, thereby providing empirical evidence and innovative perspectives on the potential of foundational models to comprehend complex biological systems.

Publisher

Cold Spring Harbor Laboratory
