Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

Published: 2021-04-05 Issue: 15 Volume: 118 Page: e2016239118
ISSN: 0027-8424
Container-title: Proceedings of the National Academy of Sciences
Language: en
Short-container-title: Proc Natl Acad Sci USA

Author:

Rives Alexander, Meier Joshua, Sercu Tom, Goyal Siddharth, Lin Zeming, Liu Jason, Guo Demi, Ott Myle, Zitnick C. Lawrence, Ma Jerry, Fergus Rob

Abstract

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.

Funder

National Science Foundation

Publisher

Proceedings of the National Academy of Sciences

Subject

Multidisciplinary

References: 80 articles.

1. Protein Structure Relationships Revealed by Mutational Analysis

2. Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus

3. Coordinated amino acid changes in homologous protein families

5. Distributional Structure

Cited by 1336 articles.

1. Language models can identify enzymatic binding sites in protein sequences;Computational and Structural Biotechnology Journal;2024-12

2. TCR-ESM: Employing protein language embeddings to predict TCR-peptide-MHC binding;Computational and Structural Biotechnology Journal;2024-12

3. DeepNeuropePred: A robust and universal tool to predict cleavage sites from neuropeptide precursors by protein language model;Computational and Structural Biotechnology Journal;2024-12

4. Structure-based protein and small molecule generation using EGNN and diffusion models: A comprehensive review;Computational and Structural Biotechnology Journal;2024-12

5. T4SEpp: A pipeline integrating protein language models to predict bacterial type IV secreted effectors;Computational and Structural Biotechnology Journal;2024-12