Using Deep Learning to Annotate the Protein Universe

Published: 2019-05-03

Author:

Bileschi Maxwell L., Belanger David, Bryant Drew, Sanderson Theo, Carter Brandon, Sculley D., DePristo Mark A., Colwell Lucy J.

Abstract

Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate one third of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. In this paper, we explore an alternative methodology based on deep learning that learns the relationship between unaligned amino acid sequences and their functional annotations across all 17929 families of the Pfam database. Using the Pfam seed sequences, we establish rigorous benchmark assessments that use both random and clustered data splits to control for potentially confounding sequence similarities between train and test sequences. Using Pfam full, we report convolutional networks that are significantly more accurate and computationally efficient than BLASTp, while learning sequence features such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space, allowing sequences from novel families to be accurately annotated. These results suggest deep learning models will be a core component of future protein function prediction tools.

Publisher

Cold Spring Harbor Laboratory

References: 56 articles.

1. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets;Nature Biotechnology,2017

2. Clustering huge protein sequence sets in linear time;Nature Communications,2018

3. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

4. Protein homology detection by HMM–HMM comparison;Bioinformatics,2004

Cited by 52 articles.

1. T Cell Receptor Protein Sequences and Sparse Coding: A Novel Approach to Cancer Classification;Communications in Computer and Information Science;2023-11-26

2. Learning sequence, structure, and function representations of proteins with language models;2023-11-26

3. Unsupervised Deep Learning Can Identify Protein Functional Groups from Unaligned Sequences;Genome Biology and Evolution;2023-05

4. Towards mechanistic models of mutational effects: Deep learning on Alzheimer’s Aβ peptide;Computational and Structural Biotechnology Journal;2023

5. Elucidating the functional roles of prokaryotic proteins using big data and artificial intelligence;FEMS Microbiology Reviews;2023-01