Affiliation:
1. Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
Abstract
Motivation
Given the widening gap between high-throughput sequence data and limited functional insights, computational protein function annotation offers a high-throughput alternative to experimental approaches. However, current methods either have limited applicability because they rely on protein data beyond sequences, or generalize poorly to novel sequences, species and functions.
Results
To overcome the aforementioned barriers in applicability and generalizability, we propose a novel deep learning model that uses only sequence information for proteins, named Transformer-based protein function Annotation through joint sequence–Label Embedding (TALE). For generalizability to novel sequences, we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions (tail labels), we embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (1D sequences) in a joint latent space. TALE+, which combines TALE with a sequence similarity-based method, outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method that uses network information in addition to sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins with low similarity to the training data, to new species, and to rarely annotated functions, revealing deep insights into the protein sequence–function relationship. Ablation studies elucidated the contributions of algorithmic components to accuracy and generalizability, and a GO term-centric analysis is also provided.
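The joint sequence-label embedding idea described above can be illustrated with a toy sketch. This is not the TALE implementation: the single-head attention with identity projections, the mean pooling, the embedding size, the random untrained weights, the example sequence, and the placeholder label names (`term_A`, `term_B`, `term_C`) are all assumptions made for illustration. It omits positional encodings and the GO hierarchy, and shows only how a sequence vector and label vectors living in the same space can be compared to score every label.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_matrix(rows, cols, rng):
    """Random (untrained) embedding table; a real model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention (identity Q/K/V projections)."""
    d = len(tokens[0])
    attended = []
    for query in tokens:
        weights = softmax([dot(query, key) / math.sqrt(d) for key in tokens])
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(d)]
        )
    return attended


def embed_sequence(seq, residue_table):
    """Map a protein sequence to one vector: embed residues, attend, mean-pool."""
    tokens = [residue_table[AMINO_ACIDS.index(aa)] for aa in seq]
    attended = self_attention(tokens)
    d = len(attended[0])
    return [sum(tok[j] for tok in attended) / len(attended) for j in range(d)]


def score_labels(seq_vec, label_table):
    """Score each label by a sigmoid of its dot product with the sequence vector."""
    return [1.0 / (1.0 + math.exp(-dot(seq_vec, lv))) for lv in label_table]


rng = random.Random(0)
dim = 8  # hypothetical embedding size
labels = ["term_A", "term_B", "term_C"]  # placeholder labels, not real GO terms
residue_table = make_matrix(len(AMINO_ACIDS), dim, rng)
label_table = make_matrix(len(labels), dim, rng)

seq_vec = embed_sequence("MKTAYIAKQR", residue_table)  # made-up sequence
scores = score_labels(seq_vec, label_table)
for name, s in zip(labels, scores):
    print(f"{name}: {s:.3f}")
```

Because labels are represented as vectors rather than as fixed output units, a rarely annotated label can still receive a score as long as it has an embedding, which is the intuition behind scoring in a shared latent space.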
Availability and implementation
The data, source codes and models are available at https://github.com/Shen-Lab/TALE.
Supplementary information
Supplementary data are available at Bioinformatics online.
Funder
National Institute of General Medical Sciences
Publisher
Oxford University Press (OUP)
Subject
Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability