Clustering FunFams using sequence embeddings improves EC purity

Published: 2021-01-21

Author:

Littmann Maria, Bordin Nicola, Heinzinger Michael, Schütze Konstantin, Dallago Christian, Orengo Christine, Rost Burkhard

Abstract

Motivation: Classifying proteins into functional families can improve our understanding of protein function and allows annotations to be transferred within a family. For this, functional families need to be “pure”, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22,830 of 203,639) contain EC annotations, and of those, 7% (1,526 of 22,830) have inconsistent functional annotations.

Results: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models that transfer knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Clustering FunFams and identifying outliers with DBSCAN on the distances between embeddings doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect increased purity for other aspects of function as well. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.

Availability: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering
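To make the clustering idea concrete, the following is a minimal sketch and not the authors' implementation (their code is in the GitHub repository above). It assumes toy 2-D vectors standing in for real sequence embeddings, Euclidean distance, and arbitrarily chosen `eps` and `min_samples` values; the EC numbers and the helper `pure_clusters` are likewise illustrative. It runs a small textbook DBSCAN over one hypothetical family and reports which resulting clusters contain a single EC annotation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_samples):
    """Textbook DBSCAN. Returns one label per point:
    a cluster id >= 0, or -1 for outliers (noise)."""
    n = len(points)
    # Neighborhoods include the point itself (distance 0).
    neighbors = [
        [j for j in range(n) if euclidean(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue  # already assigned; do not expand again
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: expand
    return labels


def pure_clusters(labels, ec_numbers):
    """Ids of clusters whose members all share one EC annotation."""
    ecs_per_cluster = {}
    for label, ec in zip(labels, ec_numbers):
        if label != -1:  # outliers belong to no cluster
            ecs_per_cluster.setdefault(label, set()).add(ec)
    return sorted(c for c, ecs in ecs_per_cluster.items() if len(ecs) == 1)


# Hypothetical family: three EC annotations, so the family itself is impure.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # toy vectors, group 1
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),   # toy vectors, group 2
    (10.0, 0.0),                          # isolated point
]
ec_numbers = ["1.1.1.1"] * 3 + ["2.7.11.1"] * 3 + ["3.4.21.4"]

labels = dbscan(embeddings, eps=0.5, min_samples=2)  # assumed parameters
print("cluster labels:", labels)
print("pure clusters:", pure_clusters(labels, ec_numbers))
```

In this toy case the family splits into two clusters that are each pure, and the isolated point is flagged as an outlier rather than forced into a cluster, which is the property that makes a density-based method a natural fit for removing inconsistent members.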

Publisher

Cold Spring Harbor Laboratory

Cited by 3 articles.

1. Ultra-fast protein structure prediction to capture effects of sequence variation in mutation movies;2022-11-16

2. Contrastive learning on protein embeddings enlightens midnight zone at lightning speed;2021-11-15

3. Computational approaches to predict protein functional families and functional sites;Current Opinion in Structural Biology;2021-10