CATHe: Detection of remote homologues for CATH superfamilies using embeddings from protein language models

Published: 2022-03-13

Author:

Nallapareddy Vamsi, Bordin Nicola, Sillitoe Ian, Heinzinger Michael, Littmann Maria, Waman Vaishali, Sen Neeladri, Rost Burkhard, Orengo Christine

Abstract

CATH is a protein domain classification resource that combines an automated workflow of structure and sequence comparison with expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues that might be missed by state-of-the-art HMM-based approaches. The proposed algorithm for this task (CATHe) combines a neural network with sequence representations obtained from protein language models. The dataset consisted of remote homologues with less than 20% sequence identity. The CATHe models trained on the 1773 largest and the 50 largest CATH superfamilies had accuracies of 85.6 ± 0.4% and 98.15 ± 0.30%, respectively. To examine whether CATHe could detect more remote homologues than HMM-based approaches, we used a dataset of protein regions that had annotations in Pfam but not in CATH. For this experiment, we used highly reliable CATHe predictions (expected error rate <0.5%), which provided CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold structures with experimental structures from the CATHe-predicted superfamilies.
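The idea of keeping only highly reliable predictions can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a classifier that outputs one raw score per superfamily for each domain, turns the scores into probabilities with a softmax, and keeps a prediction only when the top probability reaches a cut-off. The function names, the toy scores, the domain IDs, and the 0.9 cut-off are all hypothetical; the abstract does not state how the expected error rate was mapped to a score threshold.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confident_predictions(scores_by_domain, labels, threshold):
    """Return {domain_id: (label, probability)} for predictions at or above threshold.

    scores_by_domain: hypothetical mapping from a domain ID to per-class scores
    labels: superfamily labels, in the same order as the scores
    threshold: hypothetical probability cut-off (not a value from the paper)
    """
    kept = {}
    for domain_id, logits in scores_by_domain.items():
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            kept[domain_id] = (labels[best], probs[best])
    return kept


# Toy data: made-up scores for two hypothetical domains.
# The labels are CATH-style superfamily IDs used purely for illustration.
labels = ["3.40.50.300", "1.10.10.10", "2.60.40.10"]
scores = {
    "domain_A": [9.0, 0.5, 0.2],  # strongly favours the first class
    "domain_B": [1.0, 1.1, 0.9],  # ambiguous, so it is filtered out
}
kept = confident_predictions(scores, labels, threshold=0.9)
```

With these toy scores only `domain_A` is retained. In practice a cut-off of this kind would be calibrated on held-out data so that the retained predictions meet a target error rate.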

Publisher

Cold Spring Harbor Laboratory

References (50 articles)

1. CATH: increased structural coverage of functional space;Nucleic Acids Res,2021

2. Protein Data Bank: the single global archive for 3D macromolecular structure data;Nucleic Acids Research

3. Gene3D: Extensive prediction of globular domains in proteins

4. SSAP: Sequential structure alignment program for protein structure comparison

5. CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures.

Cited by 11 articles.

1. Human O-linked Glycosylation Site Prediction Using Pretrained Protein Language Model;2023-12-06

2. Human O-linked Glycosylation Site Prediction Using Pretrained Protein Language Model;2023-10-24

3. The opportunities and challenges posed by the new generation of deep learning-based protein structure predictors;Current Opinion in Structural Biology;2023-04

4. Novel machine learning approaches revolutionize protein knowledge;Trends in Biochemical Sciences;2023-04

5. Nearest neighbor search on embeddings rapidly identifies distant protein relations;Frontiers in Bioinformatics;2022-11-17