Abstract
CATH is a protein domain classification resource that combines an automated workflow of structure and sequence comparison with expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues that might be missed by state-of-the-art HMM-based approaches. The proposed algorithm for this task (CATHe) combines a neural network with sequence representations obtained from protein language models. The employed dataset consisted of remote homologues that had less than 20% sequence identity. The CATHe models trained on the 1773 largest and the 50 largest CATH superfamilies had accuracies of 85.6 ± 0.4% and 98.15 ± 0.30%, respectively. To examine whether CATHe was able to detect more remote homologues than HMM-based approaches, we employed a dataset consisting of protein regions that had annotations in Pfam but not in CATH. For this experiment, we used highly reliable CATHe predictions (expected error rate <0.5%), which provided CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold structures with experimental structures from the CATHe-predicted superfamilies.
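The abstract does not say how the "highly reliable" predictions (expected error rate <0.5%) were selected. One plausible reading is that a confidence cutoff is calibrated on held-out data. The sketch below illustrates that idea only. The function name `pick_threshold`, its parameters, and the calibration procedure are assumptions made for illustration, not the authors' method.

```python
from typing import Optional, Sequence


def pick_threshold(confidences: Sequence[float],
                   correct: Sequence[bool],
                   max_error: float = 0.005) -> Optional[float]:
    """Hypothetical helper: return the lowest confidence cutoff at which
    the retained held-out predictions have an error rate <= max_error.
    Returns None if no cutoff satisfies the bound."""
    # Rank held-out predictions from most to least confident.
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    best = None
    errors = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        errors += 0 if ok else 1
        # Evaluate only at the end of a run of tied confidences, so that
        # "confidence >= cutoff" keeps exactly the first i predictions.
        is_last_of_tie = i == len(ranked) or ranked[i][0] != conf
        if is_last_of_tie and errors / i <= max_error:
            best = conf
    return best


if __name__ == "__main__":
    # Toy held-out set: model confidence and whether the call was right.
    conf = [0.99, 0.95, 0.90, 0.60, 0.50]
    ok = [True, True, True, False, True]
    print(pick_threshold(conf, ok, max_error=0.0))
```

Under this assumption, predictions on new sequences would be kept only when their confidence is at or above the returned cutoff. A stricter `max_error` gives fewer but more reliable annotations.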
Publisher
Cold Spring Harbor Laboratory
Cited by
11 articles.