Searching sequence databases for functional homologs using profile HMMs: how to set bit score thresholds?

Published: 2021-06-25

Author:

Srivastava Jaya, Hembrom Ritu, Kumawat Ankita, Balaji Petety V.

Abstract

Motivation: UniProt and BFD databases together have 2.5 billion protein sequences. A large majority of these proteins have been electronically annotated. Automated annotation pipelines, compared with manual curation, have the advantage of scale and speed but have relatively higher error rates. This is because sequence homology does not necessarily translate to functional homology, molecular function specification is hierarchical, and not all functional families have the same amount of experimental data that one can exploit for annotation. Consequently, customizing the annotation workflow is unavoidable if annotation errors are to be minimized.

Results: We discuss possible ways of customizing the search of sequence databases for functional homologs using profile HMMs. Choosing an optimal bit score threshold is a critical step in the application of HMMs; this is illustrated using four case studies: the single-domain nucleotide sugar 6-dehydrogenase and lysozyme-C families, and the SH3 and GT-A domains, which are typically found as part of multi-domain proteins. We also discuss the limitations of using profile HMMs for functional annotation and suggest some possible ways to partially overcome such limitations.

Supplementary information: Supplementary_material containing Figures S1-S7 and Tables S1 and S2; Supplementary_dataset.xlsx

Publisher

Cold Spring Harbor Laboratory

References: 41 articles.

1. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold

2. Stereo-electronic control of reaction selectivity in short-chain dehydrogenases: Decarboxylation, epimerization, and dehydration

3. Protein multiple alignments: sequence-based versus structure-based programs

4. Stochastic models for heterogeneous DNA sequences

5. Durbin R, Eddy SR, Krogh A, Mitchison G. 1998. Section 5.2, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1st ed. Cambridge University Press.

Cited by 2 articles.

1. Clues to reaction specificity in PLP-dependent fold type I aminotransferases of monosaccharide biosynthesis; Proteins: Structure, Function, and Bioinformatics; 2022-02-22

2. Clues to reaction specificity in PLP-dependent fold type I aminotransferases of monosaccharide biosynthesis; 2021-09-06