Data mining of enzymes using specific peptides

Author:

Weingart Uri, Lavi Yair, Horn David

Abstract

Background: Predicting the function of a protein from its sequence is a long-standing challenge in bioinformatics research, typically addressed using either sequence similarity or sequence motifs. We employ a novel motif method based on Specific Peptides (SPs), which are unique to specific branches of the Enzyme Commission (EC) functional classification. We devise the Data Mining of Enzymes (DME) methodology, which searches for SPs on arbitrary proteins and determines from a protein's sequence whether it is an enzyme and, if so, what its EC classification is.

Results: We extract novel SP sets from Swiss-Prot enzyme data. Using a training set of July 2006 and test sets of July 2008, we find that the predictive power of SPs, both for true positives (enzymes) and true negatives (non-enzymes), depends on the coverage length of all SP matches (the number of amino acids matched on the protein sequence). DME is quite different from BLAST. Comparing the two on an enzyme test set of July 2008, we find that DME has lower recall. On the other hand, DME can provide predictions for proteins that BLAST regards as having low homology with known enzymes, thus supplying complementary information. We test our method on a set of proteins belonging to 10 bacteria, dated July 2008, establishing the usefulness of the coverage-length cutoff for determining true negatives. Moreover, sifting through our predictions, we find that some of them were substantiated by Swiss-Prot annotations by July 2009. Finally, we extract, for production purposes, a novel SP set trained on all Swiss-Prot enzymes as of July 2009. This new set considerably increases the recall of DME. The new SP set is being applied to three metagenomes: Sargasso Sea, with over 1,000,000 proteins, producing predictions of over 220,000 enzymes, and two human gut metagenomes. The outcome of these analyses can be characterized by the enzymatic profile of each metagenome, which describes the relative numbers of enzymes observed for different EC categories.

Conclusions: Employing SPs to predict the enzymatic activity of proteins works well once coverage-length criteria are applied. In our analysis, a coverage length of L ≥ 7 has led to highly accurate results.
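To make the coverage-length criterion concrete, the following is a minimal sketch in Python and not the authors' implementation. It assumes that SP matches are exact substring hits and that the coverage length L counts each matched sequence position once, even where matches overlap. The function names, the toy peptides, and their EC labels are hypothetical, and because the abstract does not say how DME resolves hits from different EC branches, the sketch simply returns all labels of the matching peptides.

```python
def coverage_length(sequence, peptides):
    """Count the positions of `sequence` covered by at least one peptide match.

    Assumption: a match is an exact substring hit, and overlapping
    matches contribute each covered position only once.
    """
    covered = [False] * len(sequence)
    for peptide in peptides:
        start = sequence.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            # Continue one position later so overlapping hits are found too.
            start = sequence.find(peptide, start + 1)
    return sum(covered)


def predict_enzyme(sequence, sp_to_ec, min_coverage=7):
    """Return the EC labels of matching peptides, or None below the cutoff.

    `sp_to_ec` maps each peptide to an EC label (hypothetical data layout).
    `min_coverage` mirrors the L >= 7 cutoff quoted in the abstract.
    """
    hits = [sp for sp in sp_to_ec if sp in sequence]
    if coverage_length(sequence, hits) < min_coverage:
        return None  # treated as a non-enzyme prediction
    return {sp_to_ec[sp] for sp in hits}


# Toy data, invented purely for illustration.
toy_sps = {"TAYIAK": "1.1.1", "AKQRQ": "1.1.1", "WWWWW": "2.7"}
toy_seq = "MKTAYIAKQRQISFVK"

print(coverage_length(toy_seq, toy_sps))  # two overlapping hits cover 9 positions
print(predict_enzyme(toy_seq, toy_sps))
```

In this toy case a single six-residue hit would fall below the cutoff on its own, while two overlapping hits together exceed it, which is how a coverage-length cutoff can filter out isolated chance matches.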

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology
