Estimating probabilistic context-free grammars for proteins using contact map constraints

Author:

Dyrka Witold¹, Pyzik Mateusz¹, Coste François², Talibart Hugo²

Affiliation:

1. Wydział Podstawowych Problemów Techniki, Katedra Inżynierii Biomedycznej, Politechnika Wrocławska, Wrocław, Poland

2. Univ Rennes, Inria, CNRS, IRISA, Rennes, France

Abstract

Interactions between amino acids that are close in the spatial structure, but not necessarily in the sequence, play important structural and functional roles in proteins. These non-local interactions ought to be taken into account when modeling collections of proteins. Yet the most popular representations of sets of related protein sequences remain the profile Hidden Markov Models. By modeling independently the distributions of the conserved columns from an underlying multiple sequence alignment of the proteins, these models are unable to capture dependencies between the protein residues. Non-local interactions can be represented by using more expressive grammatical models. However, learning such grammars is difficult. In this work, we propose to use information on protein contacts to facilitate the training of probabilistic context-free grammars representing families of protein sequences. We develop the theory behind the introduction of contact constraints in maximum-likelihood and contrastive estimation schemes and implement it in a machine learning framework for protein grammars. The proposed framework is tested on samples of protein motifs in comparison with learning without contact constraints. The evaluation shows high fidelity of grammatical descriptors to protein structures and improved precision in recognizing sequences. Finally, we present an example of using our method in a practical setting and demonstrate its potential beyond the current state of the art by creating a grammatical model of a meta-family of protein motifs. We conclude that the current piece of research is a significant step towards more flexible and accurate modeling of collections of protein sequences. The software package is made available to the community.

Funder

National Science Centre, Poland

E-SCIENCE.PL Infrastructure

University of Rennes

Wroclaw Center for Networking and Supercomputing

Publisher

PeerJ

Subject

General Agricultural and Biological Sciences; General Biochemistry, Genetics and Molecular Biology; General Medicine; General Neuroscience
