Ambiguity Coding Allows Accurate Inference of Evolutionary Parameters from Alignments in an Aggregated State-Space

Published:2020-04-30 Issue:1 Volume:70 Page:21-32
ISSN:1063-5157
Container-title:Systematic Biology
language:en

Author:

Weber Claudia C¹, Perron Umberto¹, Casey Dearbhaile¹, Yang Ziheng², Goldman Nick¹

Affiliation:

1. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK

2. Department of Genetics, University College London, London WC1E 6BT, UK

Abstract

How can we best learn the history of a protein’s evolution? Ideally, a model of sequence evolution should capture both the process that generates genetic variation and the functional constraints determining which changes are fixed. However, in practical terms the most suitable approach may simply be the one that combines the convenience of easily available input data with the ability to return useful parameter estimates. For example, we might be interested in a measure of the strength of selection (typically obtained using a codon model) or an ancestral structure (obtained using structural modeling based on inferred amino acid sequence and side chain configuration). But what if data in the relevant state-space are not readily available? We show that it is possible to obtain accurate estimates of the outputs of interest using an established method for handling missing data. Encoding observed characters in an alignment as ambiguous representations of characters in a larger state-space allows the application of models with the desired features to data that lack the resolution that is normally required. This strategy is viable because the evolutionary path taken through the observed space contains information about states that were likely visited in the “unseen” state-space. To illustrate this, we consider two examples with amino acid sequences as input. We show that $\omega$, a parameter describing the relative strength of selection on nonsynonymous and synonymous changes, can be estimated in an unbiased manner using an adapted version of a standard 61-state codon model. Using simulated and empirical data, we find that ancestral amino acid side chain configuration can be inferred by applying a 55-state empirical model to 20-state amino acid data. Where feasible, combining inputs from both ambiguity-coded and fully resolved data improves accuracy. Adding structural information to as few as 12.5% of the sequences in an amino acid alignment results in remarkable ancestral reconstruction performance compared to a benchmark that considers the full rotamer state information. These examples show that our methods permit the recovery of evolutionary information from sequences where it has previously been inaccessible. [Ancestral reconstruction; natural selection; protein structure; state-spaces; substitution models.]
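
The ambiguity-coding idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard genetic code and the usual convention in likelihood-based phylogenetics that an ambiguous tip is assigned partial likelihood 1 for every state compatible with the observation and 0 otherwise. The names `GENETIC_CODE`, `SENSE_CODONS`, and `tip_vector` are hypothetical and chosen only for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# at each of the three positions (assumption: standard code only).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# The 61 sense codons form the larger, "unseen" state-space.
SENSE_CODONS = [codon for codon, aa in GENETIC_CODE.items() if aa != "*"]


def tip_vector(amino_acid):
    """Encode an observed amino acid as an ambiguous codon observation.

    Returns one partial likelihood per sense codon: 1.0 where the codon
    translates to the observed amino acid, 0.0 elsewhere. Gaps and
    unknown residues are treated as fully ambiguous (all 1.0).
    """
    if amino_acid in {"-", "?", "X"}:
        return [1.0] * len(SENSE_CODONS)
    vector = [
        1.0 if GENETIC_CODE[codon] == amino_acid else 0.0
        for codon in SENSE_CODONS
    ]
    if not any(vector):
        raise ValueError(f"no sense codon encodes {amino_acid!r}")
    return vector


if __name__ == "__main__":
    for aa in "MLW-":
        compatible = int(sum(tip_vector(aa)))
        print(f"{aa}: {compatible} compatible codon state(s)")
```

Vectors of this kind could serve as tip inputs to a pruning-style likelihood calculation under a 61-state codon model. The same pattern would apply to mapping 20 amino acid states onto a larger rotamer state-space, given a table of which expanded states belong to each amino acid.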

Publisher

Oxford University Press (OUP)

Subject

Genetics, Ecology, Evolution, Behavior and Systematics

Link

http://academic.oup.com/sysbio/advance-article-pdf/doi/10.1093/sysbio/syaa036/33400583/syaa036.pdf

Cited by 1 article.

1. DNA Sequences Are as Useful as Protein Sequences for Inferring Deep Phylogenies;Systematic Biology;2023-06-27