Abstract
AbstractBiochemical constraints on the admissible amino acids at specific sites in proteins leads to heterogeneity of the amino acid substitution process over sites in alignments. It is well known that phylogenetic models of protein sequence evolution that do not account for site heterogeneity are prone to long-branch attraction (LBA) artifacts. Profile mixture models were developed to model heterogeneity of preferred amino acids at sites via a finite distribution of site classes each with a distinct set of equilibrium amino acid frequencies. However, it is unknown whether the large number of parameters in such models associated with the many amino acid frequency classes can adversely affect tree topology estimates because of over-parameterization. Here we demonstrate theoretically that for long sequences, over-parameterization does not create problems for estimation with profile mixture models. Under mild conditions, tree, amino acid frequencies and other model parameters converge to true values as sequence length increases, even when there are large numbers of components in the frequency profile distributions. Because large sample theory does not necessarily imply good behavior for shorter alignments we explore performance of these models with short alignments simulated with tree topologies that are prone to LBA artifacts. We find that over-parameterization is not a problem for complex profile mixture models even when there are many amino acid frequency classes. In fact, simple models with few site classes behave poorly. Interestingly, we also found that misspecification of the amino acid frequency classes does not lead to increased LBA artifacts as long as the estimated cumulative distribution function of the amino acid frequencies at sites adequately approximates the true one. In contrast, misspecification of the amino acid exchangeability rates can severely negatively affect parameter estimation. Finally, we explore the effects of including in the profile mixture model an additional ‘F-class’ representing the overall frequencies of amino acids in the data set. Surprisingly, the F-class does not help parameter estimation significantly, and can decrease the probability of correct tree estimation, depending on the scenario, even though it tends to improve likelihood scores.
Publisher
Cold Spring Harbor Laboratory
Cited by
5 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献