Is Over-parameterization a Problem for Profile Mixture Models?

Published: 2022-02-20

Author:

Baños Hector, Susko Edward, Roger Andrew J.

Abstract

Biochemical constraints on the admissible amino acids at specific sites in proteins lead to heterogeneity of the amino acid substitution process over sites in alignments. It is well known that phylogenetic models of protein sequence evolution that do not account for site heterogeneity are prone to long-branch attraction (LBA) artifacts. Profile mixture models were developed to model heterogeneity of preferred amino acids at sites via a finite distribution of site classes, each with a distinct set of equilibrium amino acid frequencies. However, it is unknown whether the large number of parameters in such models, associated with the many amino acid frequency classes, can adversely affect tree topology estimates because of over-parameterization. Here we demonstrate theoretically that for long sequences, over-parameterization does not create problems for estimation with profile mixture models. Under mild conditions, the tree, amino acid frequencies, and other model parameters converge to their true values as sequence length increases, even when there are large numbers of components in the frequency profile distributions. Because large sample theory does not necessarily imply good behavior for shorter alignments, we explore the performance of these models with short alignments simulated with tree topologies that are prone to LBA artifacts. We find that over-parameterization is not a problem for complex profile mixture models even when there are many amino acid frequency classes. In fact, simple models with few site classes behave poorly. Interestingly, we also found that misspecification of the amino acid frequency classes does not lead to increased LBA artifacts as long as the estimated cumulative distribution function of the amino acid frequencies at sites adequately approximates the true one. In contrast, misspecification of the amino acid exchangeability rates can severely impair parameter estimation.
Finally, we explore the effects of including in the profile mixture model an additional ‘F-class’ representing the overall frequencies of amino acids in the data set. Surprisingly, the F-class does not help parameter estimation significantly, and can decrease the probability of correct tree estimation, depending on the scenario, even though it tends to improve likelihood scores.
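
As a rough illustration of the model class the abstract describes, the likelihood of a single alignment column under a profile mixture is commonly written as a weighted sum over site classes. The notation below is assumed for illustration and is not taken from the paper: C is the number of site classes, w_c the class weights, π_c the class-specific equilibrium amino acid frequency vectors, R a shared exchangeability matrix, T the tree with branch lengths, and x_i the i-th of n alignment columns.

```latex
% Site likelihood as a finite mixture over frequency profiles (assumed notation)
L(x_i \mid T, R, w, \pi) \;=\; \sum_{c=1}^{C} w_c \, P\!\left(x_i \mid T, R, \pi_c\right),
\qquad w_c \ge 0, \quad \sum_{c=1}^{C} w_c = 1

% Log-likelihood of the alignment, assuming independent sites
\log L \;=\; \sum_{i=1}^{n} \log L(x_i \mid T, R, w, \pi)
```

In this notation, an 'F-class' would be one additional component whose frequency vector is the overall amino acid frequencies observed in the data set, and the parameter count grows with C through the weights and any estimated profiles.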

Publisher

Cold Spring Harbor Laboratory

Cited by 5 articles.

1. Phylogenomic analyses of ochrophytes (stramenopiles) with an emphasis on neglected lineages;Molecular Phylogenetics and Evolution;2024-09

2. MixtureFinder: Estimating DNA mixture models for phylogenetic analyses;2024-03-21

3. Incongruence in the phylogenomics era;Nature Reviews Genetics;2023-06-27

4. Resolving tricky nodes in the tree of life through amino acid recoding;iScience;2022-12

5. Integrating phylogenetics with intron positions illuminates the origin of the complex spliceosome;2022-09-02