Abstract
Phylogenetics has long relied on the use of orthologs, or genes related through speciation events, to infer species relationships. However, identifying orthologs is difficult because gene duplication can obscure relationships among genes. Researchers have been particularly concerned with the insidious effects of pseudoorthologs, which are duplicated genes mistaken for orthologs because they are present in a single copy in each sampled species. Because gene tree topologies of pseudoorthologs may differ from the species tree topology, they have often been invoked as the cause of counterintuitive results in phylogenetics. Despite these perceived problems, no previous work has calculated the probabilities of pseudoortholog topologies, or has been able to circumscribe the regions of parameter space in which pseudoorthologs are most likely to occur. Here, we introduce a model for calculating the probabilities and branch lengths of orthologs and pseudoorthologs, including concordant and discordant pseudoortholog topologies, on a rooted three-taxon species tree. We show that the probability of orthologs is high relative to the probability of pseudoorthologs across reasonable regions of parameter space. Furthermore, the probabilities of the two discordant topologies are equal and never exceed that of the concordant topology, generally being much lower. We describe the species tree topologies most prone to generating pseudoorthologs, finding that they are likely to present problems to phylogenetic inference irrespective of the presence of pseudoorthologs. Overall, our results suggest that pseudoorthologs are less of a problem for phylogenetics than currently believed, which should allow researchers to greatly increase the number of genes used in phylogenetic inference.
Significance Statement
Phylogenetics has long relied on the use of orthologs, or genes related through speciation events, to infer species relationships. However, filtering datasets to include only orthologs is both difficult and restrictive, drastically limiting the amount of data available for phylogenetic inference. Here, we introduce a model to study the probability and topologies of pseudoorthologs, which are duplicated genes mistaken for orthologs because they are present in a single copy in each sampled species. We show that pseudoorthologs are rare and that, even when they are present, they should not mislead phylogenetic inference. Our results suggest that strict filtering to remove pseudoorthologs unnecessarily limits the amount of data used in phylogenetic inference.
Publisher
Cold Spring Harbor Laboratory