Abstract
When inferring the evolutionary history of species and the genes they contain, the phylogenetic trees of the genes can differ from that of the species and from each other, due to a variety of causes including incomplete lineage sorting. We often wish to infer the species tree, but can reconstruct only the gene trees from sequences. We then combine the gene trees to produce a species tree; methods to do this are known as summary methods, of which ASTRAL is currently among the most popular. ASTRAL has been shown to be practically accurate in many scenarios through extensive simulations. However, these simulations generally assume that the input gene trees are independent of each other. This is known to be unrealistic, as genes that are close to each other on the chromosome (or are related by function) have dependent phylogenies, due to the absence of unlimited recombination between the genes.
In this paper, we develop a model for generating dependent gene trees within a species tree, based on the coalescent with recombination. We then use these trees as input to ASTRAL to reassess its accuracy for dependent gene trees. Our results show that ASTRAL performs more poorly with greater dependence, both when gene trees are known and when they are estimated from sequences. Indeed, the effect of dependence between gene trees is comparable to (if not larger than) the effect of gene tree estimation error. We then re-analyse a 37-taxon mammalian data set; under a realistic recombination rate, the estimated accuracy of ASTRAL decreases substantially (the Robinson-Foulds distance increases by a factor of 4.7) relative to the accuracy previously estimated with independent gene trees, and the effective sample size for this data set is about one-third of the actual sample size.
This shows that the impact of gene tree dependence on the accuracy of ASTRAL (and other summary methods) can be extensive.
Author summary
The study of the evolutionary history of species is important for understanding and reconstructing the history of life on Earth. These evolutionary histories are represented in the form of species trees, which can be reconstructed from the evolutionary histories of the genes contained in the species using so-called species tree inference methods. This is complicated by the fact that the histories of the genes (gene trees) can be related to each other, depending on their placement in the genome or their functions. Gene tree dependence is not taken into account in almost all studies of the accuracy of species tree inference. In this paper, we develop a statistical model to include gene tree dependence, and show that it can significantly affect the accuracy of species tree inference. This effect is at least as large as the impact of incorrect reconstruction of the gene trees themselves, a well-known issue in species tree inference.
Publisher
Cold Spring Harbor Laboratory