Author:
Mahbub Sazan, Sawmya Shashata, Saha Arpita, Reaz Rezwana, Rahman M. Sohel, Bayzid Md. Shamsuzzoha
Abstract
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to biological causes such as gene birth and loss), gene trees are often incomplete, meaning that not all species of interest share a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We introduce, for the first time, the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present QT-GILD, an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing (NLP), which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique that needs no explicit modeling of the subject system, the reasons for missing data, or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical data sets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but can also account for gene tree estimation error. Therefore, QT-GILD advances the state of the art in species tree estimation from gene trees in the face of missing data. QT-GILD is freely available in open source form at https://github.com/pythonLoader/QT-GILD.
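To make the notion of a quartet distribution induced by incomplete gene trees concrete, the following is a minimal sketch in Python. It is not QT-GILD and performs no imputation; it only counts the quartet topologies that a set of gene trees induces, so that the deficit caused by missing taxa becomes visible. The nested-tuple tree encoding, the function names (`leaf_sets`, `induced_quartets`, `quartet_distribution`, `quartet`), and the toy gene trees are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def quartet(a, b, c, d):
    """Hypothetical key for the unrooted quartet topology ab|cd."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])


def leaf_sets(tree):
    """Return (leaves, clusters) for a nested-tuple tree whose leaves are strings."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, clusters = frozenset(), []
    for child in tree:
        child_leaves, child_clusters = leaf_sets(child)
        leaves |= child_leaves
        clusters.extend(child_clusters)
    clusters.append(leaves)  # leaf set below this internal node
    return leaves, clusters


def induced_quartets(tree):
    """Set of resolved quartet topologies induced by one gene tree."""
    leaves, clusters = leaf_sets(tree)
    found = set()
    for four in combinations(sorted(leaves), 4):
        q = frozenset(four)
        for cluster in clusters:
            side = cluster & q
            # A cluster holding exactly two of the four taxa splits them 2|2.
            if len(side) == 2:
                found.add(frozenset([side, q - side]))
                break
    return found


def quartet_distribution(gene_trees):
    """Count how many gene trees induce each quartet topology."""
    counts = Counter()
    for tree in gene_trees:
        counts.update(induced_quartets(tree))
    return counts


# Toy data (assumed): one complete gene tree and two that lack taxon E.
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]

dist = quartet_distribution(gene_trees)
for topology, count in dist.items():
    left, right = sorted("".join(sorted(side)) for side in topology)
    print(f"{left}|{right}: {count}")
```

In this toy input, only the first gene tree contributes quartets involving E, so every four-taxon set containing E is supported by a single gene tree, whereas the set {A, B, C, D} is covered by all three. Quartets absent for this reason are the kind of missing entries that the imputation problem described above aims to add back.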
Publisher
Cold Spring Harbor Laboratory