A LASSO-based approach to sample sites for phylogenetic tree search

Author:

Ecker Noa (1), Azouri Dana (1, 2), Bettisworth Ben (3, 4), Stamatakis Alexandros (3, 4), Mansour Yishay (5), Mayrose Itay (2), Pupko Tal (1)

Affiliation:

1. The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel

2. School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel

3. Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany

4. Institute of Theoretical Informatics, Karlsruhe Institute of Technology, 76128 Karlsruhe, Germany

5. The Blavatnik School of Computer Science, Raymond & Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel

Abstract

Motivation: In recent years, full-genome sequences have become increasingly available, and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstruction from such large-scale alignments is challenging for likelihood-based phylogenetic inference programs and usually requires a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in alignment size and are claimed to have a negative effect on the accuracy of the obtained tree.

Results: Here, we propose an artificial-intelligence-based approach that provides a means to select the optimal subset of sites, together with a formula by which one can compute the log-likelihood of the entire data based on this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while putting a constraint on the number of sites used for the approximation. We show that computing the likelihood based on 5% of the sites already provides an accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search decreased running time substantially while retaining the same tree-search performance.

Availability and implementation: The code was implemented in Python version 3.8 and is available through GitHub (https://github.com/noaeker/lasso_positions_sampling). The datasets used in this paper were retrieved from Zhou et al. (2018) as described in Section 3.

Supplementary information: Supplementary data are available at Bioinformatics online.
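The idea described in the Results can be illustrated with a small sketch. It is not the authors' implementation (that is in the GitHub repository above). It assumes that each training row holds the per-site log-likelihoods of one training tree and that the target is their sum; this setup, the synthetic data, the penalty value and the function names `fit_lasso` and `approximate_loglik` are all assumptions made for illustration. The Lasso fit is a plain coordinate-descent routine written in pure Python, so the example has no dependencies.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso(X, y, alpha, n_iter=100):
    """Minimise (1/2n)*||y - b - Xw||^2 + alpha*||w||_1 by coordinate descent.

    X: rows = training trees, columns = sites (per-site log-likelihoods).
    y: total log-likelihood of each training tree.
    Returns (weights, intercept); sites with weight 0.0 are dropped.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    w = [0.0] * p
    residual = [yi - y_mean for yi in y]  # residual for w = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of site j with the residual that excludes site j.
            rho = sum(Xc[i][j] * residual[i] for i in range(n)) / n
            rho += col_sq[j] * w[j]
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * Xc[i][j]
                w[j] = new_w
    intercept = y_mean - sum(w[j] * col_means[j] for j in range(p))
    return w, intercept

def approximate_loglik(site_lls, w, intercept):
    """Approximate the full log-likelihood from the selected sites only."""
    return intercept + sum(wj * x for wj, x in zip(w, site_lls) if wj != 0.0)

# Synthetic stand-in data (an assumption, not real alignments): 30 training
# trees, 40 sites, where sites share a few latent per-tree factors.
rng = random.Random(0)
n_trees, n_sites, n_factors = 30, 40, 4
loadings = [(rng.randrange(n_factors), rng.uniform(0.5, 2.0)) for _ in range(n_sites)]
X_train = []
for _ in range(n_trees):
    factors = [rng.gauss(0.0, 1.0) for _ in range(n_factors)]
    X_train.append([-5.0 + scale * factors[k] + rng.gauss(0.0, 0.05)
                    for k, scale in loadings])
y_train = [sum(row) for row in X_train]

weights, intercept = fit_lasso(X_train, y_train, alpha=1.0)
selected_sites = [j for j, wj in enumerate(weights) if wj != 0.0]
print(f"selected {len(selected_sites)} of {n_sites} sites")
print("approximation:", approximate_loglik(X_train[0], weights, intercept),
      "exact:", y_train[0])
```

A larger `alpha` drives more weights to exactly zero, which is how the penalty limits the number of sites used. During a tree search, only the sites with nonzero weight would need to be evaluated for each candidate tree, and the weighted sum plus intercept would stand in for the full log-likelihood.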

Funder

Edmond J. Safra Center for Bioinformatics at Tel Aviv University

The Council for Higher Education

Israel Science Foundation

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics,Computational Theory and Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Statistics and Probability

References (37 articles)

1. Subtree transfer operations and their induced metrics on evolutionary trees; Allen; Ann. Comb., 2001

2. Harnessing machine learning to guide phylogenetic-tree search algorithms; Azouri; Nat. Commun., 2021

3. Maximum likelihood of evolutionary trees: hardness and approximation; Chor; Bioinformatics, 2005

4. Gene tree discordance, phylogenetic inference and the multispecies coalescent; Degnan; Trends Ecol. Evol., 2009

5. Evolutionary trees from DNA sequences: a maximum likelihood approach; Felsenstein; J. Mol. Evol., 1981

Cited by 2 articles.