Online Phylogenetics with matOptimize Produces Equivalent Trees and is Dramatically More Efficient for Large SARS-CoV-2 Phylogenies than de novo and Maximum-Likelihood Implementations

Author:

Kramer Alexander M (1, 2), Thornlow Bryan (1, 2), Ye Cheng (3), De Maio Nicola (4), McBroome Jakob (1, 2), Hinrichs Angie S (2), Lanfear Robert (5), Turakhia Yatish (3), Corbett-Detig Russell (1, 2)

Affiliation:

1. Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA

2. Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA

3. Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093, USA

4. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK

5. Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, ACT 2601, Australia

Abstract

Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 data sets do not fit this mold. There are currently over 14 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an “online” approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) and pseudo-ML methods may be more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger data sets. Here, we evaluate the performance of de novo and online phylogenetic approaches, as well as ML, pseudo-ML, and MP frameworks for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimization with UShER and matOptimize produces equivalent SARS-CoV-2 phylogenies to some of the most popular ML and pseudo-ML inference tools. MP optimization with UShER and matOptimize is thousands of times faster than presently available implementations of ML, and online phylogenetics is faster than de novo inference. Our results therefore suggest that parsimony-based methods like UShER and matOptimize represent an accurate and more practical alternative to established ML implementations for large SARS-CoV-2 phylogenies and could be successfully applied to other similar data sets with particularly dense sampling and short branch lengths.
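The abstract's argument that multiple changes at one site on one branch should be extremely rare on short branches can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration only and is not taken from the paper: it assumes substitutions at a site follow a Poisson process with the same rate at every site, it uses an approximate genome length of 29,903 nucleotides (the Wuhan-Hu-1 reference length), and the per-branch mutation counts are hypothetical values chosen for the example.

```python
import math

# Assumption (not from the abstract): approximate SARS-CoV-2 genome length,
# taken from the Wuhan-Hu-1 reference sequence.
GENOME_LENGTH = 29_903


def prob_multiple_hits(expected_subs_per_site: float) -> float:
    """Probability of two or more substitutions at one site on one branch.

    Assumes the number of substitutions is Poisson-distributed with mean
    `expected_subs_per_site`, so P(N >= 2) = 1 - e^{-t} - t * e^{-t}.
    """
    t = expected_subs_per_site
    # -expm1(-t) equals 1 - e^{-t} and stays accurate for very small t.
    return -math.expm1(-t) - t * math.exp(-t)


if __name__ == "__main__":
    # Hypothetical branch lengths, expressed as expected mutations per genome.
    for muts_per_branch in (0.5, 1.0, 5.0):
        t = muts_per_branch / GENOME_LENGTH
        p_site = prob_multiple_hits(t)
        # Expected number of sites in the genome with a multiple hit on this branch.
        expected_sites = p_site * GENOME_LENGTH
        print(f"{muts_per_branch:>4} mutations/branch: "
              f"P(multiple hit at a site) = {p_site:.3e}, "
              f"expected affected sites = {expected_sites:.3e}")
```

For small branch lengths the probability is approximately t²/2, so it falls quadratically as branches get shorter. Under these simplifying assumptions, that is the regime in which a parsimony reconstruction and a likelihood reconstruction would rarely disagree about what happened on a given branch.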

Funder

National Institutes of Health

University of California

European Molecular Biology Laboratory

Australian Research Council

Chan-Zuckerberg Initiative

Schmidt Futures

Publisher

Oxford University Press (OUP)

Subject

Genetics; Ecology, Evolution, Behavior and Systematics

