Estimating the mean in the space of ranked phylogenetic trees

Published:2024-08 Issue:8 Volume:40
ISSN:1367-4811
Container-title:Bioinformatics
language:en

Author:

Berling Lars¹, Collienne Lena¹, Gavryushkin Alex¹

Affiliation:

1. Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand

Abstract

Motivation: Reconstructing evolutionary histories of biological entities, such as genes, cells, organisms, populations, and species, from phenotypic and molecular sequencing data is central to many biological, palaeontological, and biomedical disciplines. Typically, due to uncertainties and incompleteness in data, the true evolutionary history (phylogeny) is challenging to estimate. Statistical modelling approaches address this problem by introducing and studying probability distributions over all possible evolutionary histories, but can also introduce uncertainties due to misspecification. In practice, computational methods are deployed to learn those distributions, typically by sampling them. This approach, however, is fundamentally challenging as it requires designing and implementing various statistical methods over a space of phylogenetic trees (or treespace). Although the problem of developing statistics over a treespace has received substantial attention in the literature and numerous breakthroughs have been made, it remains largely unsolved. The challenge of solving this problem is twofold: first, a treespace has nontrivial, often counter-intuitive geometry, implying that much of classical Euclidean statistics does not immediately apply; second, many parametrizations of treespace with promising statistical properties are computationally hard, so they cannot be used in data analyses. As a result, there is no single conventional method for estimating even the most fundamental statistics over any treespace, such as mean and variance, and various heuristics are used in practice. Despite the existence of numerous tree summary methods to approximate means of probability distributions over a treespace based on its geometry, and the theoretical promise of this idea, none of these attempts has resulted in a practical method for summarizing tree samples.

Results: In this paper, we present a tree summary method along with useful properties of our chosen treespace while focusing on its impact on phylogenetic analyses of real datasets. We perform an extensive benchmark study and demonstrate that our method outperforms the currently most popular methods with respect to a number of important ‘quality’ statistics. Further, we apply our method to three empirical datasets ranging from cancer evolution to linguistics and find novel insights into the corresponding evolutionary problems in all of them. We hence conclude that this treespace is a promising candidate to serve as a foundation for developing statistics over phylogenetic trees analytically, as well as new computational tools for evolutionary data analyses.

Availability and implementation: An implementation is available at https://github.com/bioDS/Centroid-Code.

Funder

Royal Society Te Apārangi through a Rutherford Discovery Fellowship

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btae514/58902383/btae514.pdf

References (78 articles)

1. Rapid evolution and biogeographic spread in a colorectal cancer;Alves;Nat Commun,2019

2. Computing medians and means in Hadamard spaces;Bačák;SIAM J Optim,2014

3. Robinson–Foulds supertrees;Bansal;Algorithms Mol Biol,2010

4. Central limit theorems for Fréchet means in the space of phylogenetic trees;Barden;Electron J Probab,2013

5. The median procedure for n-trees;Barthélemy;J Classif,1986