Abstract
Tumor mutation trees are the primary tools for modeling the evolution of cancer. Some tumor phylogeny inference methods produce a set of trees representing potential, parallel evolutionary histories, and mutation trees from different patients may also exhibit similar evolutionary processes. When a set of correlated mutation trees is available, compressing the data into a single best-fit tree that exhibits the shared evolutionary processes is of great importance and can be beneficial in many applications. In this study, we present a general setup for studying and analysing the problem of finding a best-fit (centroid) tree for a given set of trees, with the analysis of mutation trees as our main motivation. To this end, let ε : 𝒯_n → ℝ^{n×n} be an embedding of labeled rooted trees into the space of real square matrices, and let L be a norm on this space. We introduce the nearest mapped tree problem as the problem of finding a tree closest to a given matrix A with respect to ε and L, i.e., a tree T*(A) for which L(ε(T*(A)) − A) is minimized. Within this setup, our candidates for the embedding are the adjacency, ancestry, and distance matrices of trees, and we consider the L1 and L2 norms in our analysis. We show that the function d(T1, T2) = L(ε(T1) − ε(T2)) defines a family of dissimilarity measures covering the previously studied parent-child and ancestor-descendent metrics. We also show that the nearest mapped tree problem is polynomial-time solvable for the adjacency matrix embedding and 𝒩𝒫-hard for the ancestry and distance embeddings. The weighted centroid tree problem for a given set of k trees is naturally defined as a nearest mapped tree solution to a weighted sum of the corresponding set of matrices. In this article we consider uniform weighted sums, in which all weights are equal; in particular, the (classical) centroid tree is defined to be a solution when all weights are equal to 1/k (i.e., the mean case).
Similarly, the ω-weighted centroid tree is a solution when all weights are equal to ω/k. To show the generality of our setup, we prove that the solution sets of the centroid tree problem for the adjacency and the ancestry matrices are identical to the solution sets of the consensus tree problem for the parent-child and ancestor-descendent distances, already handled by the algorithms GraPhyC (2018) and TuELiP (2023), respectively. Next, to tackle the problem in some new cases, we provide integer linear programs for the nearest mapped tree problem under the ancestry and distance embeddings, which yield solutions of the weighted centroid tree problem in these cases. To show the effectiveness of this approach, we provide an algorithm, WAncILP2, that solves the 2-weighted centroid tree problem for the ancestry matrix, and we justify the importance of the weighted setup by showing the strong performance of WAncILP2 both in a comprehensive simulation analysis and on a real breast cancer dataset. In the latter, by finding the centroids as representatives of data clusters, we provide supporting evidence that some common aspects of these centroids are suitable candidates for reliable evolutionary information about the original data.
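The definitions above can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, trees are assumed to be given as parent vectors (`parent[i]` is the parent of node `i`, `None` for the root), the ancestry matrix is assumed to record proper ancestors only, and the nearest mapped tree is found by brute-force enumeration, which is feasible only for very small n and stands in for the integer linear programs mentioned in the abstract.

```python
from itertools import product


def adjacency(parent):
    """Adjacency embedding: entry [p][c] is 1 when p is the parent of c."""
    n = len(parent)
    m = [[0.0] * n for _ in range(n)]
    for child, p in enumerate(parent):
        if p is not None:
            m[p][child] = 1.0
    return m


def ancestry(parent):
    """Ancestry embedding (assumed convention: proper ancestors, zero diagonal)."""
    n = len(parent)
    m = [[0.0] * n for _ in range(n)]
    for node in range(n):
        a = parent[node]
        while a is not None:
            m[a][node] = 1.0
            a = parent[a]
    return m


def l1(x, y):
    """Entrywise L1 norm of the difference of two matrices."""
    return sum(abs(a - b) for rx, ry in zip(x, y) for a, b in zip(rx, ry))


def weighted_sum(matrices, weight):
    """Uniform weighted sum: every matrix gets the same weight."""
    n = len(matrices[0])
    return [[weight * sum(m[i][j] for m in matrices) for j in range(n)]
            for i in range(n)]


def _reaches_root(parent, node):
    """True if walking up from node hits the root within n steps (no cycle)."""
    for _ in range(len(parent)):
        node = parent[node]
        if node is None:
            return True
    return False


def all_rooted_trees(n):
    """Yield every parent vector on n labeled nodes that is a rooted tree."""
    for parent in product([None, *range(n)], repeat=n):
        if sum(p is None for p in parent) != 1:
            continue  # exactly one root is required
        if all(p is None or _reaches_root(parent, i)
               for i, p in enumerate(parent)):
            yield list(parent)


def nearest_mapped_tree(target, embed):
    """Brute-force a tree T minimizing l1(embed(T), target); returns (T, distance)."""
    n = len(target)
    return min(((t, l1(embed(t), target)) for t in all_rooted_trees(n)),
               key=lambda pair: pair[1])


# Chain 0 -> 1 -> 2 and star 0 -> {1, 2}.
chain, star = [None, 0, 1], [None, 0, 0]
parent_child_distance = l1(adjacency(chain), adjacency(star))
ancestor_descendant_distance = l1(ancestry(chain), ancestry(star))

# Classical centroid (all weights 1/k) of three trees under the ancestry embedding.
trees = [chain, chain, star]
mean_matrix = weighted_sum([ancestry(t) for t in trees], 1.0 / len(trees))
centroid, centroid_distance = nearest_mapped_tree(mean_matrix, ancestry)
```

Passing a weight of `omega / len(trees)` to `weighted_sum` would give the target matrix of the ω-weighted centroid problem under the same assumptions. With 0/1 matrices and the L1 norm, the two distances computed here count the parent-child and ancestor-descendant pairs on which the trees disagree.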
Publisher
Cold Spring Harbor Laboratory