Scaling deep phylogenetic embedding to ultra-large reference trees: a tree-aware ensemble approach

Published: 2023-03-29

Authors:

Jiang Yueyu, McDonald Daniel, Knight Rob, Mirarab Siavash

Abstract

Phylogenetic placement of a query sequence on a backbone tree is increasingly used across biomedical sciences to identify the content of a sample from its DNA content. The accuracy of such analyses depends on the density of the backbone tree, making it crucial that placement methods scale to very large trees. Moreover, a new paradigm has been recently proposed to place sequences on the species tree using single-gene data. The goal is to better characterize the samples and to enable combined analyses of marker-gene (e.g., 16S rRNA gene amplicon) and genome-wide data. The recent method DEPP enables performing such analyses using metric learning. However, metric learning is hampered by a need to compute and save a quadratically growing matrix of pairwise distances during training. Thus, DEPP (or any distance-based method) does not scale to more than roughly ten thousand species, a problem that we faced when trying to use our recently released Greengenes2 (GG2) reference tree containing 331,270 species. Scalability problems can be addressed in phylogenetics using divide-and-conquer. However, applying divide-and-conquer to data-hungry machine learning methods needs nuance. This paper explores divide-and-conquer for training ensembles of DEPP models, culminating in a method called C-DEPP that uses carefully crafted techniques to enable quasi-linear scaling while maintaining accuracy. C-DEPP enables placing twenty million 16S fragments on the GG2 reference tree in 41 hours of computation.
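The quadratic growth mentioned in the abstract can be made concrete with a rough back-of-the-envelope sketch. The code below is only an illustration under stated assumptions: it assumes a dense matrix stored at 4 bytes per entry (float32) and a naive split into disjoint subsets of at most 10,000 species. Neither the storage format nor the partitioning scheme is taken from the paper, and the function names are hypothetical; C-DEPP's actual technique is described by the authors as more carefully crafted than this.

```python
def dense_matrix_bytes(n_species: int, bytes_per_entry: int = 4) -> int:
    """Bytes for a full n x n pairwise distance matrix (assumed dense, float32)."""
    return n_species * n_species * bytes_per_entry


def partitioned_matrix_bytes(
    n_species: int, subset_size: int, bytes_per_entry: int = 4
) -> int:
    """Total bytes if species are split into disjoint subsets of at most
    `subset_size`, each with its own dense matrix (a hypothetical scheme)."""
    full_subsets, remainder = divmod(n_species, subset_size)
    return (
        full_subsets * dense_matrix_bytes(subset_size, bytes_per_entry)
        + dense_matrix_bytes(remainder, bytes_per_entry)
    )


if __name__ == "__main__":
    # 10,000 is the rough scaling limit quoted in the abstract;
    # 331,270 is the size of the GG2 reference tree.
    for n in (10_000, 331_270):
        print(f"{n:>7} species, one matrix: {dense_matrix_bytes(n) / 1e9:.1f} GB")
    split = partitioned_matrix_bytes(331_270, 10_000)
    print(f"331,270 species, subsets of 10,000: {split / 1e9:.1f} GB")
```

With the subset size held fixed, the total in the partitioned case grows linearly with the number of species, which is the basic reason divide-and-conquer is attractive here.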

Publisher

Cold Spring Harbor Laboratory

References (42 articles)

1. Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns

2. Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0

3. TreeCluster: Clustering biological sequences using phylogenetic trees

4. APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments

5. Fast and accurate distance‐based phylogenetic placement using divide and conquer

Cited by 2 articles.

1. Greengenes2 unifies microbial data in a single reference tree. Nature Biotechnology. 2023-07-27

2. Greengenes2 enables a shared data universe for microbiome studies. 2022-12-20