Scaling DEPP phylogenetic placement to ultra-large reference trees: a tree-aware ensemble approach

Published: 2024-06 Issue: 6 Volume: 40
ISSN: 1367-4803
Container-title: Bioinformatics
Language: en

Author:

Jiang Yueyu¹, McDonald Daniel², Perry Daniela², Knight Rob²,³, Mirarab Siavash¹,³

Affiliation:

1. Electrical and Computer Engineering Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States

2. Pediatrics Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States

3. Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States

Abstract

Motivation: Phylogenetic placement of a query sequence on a backbone tree is increasingly used across biomedical sciences to identify the content of a sample from its DNA content. The accuracy of such analyses depends on the density of the backbone tree, making it crucial that placement methods scale to very large trees. Moreover, a new paradigm has been recently proposed to place sequences on the species tree using single-gene data. The goal is to better characterize the samples and to enable combined analyses of marker-gene (e.g., 16S rRNA gene amplicon) and genome-wide data. The recent method DEPP enables performing such analyses using metric learning. However, metric learning is hampered by a need to compute and save a quadratically growing matrix of pairwise distances during training. Thus, the training phase of DEPP does not scale to more than roughly 10 000 backbone species, a problem that we faced when trying to use our recently released Greengenes2 (GG2) reference tree containing 331 270 species.

Results: This paper explores divide-and-conquer for training ensembles of DEPP models, culminating in a method called C-DEPP. While divide-and-conquer has been extensively used in phylogenetics, applying divide-and-conquer to data-hungry machine-learning methods needs nuance. C-DEPP uses carefully crafted techniques to enable quasi-linear scaling while maintaining accuracy. C-DEPP enables placing 20 million 16S fragments on the GG2 reference tree in 41 h of computation.

Availability and implementation: The dataset and C-DEPP software are freely available at https://github.com/yueyujiang/dataset_cdepp/.
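The quadratic growth mentioned in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the DEPP or C-DEPP code: it assumes a dense, square distance matrix stored as 4-byte floats, and the function names and the equal-sized disjoint subsets of 10 000 species are hypothetical choices made for the example, not the partitioning scheme the paper uses.

```python
def dense_matrix_bytes(n_species: int, bytes_per_entry: int = 4) -> int:
    """Bytes for a full n x n pairwise distance matrix (assumed float32)."""
    return n_species * n_species * bytes_per_entry


def partitioned_bytes(n_species: int, subset_size: int,
                      bytes_per_entry: int = 4) -> int:
    """Total bytes if species are split into disjoint subsets
    (a hypothetical partition), each with its own dense matrix."""
    full_subsets, remainder = divmod(n_species, subset_size)
    total = full_subsets * dense_matrix_bytes(subset_size, bytes_per_entry)
    if remainder:
        total += dense_matrix_bytes(remainder, bytes_per_entry)
    return total


if __name__ == "__main__":
    # 10 000 is the rough training limit and 331 270 the GG2 size,
    # both quoted in the abstract.
    for n in (10_000, 331_270):
        gb = dense_matrix_bytes(n) / 1e9
        print(f"{n:>7} species: one dense matrix ~ {gb:.1f} GB")

    split_gb = partitioned_bytes(331_270, 10_000) / 1e9
    print(f"331270 species in subsets of 10000 ~ {split_gb:.1f} GB in total")
```

Under these assumptions a single matrix for the full GG2 tree would need several hundred gigabytes, while per-subset matrices grow roughly linearly with the number of species. That is the general intuition behind a divide-and-conquer strategy; the actual memory use of C-DEPP depends on implementation details not given here.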

Funder

National Institutes of Health

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btae361/58238097/btae361.pdf

References (47 articles)

1. Deblur rapidly resolves single-nucleotide community sequence patterns; Amir; mSystems, 2017

2. Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0; Asnicar; Nat Commun, 2020

3. TreeCluster: clustering biological sequences using phylogenetic trees; Balaban; PLoS One, 2019

4. APPLES: scalable distance-based phylogenetic placement with or without alignments; Balaban; Syst Biol, 2020

5. Fast and accurate distance-based phylogenetic placement using divide and conquer; Balaban; Mol Ecol Resour, 2022