Abstract
Metagenomic binning, the process of grouping DNA sequences into taxonomic units, is critical for understanding the functions, interactions, and evolutionary dynamics of microbial communities. We propose a deep learning approach to binning that uses two neural networks, one based on composition and another on environmental abundance, and dynamically weights the contribution of each based on characteristics of the input data. Our network for composition-based binning, trained on over 43,000 prokaryotic genomes, is inspired by metric learning techniques used for facial recognition. Using a task-specific, multi-GPU accelerated algorithm to cluster the embeddings produced by our network, our binner leverages marker genes observed to be present in nearly all taxa to grade and select optimal clusters of sequences from a hierarchy of candidates. We evaluate our approach on four simulated datasets with known ground truth. Our linear-time integration of marker genes recovers more near-complete genomes than state-of-the-art but computationally infeasible solutions that use them, while being over an order of magnitude faster. Finally, we demonstrate the scalability and acuity of our approach by testing it on three of the largest metagenome assemblies ever performed. Compared to other binners, we produced 47% to 183% more near-complete genomes. From these datasets, we recover the genomes of over 3,000 candidate species that have never previously been cataloged, representing a potential 4% expansion of the known bacterial tree of life.
Publisher
Cold Spring Harbor Laboratory