Affiliation:
1. Department of Computer Engineering, San José State University, San José, CA, USA
2. Department of Computer Science, San José State University, San José, CA, USA
3. Department of Plant Biology, Carnegie Institution for Science,
Stanford, CA, USA
Abstract
Background:
Genome assembly tools reconstruct genomic sequences from raw sequencing data;
the reconstructed sequences are then used to identify the organisms present in a metagenomic
sample.
Methodology:
More recently, machine learning approaches have been applied to a variety of bioinformatics
problems, and in this paper we explore their use for organism identification. We start by
evaluating several commonly used metagenomic assembly tools, including PhyloFlash, MEGAHIT,
MetaSPAdes, Kraken2, Mothur, UniCycler, and PathRacer, and compare them against state-of-the-art
deep learning-based classification approaches, represented by DNABERT and
DeLUCS, on two synthetic mock community datasets.
Result:
Our analysis focuses on determining whether ensembling metagenome assembly tools with
machine learning tools has the potential to improve identification performance relative to using the
tools individually.
Conclusion:
We find that ensembling does have the potential to improve identification performance, and we
analyze how effective it is for organisms with different characteristics (based on factors such as
repetitiveness, genome size, and GC content).
Publisher
Bentham Science Publishers Ltd.