Author:
Bhowmik Oieswarya,Rahman Tazin,Kalyanaraman Ananth
Abstract
Abstract
Background
Genome assembly, which involves reconstructing a target genome, relies on scaffolding methods to organize and link partially assembled fragments. The rapid evolution of long read sequencing technologies toward more accurate long reads, coupled with the continued use of short read technologies, has created a unique need for hybrid assembly workflows. The construction of accurate genomic scaffolds in hybrid workflows is complicated due to scale, sequencing technology diversity (e.g., short vs. long reads, contigs or partial assemblies), and repetitive regions within a target genome.
Results
In this paper, we present a new parallel workflow for hybrid genome scaffolding that would allow combining pre-constructed partial assemblies with newly sequenced long reads toward an improved assembly. More specifically, the workflow, called , is aimed at generating long scaffolds of a target genome, from two sets of input sequences—an already constructed partial assembly of contigs, and a set of newly sequenced long reads. Our scaffolding approach internally uses an alignment-free mapping step to build a $$\langle $$
⟨
contig,contig$$\rangle $$
⟩
graph using long reads as linking information. Subsequently, this graph is used to generate scaffolds. We present and evaluate a graph-theoretic “wiring” heuristic to perform this scaffolding step. To enable efficient workload management in a parallel setting, we use a batching technique that partitions the scaffolding tasks so that the more expensive alignment-based assembly step at the end can be efficiently parallelized. This step also allows the use of any standalone assembler for generating the final scaffolds.
Conclusions
Our experiments with on a variety of input genomes, and comparison against two state-of-the-art hybrid scaffolders demonstrate that is able to generate longer and more accurate scaffolds substantially faster. In almost all cases, the scaffolds produced by are at least an order of magnitude longer (in some cases two orders) than the scaffolds produced by state-of-the-art tools. runs significantly faster too, reducing time-to-solution from hours to minutes for most input cases. We also performed a coverage experiment by varying the sequencing coverage depth for long reads, which demonstrated the potential of to generate significantly longer scaffolds in low coverage settings ($$1\times $$
1
×
–$$10\times $$
10
×
).
Funder
National Science Foundation
Publisher
Springer Science and Business Media LLC
Reference61 articles.
1. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. Genbank. Nucleic Acids Res. 2012;41(D1):D36–42.
2. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using sspace. Bioinformatics. 2011;27(4):578–9.
3. Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman JA, Chapuis G, Chikhi R, et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience. 2013;2(1):2047-217X.
4. Broder AZ. On the resemblance and containment of documents. In: Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 1997;21–29. IEEE.
5. Cechova M. Probably correct: rescuing repeats with short and long reads. Genes. 2020;12(1):48.