Abstract
Long read technologies are continuing to evolve at a rapid pace, with the latest high-fidelity technologies delivering reads over 10 Kbp with high accuracy (99.9%). Classical long read assemblers produce assemblies directly from long reads. Hybrid assembly workflows provide a way to combine partially constructed assemblies (or contigs) with newly sequenced long reads in order to generate improved and near-complete genomic scaffolds. Under either setting, the main computational bottleneck is the step of mapping the long reads, whether against other long reads or against pre-constructed contigs. While many tools implement the mapping step through alignments and overlap computations, alignment-free approaches offer better scalability. Designing a scalable alignment-free mapping tool while maintaining the accuracy of mapping (precision and recall) is a significant challenge. In this paper, we address the generic problem of mapping long reads to a database of subject sequences in a fast and accurate manner. More specifically, we present an efficient parallel algorithmic workflow, called JEM-mapper, that uses a new minimizer-based Jaccard estimator (or JEM) sketch to perform alignment-free mapping of long reads. For implementation and evaluation, we consider two application settings: (i) the hybrid scaffolding setting, where the goal is to map a large collection of long reads to a large collection of partially constructed assemblies or contigs; and (ii) the classical long read assembly setting, where the goal is to map long reads to one another to identify overlapping long reads. Our algorithms and implementations are designed for execution on distributed memory parallel machines. Experimental evaluation shows that our parallel algorithm is highly effective in producing high-quality mapping while significantly improving the time to solution compared to state-of-the-art mapping tools.
For instance, in the hybrid setting for a large genome, Betta splendens (≈350 Mbp genome), with 429K HiFi long reads and 98K contigs, JEM-mapper produces a mapping with 99.41% precision and 97.91% recall, while yielding a 6.9× speedup over a state-of-the-art mapper.
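To make the idea of alignment-free mapping with a minimizer-based Jaccard estimate concrete, the following is a minimal serial sketch. It is not the JEM sketch or the JEM-mapper workflow, whose details the abstract does not give; it is a generic MinHash computed over minimizer sets. All names (`minimizers`, `sketch`, `map_reads`) and the parameters `k`, `w`, and `trials` are assumptions chosen for illustration.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # modulus for the illustrative hash family (assumption)


def minimizers(seq, k=12, w=4):
    """Set of minimizers: the lexicographically smallest k-mer in each
    window of w consecutive k-mers (a simple, commonly used ordering)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence shorter than k")
    if len(kmers) <= w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def make_hashes(trials, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(trials)]


def sketch(seq, hashes, k=12, w=4):
    """MinHash sketch over the minimizer set: one minimum per hash function."""
    base = [zlib.crc32(m.encode()) for m in minimizers(seq, k, w)]
    return [min((a * x + b) % PRIME for x in base) for a, b in hashes]


def jaccard_estimate(s1, s2):
    """Fraction of sketch positions that agree, an estimate of the
    Jaccard similarity between the two minimizer sets."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)


def map_reads(reads, subjects, trials=64, k=12, w=4):
    """Map each read to the subject with the highest estimated similarity.
    Returns {read_name: (best_subject_name, estimated_similarity)}."""
    hashes = make_hashes(trials)
    subject_sketches = {n: sketch(s, hashes, k, w) for n, s in subjects.items()}
    result = {}
    for name, seq in reads.items():
        read_sketch = sketch(seq, hashes, k, w)
        scores = {n: jaccard_estimate(read_sketch, s)
                  for n, s in subject_sketches.items()}
        best = max(scores, key=scores.get)
        result[name] = (best, scores[best])
    return result
```

Calling `map_reads({"read1": read_seq}, {"contigA": seq_a, "contigB": seq_b})` returns, for each read, the best-scoring subject and its estimated similarity. No alignment is computed at any point: each sequence is reduced to a fixed-size sketch, so comparing a read to a subject costs time proportional to `trials` rather than to sequence length. This comparison loop is all-against-all and single-process; the distributed-memory parallelism described in the abstract is outside the scope of this sketch.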
Publisher
Cold Spring Harbor Laboratory