Efficient mapping of accurate long reads in minimizer space with mapquik

Published: 2023-06-30
ISSN: 1088-9051
Container-title: Genome Research
Language: en
Short-container-title: Genome Res.

Author:

Ekim Bariş, Sahlin Kristoffer, Medvedev Paul, Berger Bonnie, Chikhi Rayan

Abstract

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps, which are fundamental bottlenecks to read mapping, for both the human and maize genomes with >96% sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a 37× speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a 410× speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled from not only minimizer-space seeding but also a novel heuristic O(n) pseudochaining algorithm, which improves upon the long-standing O(n log n) bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
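The seeding idea in the abstract (k consecutively sampled minimizers form one seed, and only seeds occurring once in the reference are indexed) can be illustrated with a small sketch. This is not mapquik's implementation: the lexicographic window-minimizer scheme, the parameter names `l`, `w`, and `k`, and all function names are assumptions chosen for illustration, and reverse complements and the pseudochaining step are left out entirely.

```python
from collections import defaultdict


def minimizers(seq, l=5, w=4):
    """Window minimizers: the leftmost lexicographically smallest l-mer in
    each window of w consecutive l-mers (an assumed sampling scheme)."""
    lmers = [seq[i:i + l] for i in range(len(seq) - l + 1)]
    picked = []
    for start in range(len(lmers) - w + 1):
        window = lmers[start:start + w]
        # min over indices returns the leftmost position on ties
        pos = start + min(range(w), key=lambda j: window[j])
        if not picked or picked[-1][0] != pos:
            picked.append((pos, lmers[pos]))
    return picked


def k_min_mers(seq, k=3, l=5, w=4):
    """Each k-min-mer is a tuple of k consecutive minimizers, reported
    with the sequence position of its first minimizer."""
    mins = minimizers(seq, l, w)
    out = []
    for i in range(len(mins) - k + 1):
        chunk = mins[i:i + k]
        out.append((tuple(m for _, m in chunk), chunk[0][0]))
    return out


def build_unique_index(reference, k=3, l=5, w=4):
    """Index only the k-min-mers that occur exactly once in the reference."""
    seen = defaultdict(list)
    for key, pos in k_min_mers(reference, k, l, w):
        seen[key].append(pos)
    return {key: ps[0] for key, ps in seen.items() if len(ps) == 1}


def seed_hits(read, index, k=3, l=5, w=4):
    """Return (read_position, reference_position) for each read k-min-mer
    found in the unique index."""
    return [(rpos, index[key])
            for key, rpos in k_min_mers(read, k, l, w)
            if key in index]
```

Because every indexed key has a single reference position, each hit is unambiguous, so a read copied exactly from the reference produces hits that all agree on one offset. A highly repetitive reference, by contrast, contributes few or no entries to the index.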

Funder

National Science Foundation

National Institutes of Health

European Union's Horizon 2020

ANR Transipedia

SeqDigger

Inception

PRAIRIE

Publisher

Cold Spring Harbor Laboratory

Subject

Genetics (clinical), Genetics

Cited by 7 articles.

1. Improved sub-genomic RNA prediction with the ARTIC protocol;Nucleic Acids Research;2024-08-16

2. Designing efficient randstrobes for sequence similarity analyses;Bioinformatics;2024-03-29

3. The Application of Long-Read Sequencing to Cancer;Cancers;2024-03-25

4. Ultra-fast and High-quality Mapping of Error-prone Long Reads;2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM);2023-12-05

5. Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation;Bioinformatics;2023-08-21