Efficient mapping of accurate long reads in minimizer space with mapquik

Author:

Ekim Bariş, Sahlin Kristoffer, Medvedev Paul, Berger Bonnie, Chikhi Rayan

Abstract

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps, which are fundamental bottlenecks to read mapping, for both the human and maize genomes with >96% sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a 37× speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a 410× speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic O(n) pseudochaining algorithm, which improves upon the long-standing O(n log n) bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
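The seeding idea in the abstract can be illustrated with a minimal Python sketch. It is not mapquik's implementation: the function names, the CRC32-based hash, the threshold ("density") selection rule, and the tiny parameter values are all assumptions chosen for illustration. The sketch samples minimizers from a sequence, groups k consecutive minimizers into k-min-mers, indexes only the k-min-mers that occur exactly once in the reference, and looks up a read's k-min-mers in that index.

```python
import zlib
from collections import Counter

def sample_minimizers(seq, l=4, density=0.5):
    """Select l-mers whose hash falls below a threshold (assumed selection rule)."""
    threshold = int(density * 2**32)  # CRC32 values lie in [0, 2**32)
    selected = []
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        if zlib.crc32(lmer.encode()) < threshold:
            selected.append((i, lmer))
    return selected

def k_min_mers(seq, k=3, l=4, density=0.5):
    """Return (key, start, end) for every run of k consecutive minimizers."""
    mins = sample_minimizers(seq, l, density)
    result = []
    for j in range(len(mins) - k + 1):
        window = mins[j:j + k]
        key = tuple(lmer for _, lmer in window)
        start = window[0][0]       # position of the first minimizer
        end = window[-1][0] + l    # end (exclusive) of the last minimizer
        result.append((key, start, end))
    return result

def build_unique_index(reference, k=3, l=4, density=0.5):
    """Index only the k-min-mers that occur exactly once in the reference."""
    kmms = k_min_mers(reference, k, l, density)
    counts = Counter(key for key, _, _ in kmms)
    return {key: (start, end) for key, start, end in kmms if counts[key] == 1}

def seed_matches(read, index, k=3, l=4, density=0.5):
    """Return (read_start, read_end, ref_start, ref_end) for each indexed k-min-mer."""
    return [
        (r_start, r_end, *index[key])
        for key, r_start, r_end in k_min_mers(read, k, l, density)
        if key in index
    ]
```

Because each indexed k-min-mer has a single reference location, every seed match is unambiguous, which is the property the abstract credits for reducing extraneous matches. This sketch handles only the forward strand and exact minimizer matches; a real mapper would also need to handle reverse complements and chain the resulting seeds.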

Funder

National Science Foundation

National Institutes of Health

European Union's Horizon 2020

ANR Transipedia

SeqDigger

Inception

PRAIRIE

Publisher

Cold Spring Harbor Laboratory

Subject

Genetics (clinical), Genetics

Cited by 2 articles.

1. Ultra-fast and High-quality Mapping of Error-prone Long Reads; 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2023-12-05

2. UniAligner: a parameter-free framework for fast sequence alignment; Nature Methods; 2023-08-14
