A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases

Published: 2017-01-27

Authors:

Jain Chirag, Dilthey Alexander, Koren Sergey, Aluru Srinivas, Phillippy Adam M.

Abstract

Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥ 5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and > 60,000 genomes.
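
The abstract mentions MinHash-based identity estimation without giving details. As a rough illustration only, the following Python sketch shows generic bottom-s MinHash estimation of the Jaccard similarity between two k-mer sets, followed by a Mash-style conversion of Jaccard similarity to an identity estimate under a Poisson mutation model. All function names, the hash function, and the parameter values are assumptions made for this example; the paper's actual method also uses minimizers and may differ in its estimator.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Stable 64-bit hash of a k-mer (blake2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, s):
    """Bottom-s MinHash sketch: the s smallest distinct k-mer hashes."""
    return sorted({hash_kmer(x) for x in kmers(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the s smallest hashes of the union of both sketches and count
    the fraction that appear in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def identity_from_jaccard(j, k):
    """Mash-style identity estimate (assumed model, not taken from the abstract).

    Under a Poisson model of independent mutations, the per-base
    divergence is eps = -ln(2j / (1 + j)) / k and identity = 1 - eps.
    """
    if j <= 0.0:
        return 0.0
    eps = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - eps)


if __name__ == "__main__":
    # Hypothetical toy sequences; real reads are thousands of bases long.
    read = "ACGTTGCATGTCGCATGATGCATGAGAGCTACGTTGCA"
    ref = "ACGTTGCATGTCGCATGATGCATGAGAGCTTCGTTGCA"
    k, s = 8, 16
    j = jaccard_estimate(sketch(read, k, s), sketch(ref, k, s), s)
    print("Jaccard estimate:", j)
    print("Identity estimate:", identity_from_jaccard(j, k))
```

Larger sketch sizes reduce the variance of the Jaccard estimate at the cost of more memory per sequence, which is the general trade-off behind sketch-based mapping.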

Publisher

Cold Spring Harbor Laboratory

References (27 articles)

1. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

2. MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island

3. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

4. Broder, A.Z.: On the resemblance and containment of documents. In: Compression and Complexity of Sequences 1997. Proceedings. pp. 21–29. IEEE (1997)

5. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory

Cited by 15 articles.

1. Alignment of Single-Molecule Sequencing Reads by Enhancing the Accuracy and Efficiency of Locality-Sensitive Hashing; 2022-05-15

2. Fast and Accurate Algorithms for Mapping and Aligning Long Reads; Journal of Computational Biology; 2021-08-01

3. The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches; 2021-01-17

4. Improving the efficiency of de Bruijn graph construction using compact universal hitting sets; 2020-11-08

5. Raven: a de novo genome assembler for long reads; 2020-08-10