Pfp-fm: an accelerated FM-index

Published:2024-04-10 Issue:1 Volume:19 Page:
ISSN:1748-7188
Container-title:Algorithms for Molecular Biology
language:en
Short-container-title:Algorithms Mol Biol

Author:

Hong Aaron, Oliva Marco, Köppl Dominik, Bannai Hideo, Boucher Christina, Gagie Travis

Abstract

FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing (which takes parameters that let us tune the average length of the phrases) instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate that it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates count queries relative to all state-of-the-art methods, with a moderate increase in memory. The source code for PFP-FM is available at https://github.com/AaronHong1024/afm.
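The abstract's remark that a standard FM-index spends at least one random access per pattern character can be illustrated with a toy character-based counting query. The following is a minimal sketch and not the PFP-FM implementation: the function names `build_fm_index` and `count` are hypothetical, the index is built naively by sorting rotations, and the rank table is stored uncompressed, so it is only suitable for short example strings.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index: the C array, an Occ (rank) table, and the BWT length."""
    text += "$"  # sentinel; assumed to sort before every character of the text
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    bwt = "".join(rot[-1] for rot in rotations)

    # C[c] = number of characters in the text strictly smaller than c
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(bwt)


def count(pattern, C, occ, n):
    """Backward search. Returns (number of occurrences, backward steps taken)."""
    lo, hi = 0, n  # half-open range of suffix-array rows matching the suffix read so far
    steps = 0
    for ch in reversed(pattern):
        if ch not in C:
            return 0, steps
        # One backward step per character: each step needs rank lookups
        # at positions lo and hi, which are effectively random accesses.
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        steps += 1
        if lo >= hi:
            return 0, steps
    return hi - lo, steps


if __name__ == "__main__":
    index = build_fm_index("GATTACAGATTACA")
    print(count("GATTACA", *index))
```

In this sketch the number of backward steps equals the pattern length whenever the pattern occurs. A word-based index, as described in the abstract, instead takes steps over parsed phrases, so a query needs fewer steps when phrases span several characters.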

Funder

National Human Genome Research Institute, United States

National Science Foundation, United States

Japan Society for the Promotion of Science, Japan

Natural Sciences and Engineering Research Council of Canada

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s13015-024-00260-8.pdf

References: 27 articles.

1. Ferragina P, Fischer J. Suffix arrays on words. In: Ma B, Zhang K, editors. Proceedings of the 18th Annual Symposium Combinatorial Pattern Matching (CPM). London: Springer; 2007. p. 328–39.

2. Deng J-J, Hon W-K, Köppl D, Sadakane K. FM-indexing grammars induced by suffix sorting for long patterns. In: Deng JJ, editor. Proceedings of the IEEE Data Compression Conference (DCC). Snowbird: IEEE; 2022. p. 63–72.

3. Ferragina P, Manzini G. Indexing compressed text. J ACM. 2005;52:552–81.

4. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):25–25.

5. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Cornell Univ. 2013. https://doi.org/10.48550/arXiv.1303.3997.

Cited by 1 article.

1. Optimizing Data Parallelism for FM-Based Short-Read Alignment on the Heterogeneous Non-Uniform Memory Access Architectures. Future Internet. 2024-06-19.