Engineering a compressed suffix tree implementation

Published:2009-12 Volume:14
ISSN:1084-6654
Container-title:ACM Journal of Experimental Algorithmics
language:en
Short-container-title:ACM J. Exp. Algorithmics

Author:

Välimäki N.¹,Mäkinen V.¹,Gerlach W.²,Dixit K.³

Affiliation:

1. University of Helsinki, Helsinki, Finland

2. Bielefeld University, AG Genominformatik, Bielefeld

3. IIT Kanpur, New Delhi, India

Abstract

The suffix tree is one of the most important data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, main memory size often becomes the bottleneck. This is easily explained by the fact that while a DNA sequence of length n from alphabet Σ = {A, C, G, T} can be stored in n log |Σ| = 2n bits, its suffix tree occupies O(n log n) bits. In practice, the size difference easily reaches a factor of 50. We report on an implementation of the compressed suffix tree very recently proposed by Sadakane (2007). The compressed suffix tree occupies space proportional to the text size, that is, O(n log |Σ|) bits, and supports all typical suffix tree operations with at most a log n factor slowdown. Our experiments show that, for example, on a 10 MB DNA sequence, the compressed suffix tree takes 10% of the space of the normal suffix tree. At the same time, a representative algorithm is slowed down by a factor of 30. Our implementation follows the original proposal in spirit, but some internal parts are tailored toward practical implementation. Our construction algorithm has time requirement O(n log n log |Σ|) and uses nearly the same space as the final structure while constructing it: on the 10 MB DNA sequence, the maximum space usage during construction is only 1.5 times the final product size. As by-products, we develop a method to create a Succinct Suffix Array directly from the Burrows-Wheeler transform and a space-efficient version of the suffixes-insertion algorithm to build the balanced parentheses representation of a suffix tree from LCP information.
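
The space argument in the abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, the 10 MB sequence is assumed to hold one base per byte, and a single array of n text positions stands in for the O(n log n)-bit term.

```python
def packed_text_bits(n: int, sigma: int) -> int:
    """Bits to store n symbols using ceil(log2(sigma)) bits per symbol."""
    return n * (sigma - 1).bit_length()


def position_array_bits(n: int) -> int:
    """Bits for one array of n text positions, ceil(log2(n)) bits per entry."""
    return n * (n - 1).bit_length()


# Assumption: a 10 MB DNA sequence with one base per input byte.
n = 10 * 2**20

text_bits = packed_text_bits(n, 4)    # n log |Σ| = 2n bits for {A, C, G, T}
array_bits = position_array_bits(n)   # one n log n term

print(text_bits / 8 / 2**20)          # packed text size in MiB
print(array_bits / text_bits)         # ratio of one position array to the text
```

Under these assumptions a single position array is already 12 times larger than the packed text. A pointer-based suffix tree keeps several such fields per node, so the gap in practice is larger still.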

Funder

Suomen Akatemia

Publisher

Association for Computing Machinery (ACM)

Subject

Theoretical Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/1498698.1594228

References: 36 articles.

1. Replacing suffix trees with enhanced suffix arrays

2. Basic local alignment search tool

3. NATO ISI Series;Apostolico A.

4. Burrows M. and Wheeler D. 1994. A block sorting lossless data compression algorithm. Tech. rep. 124, Digital Equipment Corporation.

Cited by 12 articles.

1. Computing Lexicographic Parsings;2022 Data Compression Conference (DCC);2022-03

2. Space-efficient construction of compressed suffix trees;Theoretical Computer Science;2021-01

3. Extended suffix array construction using Lyndon factors;Sādhanā;2018-07-05

4. Lempel–Ziv Factorization Powered by Space Efficient Suffix Trees;Algorithmica;2017-07-25

5. Lempel Ziv Computation in Small Space (LZ-CISS);Combinatorial Pattern Matching;2015