Author:
Tamura Keiichi, Watanuki Yousuke, Kitakami Hajime, Takahashi Yoshifumi
Abstract
Suffix trees, which are trie structures that represent the suffixes of sequences (e.g., strings), are widely used for sequence search in application domains such as text data mining, bioinformatics, and computational biology. Suffix trees are particularly useful in bioinformatics applications because they can efficiently search for similar sub-sequences and extract frequent sequence patterns. In recent years, efficiently constructing suffix trees that allow faster sequence searches has become one of the most important challenges, because the amount of data stored in sequence databases has been increasing exponentially. This paper proposes a novel parallelization model for approximate sequence matching on a multi-core CPU using disk-based suffix trees, which are built on hard disk rather than in memory. In the proposed parallelization model, we divide the entire sequence database into two or more sub-databases called partitions. For each partition, we build a disk-based suffix tree, and we define a task as one approximate sequence matching operation on one disk-based suffix tree. Moreover, the proposed parallelization model includes a multiple buffering management system to avoid conflicts among CPU cores. We evaluated the proposed parallelization model on a PC using an actual amino acid sequence database. The experimental results show a substantial improvement in computation performance.
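The partition-and-task idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a plain in-memory scan with a Hamming-distance threshold stands in for the per-partition disk-based suffix tree, the multiple buffering management system is not modeled, and all names (`partition`, `search_partition`, `parallel_search`, `max_mismatches`) are hypothetical. A thread pool is used only to show the task structure; in CPython, a process pool would likely be needed to see a real multi-core speedup on CPU-bound matching.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(sequences, num_partitions):
    """Split the database round-robin into num_partitions sub-databases."""
    return [sequences[i::num_partitions] for i in range(num_partitions)]


def hamming_matches(text, pattern, max_mismatches):
    """Return start offsets where pattern occurs in text with at most
    max_mismatches substitutions (assumed similarity measure)."""
    hits = []
    m = len(pattern)
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


def search_partition(part, pattern, max_mismatches):
    """One task: approximate matching over a single partition.

    A linear scan stands in for querying that partition's
    disk-based suffix tree.
    """
    results = []
    for seq_id, seq in part:
        for start in hamming_matches(seq, pattern, max_mismatches):
            results.append((seq_id, start))
    return results


def parallel_search(sequences, pattern, max_mismatches,
                    num_partitions=2, workers=2):
    """Run one matching task per partition and merge the results."""
    parts = partition(list(enumerate(sequences)), num_partitions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(search_partition, p, pattern, max_mismatches)
                   for p in parts]
        merged = [hit for f in futures for hit in f.result()]
    return sorted(merged)
```

Because each task touches only its own partition, the tasks share no mutable state and their results can simply be concatenated and sorted at the end.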
Publisher
Springer Science and Business Media LLC