Affiliation:
1. College of Computer Science and Technology, Jilin University, 2699 Qianjin Street, Changchun 130012, China
2. Department of Computer Science, Jilin Business and Technology College, Changchun, China
Abstract
Full-text index structures are widely used in string matching and bioinformatics. Structures such as directed acyclic word graphs (DAWGs) and suffix trees allow fast searches on texts. In this paper, we present a new partition of the factors of a word, called a consistent minimal linear partition. Based on this partition, we introduce the weighted directed word graph (WDWG), a space-economical full-text index. WDWGs are in general cyclic, so without further restriction they could accept infinitely many strings. By assigning weights to edges, however, the accepted strings are limited to the factors of the input string. For a given word w, any factor of w can be indexed by a state of the WDWG together with the length of the factor. A WDWG of w has at most |w| states and 2|w| - 1 transition edges. We present an on-line algorithm that constructs a WDWG for a given word in time linear in the length of the word. Our experiments show that WDWGs are smaller than DAWGs on many data sets, including DNA sequences, Chinese texts and English texts.
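For orientation, the DAWG baseline that the abstract compares against can be sketched as follows. This is not the WDWG construction, which the abstract does not specify; it is the standard on-line suffix automaton algorithm, shown only to illustrate what a full-text index over the factors of a word looks like and how its size is measured. The names `State`, `build_dawg`, `is_factor` and `size` are illustrative assumptions, not taken from the paper.

```python
class State:
    """One DAWG state: outgoing transitions, suffix link, longest factor length."""
    __slots__ = ("next", "link", "len")

    def __init__(self, length=0, link=-1):
        self.next = {}      # character -> index of target state
        self.link = link    # suffix link (-1 for the initial state)
        self.len = length   # length of the longest factor reaching this state


def build_dawg(w):
    """Build the DAWG (suffix automaton) of w on-line, one character at a time."""
    states = [State()]
    last = 0  # state reached by the whole prefix read so far
    for ch in w:
        cur = len(states)
        states.append(State(states[last].len + 1))
        p = last
        # Add the new transition along the suffix-link chain where it is missing.
        while p != -1 and ch not in states[p].next:
            states[p].next[ch] = cur
            p = states[p].link
        if p == -1:
            states[cur].link = 0
        else:
            q = states[p].next[ch]
            if states[p].len + 1 == states[q].len:
                states[cur].link = q
            else:
                # Split q: the clone takes over the shorter factors.
                clone = len(states)
                c = State(states[p].len + 1, states[q].link)
                c.next = dict(states[q].next)
                states.append(c)
                while p != -1 and states[p].next.get(ch) == q:
                    states[p].next[ch] = clone
                    p = states[p].link
                states[q].link = clone
                states[cur].link = clone
        last = cur
    return states


def is_factor(states, s):
    """Return True if s is a factor (substring) of the indexed word."""
    v = 0
    for ch in s:
        if ch not in states[v].next:
            return False
        v = states[v].next[ch]
    return True


def size(states):
    """Return (number of states, number of transition edges)."""
    return len(states), sum(len(s.next) for s in states)
```

A DAWG built this way has at most 2|w| - 1 states and 3|w| - 4 transition edges for |w| >= 3, which is the baseline against which the WDWG bounds of |w| states and 2|w| - 1 edges are stated.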
Publisher
World Scientific Pub Co Pte Ltd
Subject
Computer Science (miscellaneous)