Affiliation:
1. University of Winnipeg, Canada
Abstract
In this paper, we discuss an efficient and effective index mechanism for the string matching with
k
differences, by which we will find all the substrings of a target string
y
of length
n
that align with a pattern string
x
of length
m
with not more than
k
insertions, deletions, and mismatches. A typical application is the searching of a DNA database, where the size of a genome sequence in the database is much larger than that of a pattern. For example,
n
is often on the order of millions or billions while
m
is just a hundred or a thousand. The main idea of our method is to transform
y
to a BWT-array as an index, denoted as
BWT
(
y
), and search
x
against it. The time complexity of our method is bounded by O(
k
· |
T
|), where
T
is a tree structure dynamically generated during a search of
BWT
(
y
). The average value of |
T
| is bounded by O(|Σ|
2
k
), where Σ is an alphabet from which we take symbols to make up target and pattern strings. This time complexity is better than previous strategies when
k
≤ O(log
|Σ|
n
). The general working process consists of two steps. In the first step,
x
is decomposed into a series of
l
small subpatterns, and
BWT
(
y
) is utilized to speedup the process to figure out all the occurrences of such subpatterns with ⌊
k/l
⌋ differences. In the second step, all the found occurrences in the first step will be rechecked to see whether they really match
x
, but with
k
differences. Extensive experiments have been conducted, which show that our method for this problem is promising.
Subject
General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development
Cited by
1 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献
1. Fast Top-k Similar Sequence Search on DNA Databases;Information Integration and Web Intelligence;2022