On the string matching with k differences in DNA databases

Published:2021-02 Issue:6 Volume:14 Page:903-915
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Chen Yangjun¹, Nguyen Hoang Hai¹

Affiliation:

1. University of Winnipeg, Canada

Abstract

In this paper, we discuss an efficient and effective index mechanism for string matching with k differences, by which we find all the substrings of a target string y of length n that align with a pattern string x of length m with no more than k insertions, deletions, and mismatches. A typical application is searching a DNA database, where a genome sequence in the database is much longer than a pattern. For example, n is often on the order of millions or billions, while m is just a hundred or a thousand. The main idea of our method is to transform y into a BWT-array as an index, denoted BWT(y), and search x against it. The time complexity of our method is bounded by O(k · |T|), where T is a tree structure dynamically generated during a search of BWT(y). The average value of |T| is bounded by O(|Σ|^{2k}), where Σ is the alphabet from which symbols are taken to make up target and pattern strings. This time complexity is better than that of previous strategies when k ≤ O(log_{|Σ|} n). The general working process consists of two steps. In the first step, x is decomposed into a series of l small subpatterns, and BWT(y) is used to speed up finding all occurrences of these subpatterns with ⌊k/l⌋ differences. In the second step, all occurrences found in the first step are rechecked to see whether they really match x with k differences. Extensive experiments have been conducted, which show that our method is promising.
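
The two-step process described in the abstract (filter by subpatterns, then recheck) can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes l = k + 1, so that ⌊k/l⌋ = 0 and each subpattern must occur exactly, and it uses Python's built-in `str.find` in place of a BWT index. The recheck uses the classical dynamic-programming recurrence for approximate matching (often attributed to Sellers). The function names `split_pattern`, `sellers_end_positions`, and `search_k_differences` are hypothetical and chosen for illustration only.

```python
def split_pattern(x, parts):
    """Split x into `parts` contiguous pieces; return (offset, piece) pairs."""
    base, extra = divmod(len(x), parts)
    pieces, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        pieces.append((start, x[start:start + size]))
        start += size
    return pieces


def sellers_end_positions(x, y, k):
    """Return end positions j (exclusive, j >= 1) such that some substring
    of y ending at j matches x with at most k differences."""
    m = len(x)
    col = list(range(m + 1))  # distances of x[:i] against the empty text
    ends = []
    for j, c in enumerate(y, 1):
        new = [0] * (m + 1)   # new[0] = 0: a match may start anywhere in y
        for i in range(1, m + 1):
            new[i] = min(
                col[i - 1] + (x[i - 1] != c),  # match or mismatch
                col[i] + 1,                    # extra symbol in y
                new[i - 1] + 1,                # symbol of x left unmatched
            )
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends


def search_k_differences(x, y, k):
    """Filter-and-verify search (assumes len(x) > k so no piece is empty)."""
    m, n = len(x), len(y)
    if m <= k:
        raise ValueError("pattern must be longer than k")
    ends = set()
    # Step 1 (filter): with k + 1 pieces and at most k differences, at least
    # one piece must occur in y without any difference (pigeonhole argument).
    for offset, piece in split_pattern(x, k + 1):
        s = y.find(piece)
        while s != -1:
            # Step 2 (verify): recheck only a window around the exact hit.
            lo = max(0, s - offset - k)
            hi = min(n, s - offset + m + k)
            for e in sellers_end_positions(x, y[lo:hi], k):
                ends.add(lo + e)
            s = y.find(piece, s + 1)  # continue, allowing overlapping hits
    return sorted(ends)
```

Under these assumptions, `search_k_differences(x, y, k)` should report the same end positions as running `sellers_end_positions(x, y, k)` over the whole of y, while only rechecking windows of length at most m + 2k around each exact subpattern hit. The paper's method differs in that it allows ⌊k/l⌋ differences per subpattern and locates the subpatterns through BWT(y).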

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3447689.3447695

Cited by 1 article.

1. Fast Top-k Similar Sequence Search on DNA Databases; Information Integration and Web Intelligence; 2022