Affiliation:
1. Technische Universität München, Germany
2. Purdue University, IN
Abstract
This article presents a new algorithm for finding oligonucleotide signatures that are specific and sensitive for organisms or groups of organisms in large-scale sequence datasets. We assume that the organisms have been organized in a hierarchy, for example, a phylogenetic tree. The resulting signatures, binding sites for primers and probes, match the maximum possible number of organisms in the target group while having at most k matches outside of the target group.
The key step in the algorithm is the use of the lowest common ancestor (LCA) to search the organism hierarchy; this allows the combinatorial problem to be solved in almost linear time (as observed empirically). Compared to the best-known previous algorithms, the presented algorithm improves both memory consumption and runtime by several orders of magnitude while giving identical, exact solutions.
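To illustrate the role of the LCA, the following is a minimal sketch and not the authors' implementation. It covers only the simplest case, k = 0, on a hypothetical toy hierarchy: the LCA of all organisms matched by a signature is the smallest group that the signature hits with no matches outside. The names PARENT, lca, leaves_under, and smallest_exact_group are assumptions made for this example, and the naive parent-pointer LCA stands in for whatever faster scheme a real implementation would use.

```python
from functools import reduce

# Hypothetical toy hierarchy: child -> parent (the root maps to None).
PARENT = {
    "root": None,
    "A": "root", "B": "root",
    "a1": "A", "a2": "A",
    "b1": "B", "b2": "B",
}

def depth(node):
    """Number of edges between node and the root."""
    d = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        d += 1
    return d

def lca(u, v):
    """Naive lowest common ancestor using parent pointers."""
    du, dv = depth(u), depth(v)
    # Lift the deeper node until both are at the same depth.
    while du > dv:
        u, du = PARENT[u], du - 1
    while dv > du:
        v, dv = PARENT[v], dv - 1
    # Climb in lockstep until the two paths meet.
    while u != v:
        u, v = PARENT[u], PARENT[v]
    return u

def leaves_under(node):
    """Set of organisms (leaves) in the subtree rooted at node."""
    children = [c for c, p in PARENT.items() if p == node]
    if not children:
        return {node}
    return set().union(*(leaves_under(c) for c in children))

def smallest_exact_group(matches):
    """For the organisms matched by one signature, return
    (group, organisms matched, group size), where group is the lowest
    node with zero matches outside of it (the k = 0 case)."""
    group = reduce(lca, matches)
    return group, len(set(matches)), len(leaves_under(group))
```

In this toy tree, a signature matching a1 and a2 maps to group A and covers both of its organisms, whereas a signature matching a1 and b1 only fits the root and covers two of its four organisms. Allowing k > 0 outgroup matches requires considering groups below the LCA as well, which this sketch does not attempt.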
This article gives a formal description of the algorithm, discusses details of our concrete, publicly available implementation, and presents the results from our performance evaluation.
Funder
Qatar Foundation
Division of Computer and Network Systems
Division of Computing and Communication Foundations
Publisher
Association for Computing Machinery (ACM)
Subject
Theoretical Computer Science