Abstract
An important initial phase of arguably most homology search and alignment methods, such as those required for genome alignments, is seed finding. The seed finding step is crucial to curb the runtime, as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only.

We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves high specificity by combining non-local information from different seeds using a simple hash function that requires only a small, constant amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. In this test, the number of false positives decreased about a million-fold relative to sets of spaced seeds, while a very high sensitivity was maintained.

An additional geometric hashing filtering phase could improve the runtime, the accuracy, or both for programs that perform various homology-search-and-align tasks.
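
The spaced seed idea mentioned in the abstract can be illustrated with a minimal sketch. It covers only plain spaced seed matching, not the geometric hashing filter itself, whose details the abstract does not give. The function names `spaced_kmers` and `find_seeds`, the pattern `1101`, and the toy sequences are assumptions made for illustration: a `1` marks a position that must match exactly, and a `0` marks a position that is ignored.

```python
from collections import defaultdict


def spaced_kmers(seq, pattern):
    """Yield (key, position) for every window of seq.

    The key keeps only the characters at the '1' positions of pattern;
    characters at '0' positions are ignored.
    """
    care = [i for i, c in enumerate(pattern) if c == "1"]
    for pos in range(len(seq) - len(pattern) + 1):
        window = seq[pos:pos + len(pattern)]
        yield "".join(window[i] for i in care), pos


def find_seeds(seq_a, seq_b, pattern):
    """Return sorted position pairs (pos_a, pos_b) whose windows agree
    at every '1' position of pattern."""
    # Index all spaced k-mers of the first sequence by their key.
    index = defaultdict(list)
    for key, pos in spaced_kmers(seq_a, pattern):
        index[key].append(pos)

    # Look up each spaced k-mer of the second sequence in that index.
    seeds = []
    for key, pos_b in spaced_kmers(seq_b, pattern):
        for pos_a in index.get(key, []):
            seeds.append((pos_a, pos_b))
    return sorted(seeds)


if __name__ == "__main__":
    # The two toy sequences differ at index 2 (G vs C), which falls on
    # the ignored position of the pattern for the windows starting at 0.
    print(find_seeds("ACGTACGT", "ACCTACGT", "1101"))
    print(find_seeds("ACGTACGT", "ACCTACGT", "1111"))
```

With the spaced pattern, the windows starting at position 0 in both sequences still form a seed despite the mismatch, whereas the contiguous pattern `1111` misses that pair. The spaced pattern also reports the off-diagonal pairs (0, 4) and (4, 0), which come from a repeat within the toy sequences; spurious hits of this kind are what a subsequent filtering step would aim to remove.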
Publisher
Cold Spring Harbor Laboratory