Abstract
The nearly exponential growth and heterogeneity of biological sequence data make retrieving biological sequences from databases more important and more challenging than ever. In this manuscript, we present a novel search algorithm built on an indexing scheme based on patterns discovered by natural language processing, i.e., short strings of nucleotides or amino acids, akin to standard k-mers but mined from cumulative cross-species omic data repositories. We benchmark the quality of the sequence retrieval process against BLASTP, the protein variant of BLAST, a heuristic algorithm for the alignment of genomic or protein sequence data. Our main argument is that retrieving biologically similar sequences does not require mimicking the alignment procedure performed by BLAST. Our results suggest that HYFT indexing and searching is a good alternative: a static, alignment-free method that retrieves homologous sequences down to 50% sequence identity.
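To make the idea of alignment-free, index-based retrieval concrete, the sketch below builds an inverted index from short substrings to the sequences that contain them and ranks database sequences by how many substrings they share with a query. It is an illustration only, not the HYFT method: plain fixed-length k-mers stand in for the mined patterns, and the function names, the choice k=3, and the shared-count ranking are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(db, k=3):
    """Map each k-mer to the set of sequence ids that contain it.

    db is a dict of {sequence id: sequence string}. Fixed-length k-mers are
    an assumption here; they stand in for the mined patterns in the text.
    """
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def search(index, query, k=3):
    """Rank sequence ids by the number of distinct k-mers shared with query.

    No alignment is computed: candidates come from index lookups alone.
    """
    hits = Counter()
    for kmer in kmers(query, k):
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    return hits.most_common()
```

The index is built once and reused for every query, so a lookup touches only the sequences that share at least one substring with the query. A realistic system would need a more selective pattern vocabulary and a scoring scheme that accounts for pattern position and sequence length.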
Publisher
Cold Spring Harbor Laboratory