A new paradigm for biological sequence retrieval inspired by natural language processing and database research

Published: 2023-11-09

Author:

Rousseau Axel-Jan, Lemal Sébastien, Korovin Yegor, Triantopoulos Georgios, Brands Ingrid, Biemans Maxim, Van Hyfte Dirk, Valkenborg Dirk

Abstract

Nearly exponential growth and heterogeneity of biological sequence data make the task of retrieving biological sequences from databases more important and challenging than ever. In this manuscript, we present a novel search algorithm involving an indexing scheme based on patterns discovered by natural language processing, i.e., short strings of nucleotides or amino acids, akin to standard k-mers, but mined from cumulative cross-species omic data repositories. More specifically, we benchmark the quality of the sequence retrieval process by comparing it to BLASTP, a heuristic algorithm for the alignment of genomic or protein sequence data. The main argument is that retrieving biologically similar sequences does not require mimicking the alignment procedure performed by BLAST. Our results suggest that HYFT indexing and searching is a good alternative: a static, alignment-free method that retrieves homologous sequences down to 50% sequence identity.
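
The general idea of alignment-free retrieval through a pattern index can be illustrated with a minimal sketch. It is not the HYFT method: the abstract does not specify how HYFT patterns are mined or indexed, so the sketch below substitutes plain fixed-length k-mers and a simple inverted index. The function names, the choice of k = 3, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(database, k=3):
    """Map each k-mer to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def search(query, index, k=3):
    """Rank sequences by the number of distinct k-mers shared with the query."""
    hits = defaultdict(int)
    for kmer in kmers(query, k):
        # .get avoids creating empty entries for unseen k-mers
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    # Highest shared count first; ties broken by id for a stable order
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Toy protein fragments, invented for illustration only (hypothetical data).
database = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYLAKQR",  # one substitution relative to seqA
    "seqC": "GGSPWWLNDE",  # unrelated
}
index = build_index(database)
results = search("MKTAYIAKQR", index)
print(results)
```

The index is built once and each query only looks up its own patterns, so no pairwise alignment is performed at search time. In this toy case the single substitution in seqB removes three of the eight shared 3-mers, which shows why the choice of patterns matters when sequence identity is low.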

Publisher

Cold Spring Harbor Laboratory
