Toward optimal fingerprint indexing for large scale genomics

Published: 2021-11-05

Author:

Agret Clément, Cazaux Bastien, Limasset Antoine

Abstract

Motivation: To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on Minhash or Hyperloglog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable as the number of documents to index grows.

Results: We present NIQKI, a novel structure using well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of Hyperminhash fingerprints into (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure, based on inverted indexes, that can index any kind of fingerprint and provides optimal queries, namely queries whose time is linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a matter of days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe that this approach can lead to tremendous improvement, allowing fast queries that scale to extensive genomic databases.

Availability and implementation: We wrote the NIQKI index as an open-source C++ library under the AGPL3 license, available at https://github.com/Malfoy/NIQKI. It is designed as a user-friendly tool and comes with usage samples.
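The inverted-index idea from the Results can be illustrated with a minimal sketch. This is not NIQKI's code or its (h,m)-HMH fingerprint: as a simplified stand-in, each genome is reduced to one truncated minimum hash per bucket, and the names `sketch` and `InvertedSketchIndex`, the parameters `k`, `buckets` and `bits`, and the choice of hash function are all assumptions made for illustration. The point it shows is that a query only walks the posting lists matching its own fingerprints, so the work tracks the number of hits rather than the size of the whole index.

```python
import hashlib
from collections import defaultdict


def hash64(kmer):
    """64-bit hash of a k-mer (hash choice is an illustrative assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, buckets, bits):
    """One fingerprint per bucket: the low `bits` bits of the bucket's minimum hash.

    A simplified stand-in for the paper's fingerprints, not (h,m)-HMH itself.
    Buckets that receive no k-mer stay None.
    """
    mins = [None] * buckets
    for i in range(len(seq) - k + 1):
        h = hash64(seq[i:i + k])
        b, v = h % buckets, h // buckets
        if mins[b] is None or v < mins[b]:
            mins[b] = v
    return [None if v is None else v % (1 << bits) for v in mins]


class InvertedSketchIndex:
    """Hypothetical inverted index: (bucket, fingerprint) -> list of genome ids."""

    def __init__(self, buckets=16, bits=8, k=5):
        self.buckets, self.bits, self.k = buckets, bits, k
        self.postings = [defaultdict(list) for _ in range(buckets)]
        self.names = []

    def add(self, name, seq):
        gid = len(self.names)
        self.names.append(name)
        for b, fp in enumerate(sketch(seq, self.k, self.buckets, self.bits)):
            if fp is not None:
                self.postings[b][fp].append(gid)

    def query(self, seq, min_shared=1):
        """Return (name, shared fingerprint count) pairs, best match first."""
        hits = defaultdict(int)
        for b, fp in enumerate(sketch(seq, self.k, self.buckets, self.bits)):
            if fp is None:
                continue
            # Only the posting list for this exact fingerprint is visited.
            for gid in self.postings[b].get(fp, ()):
                hits[gid] += 1
        ranked = sorted(hits.items(), key=lambda item: -item[1])
        return [(self.names[g], c) for g, c in ranked if c >= min_shared]
```

The shared-fingerprint count divided by the number of non-empty query buckets gives a rough similarity estimate in this toy setting; truncating fingerprints to fewer bits shrinks the index but raises the chance of spurious matches, which is the trade-off the abstract's tunable fingerprints address.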

Publisher

Cold Spring Harbor Laboratory

Cited by 2 articles.

1. A survey of k-mer methods and applications in bioinformatics;Computational and Structural Biotechnology Journal;2024-12

2. Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching;2023-06-24