Scalable sequence database search using partitioned aggregated Bloom comb trees

Published:2023-06-01 Issue:Supplement_1 Volume:39 Page:i252-i259
ISSN:1367-4803
Container-title:Bioinformatics
language:en

Author:

Marchet Camille¹, Limasset Antoine¹

Affiliation:

1. University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France

Abstract

Motivation: The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making immense public resources searchable is beyond the reach of alignment-based strategies. In recent years, an abundant literature has tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures that combine the ability to query small signatures or variants with scalability to collections of up to 10 000 eukaryotic samples.

Results: Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3- to 6-fold improvement in construction time compared to other compressed methods for comparable index size. In favorable instances, a PAC query can require a single random access and be performed in constant time. Using limited computation resources, we built PAC for very large collections. These include 32 000 human RNA-seq samples indexed in 5 days and the entire GenBank bacterial genome collection indexed in a single day, for an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed PAC’s ability to query 500 000 transcript sequences in less than an hour.

Availability and implementation: PAC’s open-source software is available at https://github.com/Malfoy/PAC.
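To make the idea of a k-mer-based approximate membership query concrete, here is a minimal Python sketch. It is not PAC's partitioned aggregated Bloom comb tree: it is the plain baseline of one Bloom filter per dataset, and every name in it (`BloomFilter`, `build_index`, `query`), the filter size, the number of hash functions, and the 0.8 hit threshold are assumptions chosen only for illustration.

```python
import hashlib
from typing import Dict, Iterator, List


class BloomFilter:
    """Fixed-size Bloom filter over strings (illustrative, not PAC's layout)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # One bit position per hash function, from salted BLAKE2b digests.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(
    datasets: Dict[str, str], k: int, num_bits: int = 1 << 16, num_hashes: int = 3
) -> Dict[str, BloomFilter]:
    """Build one Bloom filter per dataset, filled with that dataset's k-mers."""
    index = {}
    for name, seq in datasets.items():
        bf = BloomFilter(num_bits, num_hashes)
        for km in kmers(seq, k):
            bf.add(km)
        index[name] = bf
    return index


def query(
    index: Dict[str, BloomFilter], seq: str, k: int, threshold: float = 0.8
) -> List[str]:
    """Return datasets in which at least `threshold` of the query k-mers are found."""
    query_kmers = list(kmers(seq, k))
    hits = []
    for name, bf in index.items():
        found = sum(km in bf for km in query_kmers)
        if query_kmers and found / len(query_kmers) >= threshold:
            hits.append(name)
    return hits


if __name__ == "__main__":
    # Toy sequences, made up for this example.
    toy = {
        "sample_a": "ACGTACGTTAGCCGATAGCTAGGCTA",
        "sample_b": "TTTTGGGGCCCCAAAATTTTGGGGCC",
    }
    idx = build_index(toy, k=8)
    print(query(idx, "ACGTTAGCCGATAG", k=8))
```

Because a Bloom filter never produces false negatives, a query that is a true substring of a dataset is always reported for it; datasets that do not contain the query can still be reported occasionally as false positives. This baseline scans every filter for every query k-mer, which is the per-dataset cost that the structure described in the abstract is designed to reduce.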

Funder

Agence Nationale de la Recherche

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Link

https://academic.oup.com/bioinformatics/article-pdf/39/Supplement_1/i252/50741675/btad225.pdf

Reference: 32 articles.

1. Succinct dynamic de Bruijn graphs;Alipanahi;Bioinformatics,2021

2. A space and time-efficient index for the compacted colored de Bruijn graph;Almodaresi;Bioinformatics,2018

3. Basic local alignment search tool;Altschul;J Mol Biol,1990

4. Bidirectional variable-order de Bruijn graphs;Belazzougui;Int J Found Comput Sci,2018

Cited by 5 articles.

1. A survey of k-mer methods and applications in bioinformatics;Computational and Structural Biotechnology Journal;2024-12

2. LexicMap: efficient sequence alignment against millions of prokaryotic genomes;2024-08-31

3. MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29

4. Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA;Nature Computational Science;2024-02-26

5. Data Storage, collection, and Transmission in Smart Agriculture Using Bloom Filter;2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT);2023-07-06