Indexing and searching petabyte-scale nucleotide resources

Published: 2023-07-09

Author:

Shiryev Sergey A., Agarwala Richa

Abstract

Searching vast and rapidly growing sets of nucleotide content in data resources, such as runs in Sequence Read Archive and assemblies for whole genome shotgun sequencing projects in GenBank, is currently impractical in any reasonable amount of time or resources available to most researchers. We present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects that have short sequence matches to a user query with well-defined guarantees. Reported subjects are ranked using a score that considers the informativeness of the matches. Six databases that index over 3.5 petabases were created and used to illustrate the functionality of Pebblescout. Here we show that Pebblescout provides new research opportunities and a data-driven way for finding relevant subsets of large nucleotide resources for analysis, some of which are missed when relying only on sample metadata or tools using pre-defined reference sequences. For two computationally intensive published studies, we show that Pebblescout rejects a significant number of runs analyzed without changing the conclusions of these studies and finds additional relevant runs. A pilot web service for interactively searching the six databases is freely available at https://pebblescout.ncbi.nlm.nih.gov/
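
The abstract describes three steps: sampling sequences densely to build an index, matching a query against the index, and ranking subjects by how informative the matches are. The toy sketch below illustrates that general idea only. It is not Pebblescout's implementation: the window-minimum sampling scheme, the log-based weight, the parameter values k and w, and every function name here are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def sample_kmers(seq, k=5, w=4):
    """Toy dense sampling (assumed scheme): keep the lexicographically
    smallest k-mer from every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) <= w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def build_index(subjects, k=5, w=4):
    """Map each sampled k-mer to the set of subject ids that contain it."""
    index = defaultdict(set)
    for sid, seq in subjects.items():
        for kmer in sample_kmers(seq, k, w):
            index[kmer].add(sid)
    return index


def search(index, query, n_subjects, k=5, w=4):
    """Score subjects by summing a weight per shared sampled k-mer.
    The weight (an assumption) is larger for k-mers found in fewer
    subjects, so rare matches count as more informative."""
    scores = defaultdict(float)
    for kmer in sample_kmers(query, k, w):
        hits = index.get(kmer, ())
        if not hits:
            continue
        weight = math.log((n_subjects + 1) / len(hits))
        for sid in hits:
            scores[sid] += weight
    # Highest score first; ties broken by subject id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Hypothetical subjects standing in for runs or assemblies.
    subjects = {
        "run_A": "ACGTTGCAAGGCTAACGGAT",
        "run_B": "TTTTTTTTTTTTTTTTTTTT",
    }
    index = build_index(subjects)
    for sid, score in search(index, "TGCAAGGCTAAC", n_subjects=len(subjects)):
        print(sid, round(score, 3))
```

With window-minimum sampling, any query that occurs verbatim in a subject and spans at least w + k - 1 bases shares every one of its sampled k-mers with that subject, which is one simple way a sampled index can still offer a match guarantee.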

Publisher

Cold Spring Harbor Laboratory

Cited by 1 article.

1. Viroid-like colonists of human microbiomes; 2024-01-21