Author:
Shiryev Sergey A.,Agarwala Richa
Abstract
ABSTRACTSearching vast and rapidly growing sets of nucleotide content in data resources, such as runs in Sequence Read Archive and assemblies for whole genome shotgun sequencing projects in GenBank, is currently impractical in any reasonable amount of time or resources available to most researchers. We present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects that have short sequence matches to a user query with well-defined guarantees. Reported subjects are ranked using a score that considers the informativeness of the matches. Six databases that index over 3.5 petabases were created and used to illustrate the functionality of Pebblescout. Here we show that Pebblescout provides new research opportunities and a data-driven way for finding relevant subsets of large nucleotide resources for analysis, some of which are missed when relying only on sample metadata or tools using pre-defined reference sequences. For two computationally intensive published studies, we show that Pebblescout rejects a significant number of runs analyzed without changing the conclusions of these studies and finds additional relevant runs. A pilot web service for interactively searching the six databases is freely available athttps://pebblescout.ncbi.nlm.nih.gov/
Publisher
Cold Spring Harbor Laboratory
Cited by
1 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献