Author:
Alanko Jarno, Slizovskiy Ilya, Lokshtanov Daniel, Gagie Travis, Noyes Noelle, Boucher Christina
Abstract
Bait-enriched sequencing is a relatively new sequencing protocol that is becoming increasingly common, as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes (“baits”) is designed, manufactured, and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. This effectively enriches the DNA for which the probes were designed. Most recently, Metsky et al. (Nature Biotech 2019) demonstrated that bait enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples. In this work, we formalize the problem of designing baits by defining the Minimum Bait Cover problem, which aims to find the smallest possible set of bait sequences that cover every position of a set of reference sequences under an approximate matching model. We show that the problem is NP-hard, and that it remains NP-hard under very restrictive assumptions. This indicates that no polynomial-time exact algorithm exists for the problem unless P = NP, and that the problem is intractable even for small and deceptively simple inputs. In light of this, we design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as syotti. The running time of syotti shows linear scaling in practice, running at least an order of magnitude faster than state-of-the-art methods, including the recent method of Metsky et al. At the same time, our method produces bait sets that are smaller than the ones produced by the competing methods, while also leaving fewer positions uncovered. Lastly, we show that syotti requires only 25 minutes to design baits for a dataset comprising 3 billion nucleotides from 1000 related bacterial substrains, whereas the method of Metsky et al. shows clearly super-linear running time and fails to process even a subset of 8% of the data in 24 hours. Our implementation is publicly available at https://github.com/jnalanko/syotti.
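To make the covering objective concrete, the following is a minimal Python sketch of a naive greedy bait cover. It is not the syotti implementation and uses no succinct data structures. It assumes that "approximate matching" means Hamming distance of at most `max_mismatches` between a bait and an equal-length window, and it ignores reverse complements. The names `hamming_within`, `greedy_bait_cover`, `bait_len`, and `max_mismatches` are hypothetical and chosen only for illustration.

```python
def hamming_within(a, b, max_mismatches):
    """Return True if equal-length strings a and b differ in at most max_mismatches positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def greedy_bait_cover(sequences, bait_len, max_mismatches):
    """Naive greedy cover (illustrative assumption, quadratic time).

    Scans each sequence left to right. At the first uncovered position it
    takes the window starting there as a new bait, then marks every window
    in every sequence that lies within max_mismatches of that bait.
    Returns (baits, covered), where covered[i][p] is the cover flag of
    position p in sequences[i].
    """
    covered = [[False] * len(s) for s in sequences]
    baits = []
    for i, seq in enumerate(sequences):
        if len(seq) < bait_len:
            continue  # too short to hold a bait; left uncovered in this sketch
        for pos in range(len(seq)):
            if covered[i][pos]:
                continue
            # Clamp the window so it fits inside the sequence.
            start = min(pos, len(seq) - bait_len)
            bait = seq[start:start + bait_len]
            baits.append(bait)
            # Mark all positions covered by approximate matches of the bait.
            for j, other in enumerate(sequences):
                for k in range(len(other) - bait_len + 1):
                    if hamming_within(bait, other[k:k + bait_len], max_mismatches):
                        for t in range(k, k + bait_len):
                            covered[j][t] = True
    return baits, covered


if __name__ == "__main__":
    refs = ["ACGTACGT", "ACGTTCGT"]
    baits, covered = greedy_bait_cover(refs, bait_len=4, max_mismatches=1)
    print(baits)
    print(all(all(row) for row in covered))
```

A greedy scan of this kind gives no guarantee of a minimum-size bait set, which is consistent with the hardness result: it trades optimality for a simple, predictable procedure. The marking step here compares every bait against every window, so it would not scale to the dataset sizes reported in the abstract.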
Publisher
Cold Spring Harbor Laboratory