Spaced Seed Data Structures for De Novo Assembly

Published:2015 Issue: Volume:2015 Page:1-8
ISSN:2314-436X
Container-title:International Journal of Genomics
language:en
Short-container-title:International Journal of Genomics

Author:

Birol Inanç¹, Chu Justin¹, Mohamadi Hamid¹, Jackman Shaun D.¹, Raghavan Karthika¹, Vandervalk Benjamin P.¹, Raymond Anthony¹, Warren René L.¹

Affiliation:

1. Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada V5Z 4S6

Abstract

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
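
As a rough illustration of the idea in the abstract, the sketch below extracts paired subsequences separated by a fixed gap and stores them implicitly in a Bloom filter. It is not the authors' implementation: the names `spaced_seed_pairs` and `BloomFilter`, the parameters `k`, `gap`, `num_bits`, and `num_hashes`, and the choice of salted BLAKE2b hashing are all assumptions made for this example.

```python
import hashlib


def spaced_seed_pairs(read, k, gap):
    """Yield (left, right) k-mer pairs separated by `gap` bases (hypothetical helper)."""
    span = 2 * k + gap  # total bases covered by one spaced seed
    for i in range(len(read) - span + 1):
        left = read[i:i + k]
        right = read[i + k + gap:i + span]
        yield left, right


class BloomFilter:
    """Minimal Bloom filter; sizes and hash scheme are illustrative assumptions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting BLAKE2b with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Usage: store every spaced seed of a read, then query one of them.
bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
for left, right in spaced_seed_pairs("ACGTACGTAC", k=3, gap=2):
    bf.add(left + ":" + right)
print("ACG:CGT" in bf)
```

Only the two flanking k-mers are hashed, so the bases inside the gap do not affect the stored key. A Bloom filter never returns a false negative, but it can return false positives at a rate that depends on `num_bits`, `num_hashes`, and the number of inserted seeds.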

Funder

Genome Canada

Publisher

Hindawi Limited

Subject

Pharmaceutical Science, Genetics, Molecular Biology, Biochemistry

Link

http://downloads.hindawi.com/journals/ijg/2015/196591.pdf

References: 38 articles.

1. The ClinSeq Project: Piloting large-scale genome sequencing for research in genomic medicine

2. Digital Fetal Aneuploidy Diagnosis by Next-Generation Sequencing

3. Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma

Cited by 5 articles.

1. ALeS: adaptive-length spaced-seed design;Bioinformatics;2020-12-07

2. RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes;Genome Research;2020-08

3. Calibrating Seed-Based Heuristics to Map Short Reads With Sesame;Frontiers in Genetics;2020-06-25

4. Skip-mers: increasing entropy and sensitivity to detect conserved genic regions with simple cyclic q-grams;2017-08-23

5. Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds;Algorithms for Molecular Biology;2017-02-14