Efficient de novo assembly of large genomes using compressed data structures

Published: 2011-12-07 Issue: 3 Volume: 22 Page: 549-556
ISSN: 1088-9051
Container-title: Genome Research
Language: en
Short-container-title: Genome Res.

Author:

Simpson Jared T., Durbin Richard

Abstract

De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.
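To make the abstract's central data structure concrete, the following is a minimal sketch of an FM-index built on the Burrows-Wheeler transform, with backward search used to count exact occurrences of a pattern. It is not SGA's implementation: the names `bwt`, `FMIndex`, and `count` are hypothetical, the transform is built by naively sorting rotations, and the occurrence tables are stored uncompressed, whereas the paper's point is precisely that a compressed representation keeps memory low.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    # "$" is an assumed sentinel that sorts before A, C, G, T in ASCII order.
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


class FMIndex:
    """Uncompressed FM-index supporting exact-match counting."""

    def __init__(self, text):
        self.bwt = bwt(text)
        counts = Counter(self.bwt)

        # C[c]: number of symbols in the text that are strictly smaller than c.
        self.C = {}
        total = 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]

        # occ[c][i]: number of occurrences of c in bwt[:i].
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c in counts:
                self.occ[c][i + 1] = self.occ[c][i] + (1 if c == ch else 0)

    def count(self, pattern):
        """Count occurrences of pattern using backward search."""
        lo, hi = 0, len(self.bwt)  # half-open interval of matching suffixes
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


index = FMIndex("GATTACAGATTACA")
print(index.count("ATTA"))
```

Backward search processes the pattern from its last character to its first, narrowing an interval of sorted suffixes at each step. The same interval-narrowing idea underlies finding reads that share a suffix or prefix with a query, which is how an FM-index can support overlap computation.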

Publisher

Cold Spring Harbor Laboratory

Subject

Genetics (clinical), Genetics

References: 28 articles.

1. Bauer MJ, Cox AJ, Rosone G. 2011. Lightweight BWT construction for very large string collections. In Proceedings of the twenty-second annual symposium, Combinatorial Pattern Matching, pp. 219–231. Springer-Verlag, Berlin, Heidelberg.

2. Accurate whole human genome sequencing using reversible terminator chemistry

3. Ray: Simultaneous Assembly of Reads from a Mix of High-Throughput Sequencing Technologies

4. Burrows M, Wheeler DJ. 1994. A block-sorting lossless data compression algorithm. Digital SRC Research Report. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.6774

5. Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology

Cited by 605 articles.

1. SedaDNA reveals mid-to late Holocene aquatic plant and algae changes in Luanhaizi Lake on the Tibetan Plateau;Palaeogeography, Palaeoclimatology, Palaeoecology;2024-09

2. GenArchBench: A genomics benchmark suite for arm HPC processors;Future Generation Computer Systems;2024-08

3. Untapped Potential of Poly(ADP-Ribose) Polymerase Inhibitors: Lessons Learned From the Real-World Clinical Homologous Recombination Repair Mutation Testing;World Journal of Oncology;2024-08

4. Unlocking plant genetics with telomere-to-telomere genome assemblies;Nature Genetics;2024-07-24

5. De novo transcriptome assembly and discovery of drought-responsive genes in eastern white spruce (Picea glauca);2024-06-13