A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework

Published:2012-12 Issue:S7 Volume:13 Page:
ISSN:1471-2164
Container-title:BMC Genomics
language:en
Short-container-title:BMC Genomics

Author:

Chang Yu-Jung, Chen Chien-Chih, Chen Chuen-Liang, Ho Jan-Ming

Abstract

Background: State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate reads longer than 150 bp and up to 600 Gbp of data per run. Compared with previous technologies, they produce longer reads at greater sequencing depth. Two major challenges arise in using this technology for de novo genome assembly. First, physical memory may be insufficient to hold the data structures of the assembly algorithm, even on high-end multicore machines. Second, the graph-theoretical model used to capture intersection relationships of the reads may contain structural defects that existing assembly algorithms do not handle well.

Results: We developed CloudBrush, a distributed genome assembler based on string graphs and the MapReduce framework. The assembler includes a novel edge-adjustment algorithm that detects structural defects by examining the neighboring reads of a given read for sequencing errors and adjusting the edges of the string graph where necessary. We evaluated CloudBrush against the GAGE benchmarks to compare its assembly quality with that of other assemblers. The results show that our assemblies have a moderate N50 and a low rate of misassemblies, namely misjoins and indels of > 5 bp. In addition, we introduce two measures, precision and recall, to assess how faithfully contigs align to the target genome. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush produces contigs with high precision and recall. We also verified the effectiveness of the edge-adjustment algorithm using simulated datasets and ran CloudBrush on a nematode dataset using a commercial cloud. The CloudBrush assembler is available at https://github.com/ice91/CloudBrush.
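The abstract reports assembly quality in terms of N50. As a reading aid, the following is a minimal sketch of the conventional N50 definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases. The function name `n50` and the sample lengths are illustrative assumptions and are not taken from CloudBrush or the GAGE evaluation scripts.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty).

    Illustrative helper only; not part of CloudBrush or GAGE.
    """
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings coverage to at least half.
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical contig lengths in bp, chosen only for demonstration.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

N50 rewards contiguity only, so a high value can hide misjoins. This is presumably why the abstract pairs it with misassembly counts and with precision and recall.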

Publisher

Springer Science and Business Media LLC

Subject

Genetics, Biotechnology

Link

http://link.springer.com/content/pdf/10.1186/1471-2164-13-S7-S28.pdf

Reference: 23 articles.

1. Stein LD: The case for cloud computing in genome informatics. Genome Biology. 2010, 11: 207. 10.1186/gb-2010-11-5-207.

2. Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics. 2010, 95: 315-327. 10.1016/j.ygeno.2010.03.001.

3. Pevzner P, Tang H, Waterman M: Fragment assembly with double-barreled data. Proceedings of the National Academy of Sciences. 2001, 98 (17): 9748-9753. 10.1073/pnas.171285098.

4. Zerbino D, Birney E: Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. Genome Research. 2008

5. Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Research. 2008, 18: 324. 10.1101/gr.7088808.

Cited by 21 articles.

1. Genome assembly and annotation;Bioinformatics;2022

2. Cloud Computing Enabled Big Multi-Omics Data Analytics;Bioinformatics and Biology Insights;2021-01

3. SMusket: Spark-based DNA error correction on distributed-memory systems;Future Generation Computer Systems;2020-10

4. SparkBeagle;Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics;2020-09-21

5. A Survey of Methods and Tools for Large-Scale DNA Mixture Profiling;Smart Infrastructure and Applications;2019-06-21