Large-scale parallel genome assembler over cloud computing environment

Published: 2017-05-23, Issue: 03, Volume: 15, Page: 1740003
ISSN: 0219-7200
Container-title: Journal of Bioinformatics and Computational Biology
Language: en
Short-container-title: J. Bioinform. Comput. Biol.

Author:

Das Arghya Kusum¹, Koppa Praveen Kumar¹, Goswami Sayan¹, Platania Richard¹, Park Seung-Jong¹

Affiliation:

1. School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, 340 East Parker Blvd, Baton Rouge, Louisiana 70803, USA

Abstract

The size of high-throughput DNA sequencing data has already reached the terabyte scale. To manage this volume of data, many downstream sequencing applications have started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay-as-you-go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, both developing scalable data-intensive bioinformatics applications with this model and understanding the hardware environment that these applications require for good performance need further research. In this paper, we present a de Bruijn graph-oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) on a traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure instead of a traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of a traditional HPC cluster.
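
As a rough illustration of the idea the abstract names, the sketch below builds a de Bruijn graph from short reads in a map/reduce style using plain Python on a single machine. It is not GiGA's implementation and does not use Hadoop or Giraph; the function names (map_read, reduce_edges, build_de_bruijn), the toy reads, and the choice of k are all assumptions made for this example. The map step emits one edge per k-mer, from its (k-1)-mer prefix to its (k-1)-mer suffix, and the reduce step groups edges by source node.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k):
    """Map step (illustrative): emit one (prefix, suffix) edge per k-mer."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        # Nodes are (k-1)-mers; each k-mer is an edge between two of them.
        yield kmer[:-1], kmer[1:]


def reduce_edges(pairs):
    """Reduce step (illustrative): group edges by their source node."""
    graph = defaultdict(set)
    for src, dst in pairs:
        graph[src].add(dst)
    # Sort successors so the result is deterministic.
    return {src: sorted(dsts) for src, dsts in graph.items()}


def build_de_bruijn(reads, k):
    """Run the map step over every read, then reduce all emitted edges."""
    return reduce_edges(chain.from_iterable(map_read(r, k) for r in reads))


# Toy input, invented for this example.
reads = ["ACGTAC", "GTACGA"]
graph = build_de_bruijn(reads, k=4)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

In a distributed setting, the map step would presumably run next to each block of read data and the grouped adjacency lists would become the vertices of a graph-processing job; how GiGA actually partitions and simplifies the graph is not described in this abstract.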

Publisher

World Scientific Pub Co Pte Lt

Subject

Computer Science Applications, Molecular Biology, Biochemistry

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0219720017400030

Cited by 6 articles.

1. Distributed RMI-DBG model: Scalable iterative de Bruijn graph algorithm for short read genome assembly problem;Expert Systems with Applications;2023-12

2. RMI-DBG algorithm: A more agile iterative de Bruijn graph algorithm in short read genome assembly;Journal of Bioinformatics and Computational Biology;2021-04

3. A hybrid and scalable error correction algorithm for indel and substitution errors of long reads;BMC Genomics;2019-12

4. A Systematic Mapping Study of Cloud Large-Scale Foundation—Big Data, IoT, and Real-Time Analytics;Data Management, Analytics and Innovation;2019-10-25

5. A Cloud-aware Autonomous Workflow Engine and Its Application to Gene Regulatory Networks Inference;Proceedings of the 8th International Conference on Cloud Computing and Services Science;2018