Scalable genomics: from raw data to aligned reads on Apache YARN

Published: 2016-08-23

Author:

Versaci Francesco, Pireddu Luca, Zanetti Gianluigi

Abstract

The adoption of Big Data technologies can potentially boost the scalability of data-driven biology and health workflows by orders of magnitude. Consider, for instance, that technologies in the Hadoop ecosystem have been used successfully in data-driven industry to scale processes to levels much larger than any biology- or health-driven work attempted thus far. In this work we demonstrate the scalability of a sequence alignment pipeline based on technologies from the Hadoop ecosystem – namely, Apache Flink and Hadoop MapReduce, both running on the distributed Apache YARN platform. Unlike previous work, our pipeline starts processing directly from the raw BCL data produced by Illumina sequencers. A Flink-based distributed algorithm reconstructs reads from the Illumina BCL data and then demultiplexes them, analogously to the bcl2fastq2 program provided by Illumina. Subsequently, the BWA-MEM-based distributed aligner from the Seal project is used to perform read mapping on the YARN platform. While the standard programs by Illumina and BWA-MEM are limited to shared-memory parallelism (multi-threading), our solution is completely distributed and can scale across a large number of computing nodes. Results show excellent pipeline scalability, linear in the number of nodes. In addition, this approach automatically benefits from the robustness to hardware failure and transient cluster problems provided by the YARN platform, as well as from the scalability of the Hadoop Distributed File System. Moreover, this YARN-based approach complements the upcoming version 4 of the GATK toolkit, which is based on Spark and can therefore run on YARN. Together, they can be used to form a complete, scalable, YARN-based variant calling pipeline for Illumina data, which will be further improved with the arrival of distributed in-memory filesystem technology such as Apache Arrow, thus removing the need to write intermediate data to disk.
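
To make the read reconstruction and demultiplexing step more concrete, here is a minimal single-process sketch in Python. It is not the authors' Flink implementation. It assumes the commonly documented BCL byte layout (base call in the two low bits, quality score in the six high bits, a zero byte meaning "no call") and, for simplicity, a hypothetical layout in which the sample barcode occupies the first cycles of each read. The function names and the `sample_by_barcode` mapping are illustrative only.

```python
from collections import defaultdict

BASES = "ACGT"  # assumed encoding: 0 -> A, 1 -> C, 2 -> G, 3 -> T


def decode_bcl_byte(b):
    """Decode one BCL byte into a (base, quality) pair."""
    if b == 0:  # a zero byte is treated as a no-call
        return "N", 0
    return BASES[b & 0b11], b >> 2


def reconstruct_reads(cycles):
    """Rebuild per-cluster reads from per-cycle data.

    cycles[i][j] is the byte for cycle i and cluster j, so a read is
    obtained by transposing: one byte per cycle for the same cluster.
    """
    reads = []
    for cluster_bytes in zip(*cycles):
        decoded = [decode_bcl_byte(b) for b in cluster_bytes]
        seq = "".join(base for base, _ in decoded)
        quals = [q for _, q in decoded]
        reads.append((seq, quals))
    return reads


def demultiplex(reads, barcode_len, sample_by_barcode):
    """Group reads by sample using the first barcode_len bases.

    Simplifying assumption: the barcode is the read prefix. Reads whose
    barcode is unknown are collected under "undetermined".
    """
    by_sample = defaultdict(list)
    for seq, quals in reads:
        barcode, insert = seq[:barcode_len], seq[barcode_len:]
        sample = sample_by_barcode.get(barcode, "undetermined")
        by_sample[sample].append((insert, quals[barcode_len:]))
    return dict(by_sample)


# Four cycles, two clusters; each value is (quality << 2) | base_code.
cycles = [bytes([120, 123]), bytes([121, 123]),
          bytes([122, 0]), bytes([123, 120])]
reads = reconstruct_reads(cycles)
samples = demultiplex(reads, barcode_len=2, sample_by_barcode={"AC": "sample1"})
```

In a distributed setting, the transposition in `reconstruct_reads` is the part that requires moving data between nodes, since each cycle is stored separately while each read spans all cycles.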

Publisher

Cold Spring Harbor Laboratory
