SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark

Published:2020-01-03 Issue:1 Volume:11 Page:53
ISSN:2073-4425
Container-title:Genes
language:en
Short-container-title:Genes

Author:

Al-Ars Zaid,Wang Saiyi,Mushtaq Hamid

Abstract

The rapid proliferation of low-cost RNA-seq data has resulted in a growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating discoveries of other analysis results. However, many practical applications in this field are limited by the available computational resources and associated long computing time needed to perform the analysis. GATK has a popular best practices pipeline specifically designed for variant calling RNA-seq analysis. Some tools in this pipeline are not optimized to scale the analysis to multiple processors or compute nodes efficiently, thereby limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark based pipeline to efficiently scale up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a dataset of 32 GB. In contrast, SparkRA is able to reduce the overall computation time of the pipeline on the same single node by about 4×, reducing the computation time down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA is able to further reduce this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of the results.

Publisher

MDPI AG

Subject

Genetics (clinical),Genetics

Link

https://www.mdpi.com/2073-4425/11/1/53/pdf

References: 25 articles.

1. From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline;Van der Auwera;Curr. Protoc. Bioinform.,2013

2. Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics

3. Halvade: scalable sequence analysis with MapReduce

4. MapReduce

Cited by 8 articles.

1. GAPiM: Discovering Genetic Variations on a Real Processing-in-Memory System;2023-08-18

2. GAPiM: Discovering Genetic Variations on a Real Processing-in-Memory System;2023-07-29

3. Framing Apache Spark in life sciences;Heliyon;2023-02

4. SparkFlow: Towards High-Performance Data Analytics for Spark-based Genome Analysis;2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid);2022-05

5. Halvade somatic: Somatic variant calling with Apache Spark;GigaScience;2022