High-performance genomic analysis framework with in-memory computing

Published:2018-03-23 Issue:1 Volume:53 Page:317-328
ISSN:0362-1340
Container-title:ACM SIGPLAN Notices
language:en
Short-container-title:SIGPLAN Not.

Author:

Li Xueqi¹, Tan Guangming¹, Wang Bingchen², Sun Ninghui²

Affiliation:

1. Institute of Computing Technology, Chinese Academy of Sciences

2. University of Chinese Academy of Sciences

Abstract

In this paper, we propose an in-memory computing framework (called GPF) that provides a set of genomic formats, APIs, and a fast genomic engine for large-scale genomic data processing. GPF comprises two main components: (1) scalable genomic data formats and APIs, and (2) an advanced execution engine that supports efficient compression of genomic data and eliminates redundancies in the execution engine. We further present both system-specific and algorithm-specific implementations that let users build genomic analysis pipelines without any knowledge of Spark parallel programming. To test the performance of GPF, we built a Whole-Genome-Sequencing (WGS) pipeline on top of it as a test case. Our experimental data indicate that GPF completes WGS analysis of the 146.9G-base Human Platinum Genome in 24 minutes, with over 50% parallel efficiency on 2048 CPU cores. Together, GPF provides a fast and general in-memory engine for large-scale genomic data processing.
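To make the scaling claim in the abstract concrete, the following minimal Python sketch shows how relative parallel efficiency and throughput are commonly computed. Only the 24-minute runtime, the 2048 cores, and the 146.9G-base input come from the abstract. The baseline run (128 cores, 200 minutes) and the function name `parallel_efficiency` are hypothetical, chosen for illustration; the abstract does not state which baseline the authors used.

```python
def parallel_efficiency(t_base, cores_base, t_scaled, cores_scaled):
    """Relative parallel efficiency: achieved speedup divided by the
    factor by which the core count grew (1.0 means perfect scaling)."""
    if min(t_base, cores_base, t_scaled, cores_scaled) <= 0:
        raise ValueError("times and core counts must be positive")
    speedup = t_base / t_scaled
    core_ratio = cores_scaled / cores_base
    return speedup / core_ratio


# Values reported in the abstract.
scaled_minutes = 24.0
scaled_cores = 2048
total_bases = 146.9e9

# HYPOTHETICAL baseline, not taken from the paper.
baseline_minutes = 200.0
baseline_cores = 128

efficiency = parallel_efficiency(
    baseline_minutes, baseline_cores, scaled_minutes, scaled_cores
)

# Aggregate throughput implied by the reported runtime.
bases_per_second = total_bases / (scaled_minutes * 60)

print(f"relative parallel efficiency: {efficiency:.2%}")
print(f"throughput: {bases_per_second:,.0f} bases/s")
```

Under this assumed baseline the efficiency comes out slightly above 50%, which is the kind of figure the abstract describes. A different baseline would give a different value, so the number only becomes meaningful alongside the baseline configuration reported in the paper itself.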

Funder

The National Key Research and Development Program of China

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3200691.3178511

Cited by 3 articles.

1. OPERA-gSAM: Big Data Processing Framework for UMI Sequencing at High Scalability and Efficiency;2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW);2023-05

2. scSpark^XMBD: High-Performance scRNA-seq Data Processing with Spark;2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM);2021-12-09

3. Scaling Genomics Data Processing with Memory-Driven Computing to Accelerate Computational Biology;Lecture Notes in Computer Science;2020