SparkGC: Spark based genome compression for large collections of genomes

Published:2022-07-25 Issue:1 Volume:23 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Yao Haichang, Hu Guangyong, Liu Shangdong, Fang Houzhi, Ji Yimu

Abstract

Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. As a consequence, it has become extremely difficult to store, back up, and migrate the enormous volume of genomic datasets, which continue to expand as the cost of sequencing decreases. Hence, a much more efficient and scalable genome compression program is urgently required. In this manuscript, we propose a new Apache Spark based genome compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark’s in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is at least 30% better than that of the best state-of-the-art methods. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on only one worker node, and it scales quite well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC.

Funder

Scientific Research Start-up Foundation of Nanjing Vocational University of Industry Technology

Modern Educational Technology Research Program of Jiangsu Province in 2022

Research Project of Chinese National Light Industry Vocational Education and Teaching Steering Committee in 2021

National Key R&D Program of China

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/s12859-022-04825-5.pdf

Reference: 37 articles.