PgRC: pseudogenome-based read compressor

Published:2019-12-09 Issue:7 Volume:36 Page:2082-2089
ISSN:1367-4803
Container-title:Bioinformatics
language:en

Author:

Kowalski Tomasz M¹, Grabowski Szymon¹

Affiliation:

1. Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland

Abstract

Motivation: The amount of sequencing data from high-throughput sequencing technologies grows at a pace exceeding the one predicted by Moore’s law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources.

Results: We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 15 and 20% on average, respectively, while being comparably fast in decompression.

Availability and implementation: PgRC can be downloaded from https://github.com/kowallus/PgRC.

Supplementary information: Supplementary data are available at Bioinformatics online.
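The idea named in the abstract, approximating a shortest common superstring over reads, can be illustrated with a textbook greedy merge. The sketch below is not PgRC's implementation; the function names (`overlap`, `greedy_superstring`, `locate`) and the quadratic pairwise search are assumptions chosen for brevity, and the sketch ignores reverse complements, mismatches and low-quality reads.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Greedy approximation of a shortest common superstring (illustrative only)."""
    # Drop exact duplicates, then reads fully contained in another read.
    uniq = list(dict.fromkeys(reads))
    pool = [r for r in uniq if not any(r != s and r in s for s in uniq)]
    while len(pool) > 1:
        # Find the ordered pair (i, j) with the largest suffix/prefix overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair, writing the overlapping part only once.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [r for idx, r in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)
    return pool[0] if pool else ""


def locate(reads: list[str], pseudogenome: str) -> list[int]:
    """Offset of each read in the superstring; a read becomes a single position."""
    return [pseudogenome.find(r) for r in reads]


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
pseudogenome = greedy_superstring(reads)
offsets = locate(reads, pseudogenome)
```

In this toy input, 18 bases of reads collapse into a 10-base string plus three offsets, which is the kind of redundancy a superstring-based representation exploits.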

Funder

Smart Growth Operational Program

Polish National Centre for Research and Development

Institute of Applied Computer Science

Lodz University of Technology

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Link

http://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btz919/31668265/btz919.pdf

References: 29 articles.

1. Greedy shortest common superstring approximation in compact space;Alanko,2017

2. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph;Benoit;BMC Bioinformatics,2015

3. Compression of FASTQ and SAM format sequencing data;Bonfield;PLoS One,2013

4. SPRING: a next-generation compressor for FASTQ data;Chandak;Bioinformatics,2019

5. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis;Chandak;Bioinformatics,2018

Cited by 14 articles.

1. Genie: the first open-source ISO/IEC encoder for genomic data;Communications Biology;2024-05-09

2. A compressive seeding algorithm in conjunction with reordering-based compression;Bioinformatics;2024-02-20

3. PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering;BMC Bioinformatics;2023-11-30

4. A new efficient referential genome compression technique for FastQ files;Functional & Integrative Genomics;2023-11-11

5. CSNMG: constructing sequence neighbourhood mapping graphs to compress FASTQ files;International Conference on Intelligent Systems, Communications, and Computer Networks (ISCCN 2023);2023-06-16