FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model

Published:2021-10-08 Issue:2 Volume:38 Page:351-356
ISSN:1367-4803
Container-title:Bioinformatics
language:en

Author:

Lee Dohyeon¹, Song Giltae¹

Affiliation:

1. School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea

Abstract

Motivation: Over the past decades, vast amounts of genome sequencing data have been produced, requiring enormous storage capacity. The time and resources needed to store and transfer such data create bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of raw FASTQ sequencing data, but these remain suboptimal. Long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only.

Results: We designed a compression algorithm based on read reordering using a novel scoring model, which reduces FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS and provided it as a Docker image for ease of installation and execution. We compared our method with existing major FASTQ compression tools using benchmark datasets, and also included new long-read sequencing data in this validation. FastqCLS outperformed the compared tools in terms of compression ratio for long-read sequencing data.

Availability and implementation: FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS.

Supplementary information: Supplementary data are available at Bioinformatics online.
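The abstract does not spell out the scoring model, so the following is only a minimal sketch of the general idea of reordering reads before compression. Everything in it is an assumption made for illustration: the function names, the use of GC fraction as a stand-in score, the choice of `lzma` as the back-end compressor, and the simplification that each FASTQ record occupies exactly four lines. It is not the FastqCLS implementation.

```python
import lzma


def parse_fastq(text):
    """Split FASTQ text into 4-line records (header, sequence, '+', quality).

    Assumes single-line sequences, i.e. exactly four lines per record.
    """
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def toy_score(record):
    """Hypothetical stand-in score: GC fraction of the read.

    This is NOT the scoring model from the paper; it only gives the
    sort below something to order by.
    """
    seq = record[1]
    return sum(base in "GC" for base in seq) / max(len(seq), 1)


def reorder_and_compress(text):
    """Sort records by the toy score, then compress the reordered text."""
    ordered = sorted(parse_fastq(text), key=toy_score)
    payload = "\n".join("\n".join(rec) for rec in ordered) + "\n"
    return lzma.compress(payload.encode("ascii"))


def decompress(blob):
    """Recover the (reordered) records from a compressed blob."""
    return parse_fastq(lzma.decompress(blob).decode("ascii"))
```

In this sketch every record survives the round trip unchanged; only the order of the reads differs from the input. Whether reordering actually shrinks the output depends on the data and on the score used, so no size improvement should be inferred from this toy version.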

Funder

Institute of Information & Communications Technology Planning & Evaluation

Korea government

Artificial Intelligence Convergence Research Center

National Research Foundation of Korea (NRF) grant funded by the Korea government

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Link

http://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btab696/40934552/btab696.pdf

References: 32 articles.

1. Single cells make big data: new challenges and opportunities in transcriptomics;Angerer;Curr. Opin. Syst. Biol,2017

2. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph;Benoit;BMC Bioinformatics,2015

3. Compression of FASTQ and SAM format sequencing data;Bonfield;PLoS One,2013

Cited by 4 articles.

1. PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping;Bioinformatics;2024-05-01

2. PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering;BMC Bioinformatics;2023-11-30

3. CSNMG: constructing sequence neighbourhood mapping graphs to compress FASTQ files;International Conference on Intelligent Systems, Communications, and Computer Networks (ISCCN 2023);2023-06-16

4. Portable nanopore-sequencing technology: Trends in development and applications;Frontiers in Microbiology;2023-02-01