Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach

Published:2023-02-06 Issue:1 Volume:13
ISSN:2045-2322
Container-title:Scientific Reports
language:en
Short-container-title:Sci Rep

Author:

Meng Qingxi, Chandak Shubham, Zhu Yifan, Weissman Tsachy

Abstract

The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35–0.65 bits per base which is 3–6× lower than general purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (>4× faster decompression with 20 threads). NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring.
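The abstract reports compression in bits per base, that is, the compressed size in bits divided by the number of bases. The following is a minimal sketch of how that metric could be computed for a general-purpose compressor such as gzip, using Python's standard `gzip` module. The function names and the synthetic test sequence are assumptions made for illustration; they are not part of NanoSpring or of the paper's evaluation. A uniformly random sequence contains none of the redundancy of real reads, so the value it yields is not comparable to the figures quoted above.

```python
import gzip
import random


def bits_per_base(num_bases: int, compressed_size_bytes: int) -> float:
    """Return compressed bits divided by the number of bases."""
    if num_bases <= 0:
        raise ValueError("num_bases must be positive")
    return 8 * compressed_size_bytes / num_bases


def gzip_bits_per_base(sequence: str) -> float:
    """Compress a base sequence with gzip and report bits per base."""
    compressed = gzip.compress(sequence.encode("ascii"))
    return bits_per_base(len(sequence), len(compressed))


if __name__ == "__main__":
    # Hypothetical input: a uniformly random sequence over A, C, G, T.
    rng = random.Random(0)
    sequence = "".join(rng.choice("ACGT") for _ in range(100_000))
    print(f"gzip: {gzip_bits_per_base(sequence):.2f} bits per base")
```

The same `bits_per_base` helper could be applied to the output size of any compressor, provided only the base sequences (not quality values or read identifiers) are counted on both sides of the ratio.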

Funder

Philips Research Americas

Publisher

Springer Science and Business Media LLC

Subject

Multidisciplinary

Link

https://www.nature.com/articles/s41598-023-29267-8.pdf

Reference: 23 articles.

1. Chandak, S. et al. SPRING: A next-generation compressor for FASTQ data. Bioinformatics 35, 2674–2676. https://doi.org/10.1093/bioinformatics/bty1015 (2019).