Affiliation:
1. Institute of Applied Computer Science, Lodz University of Technology, ul. Stefanowskiego 18, 90-537 Lodz, Poland
Abstract
Background
Genomes within the same species are highly similar, a property exploited by specialized multiple-genome compressors. Existing algorithms and tools, however, target large (e.g., mammalian) genomes, and their performance on bacterial strains is only moderate.
Results
In this work, we propose MBGC, a specialized genome compressor that exploits the specific redundancy of bacterial genomes. Its characteristic features are finding both direct and reverse-complemented LZ-matches and careful management of a reference buffer in a multi-threaded implementation. Our tool is both compression-efficient and fast. On a collection of 168,311 bacterial genomes totalling 587 GB, it achieves a compression ratio of approximately 1,265 and compression (respectively, decompression) speeds of ∼1,580 MB/s (respectively, 780 MB/s) using 8 hardware threads on a computer with a 14-core/28-thread CPU and a fast SSD. This makes it almost 3 times more succinct and more than 6 times faster in compression than the next best competitor.
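To illustrate what finding direct and reverse-complemented LZ-matches can mean in practice, the following is a minimal sketch in Python. It is not the MBGC implementation: the k-mer dictionary, the function names (`build_kmer_index`, `longest_match`) and the first-occurrence seeding policy are all assumptions chosen for brevity, and the reference buffer management and multi-threading mentioned above are not modelled.

```python
# Illustrative sketch only; the names and the index layout are hypothetical,
# not taken from MBGC.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_kmer_index(reference: str, k: int) -> dict:
    """Map each k-mer to the position of its first occurrence in the reference."""
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], pos)
    return index


def longest_match(reference: str, index: dict, target: str, i: int, k: int):
    """Find a match for target[i:] seeded by a k-mer lookup.

    Returns (ref_start, length, is_reverse_complement) or None, where
    reference[ref_start:ref_start + length] equals target[i:i + length]
    (direct) or its reverse complement.
    """
    seed = target[i:i + k]
    if len(seed) < k:
        return None
    best = None

    # Direct match: extend forward in both target and reference.
    p = index.get(seed)
    if p is not None:
        length = k
        while (i + length < len(target) and p + length < len(reference)
               and target[i + length] == reference[p + length]):
            length += 1
        best = (p, length, False)

    # Reverse-complemented match: extending the target forward
    # walks the reference backward from the seed position.
    p = index.get(reverse_complement(seed))
    if p is not None:
        length = k
        while (i + length < len(target) and p - (length - k) - 1 >= 0
               and target[i + length].translate(COMPLEMENT)
               == reference[p - (length - k) - 1]):
            length += 1
        if best is None or length > best[1]:
            best = (p - (length - k), length, True)

    return best
```

A match found this way could be encoded as a (position, length, orientation) triple instead of the literal bases, which is the basic idea behind LZ-style referential compression. A production compressor would also need to handle hash collisions, multiple candidate positions and non-ACGT symbols.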
Funder
Lodz University of Technology
Publisher
Oxford University Press (OUP)
Subject
Computer Science Applications,Health Informatics
Cited by
9 articles.