Author:
Shibuya Yoshihiro, Comin Matteo
Abstract
Motivation
Next-generation sequencing (NGS) techniques are becoming exponentially cheaper. As a result, genomic data is growing exponentially, while storage capacity is not keeping pace, which makes compression necessary. Most of the entropy of NGS data lies in the quality values associated with each read. Those values are often more diverse than necessary. For this reason, many tools, such as Quartz and GeneCodeq, smooth (modify) quality scores to improve compressibility without altering the information that matters for downstream analyses such as SNP calling.
Results
We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers, and an effective smoothing algorithm to maintain high precision for SNP calling pipelines while reducing the entropy of the quality scores.
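To make the idea of an FM-Index-backed k-mer dictionary concrete, the following is a minimal sketch of backward search over a toy index. It is not YALFF's implementation: the function names (`build_fm_index`, `count_occurrences`), the `#` separator used to concatenate k-mers, and the uncompressed occurrence table are all assumptions made for illustration. A real tool would use a succinct representation rather than plain Python lists.

```python
def build_fm_index(text):
    """Build a naive, uncompressed FM-index for `text` (illustrative only)."""
    text += "$"  # unique sentinel marking the end of the text
    # Naive suffix array: fine for a demo, far too slow for real genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of `pattern` using backward search."""
    C, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


# Hypothetical dictionary: k-mers joined by a separator that never appears
# in a query, so a length-k query can only match a whole k-mer.
kmers = ["ACGT", "CGTA", "TTTT"]
index = build_fm_index("#".join(kmers))
print(count_occurrences(index, "CGTA") > 0)  # membership query for a stored k-mer
print(count_occurrences(index, "GTAC") > 0)  # membership query for an absent k-mer
```

The query cost depends on the pattern length rather than on the number of stored k-mers, which is what makes this family of indexes attractive for large dictionaries.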
We present YALFF (Yet Another Lossy Fastq Filter), a tool that compresses quality scores by smoothing, leading to improved compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources.
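As a rough illustration of why smoothing helps compression, the sketch below raises the quality of every base covered by a dictionary k-mer and compares the Shannon entropy of the quality string before and after. This is a deliberately simplified, generic dictionary-based smoother, not the algorithm YALFF uses: the names `smooth_qualities` and `entropy`, the plain Python set standing in for the succinct dictionary, and the replacement symbol `I` are all assumptions.

```python
import math
from collections import Counter


def smooth_qualities(seq, qual, dictionary, k, high="I"):
    """Replace the quality of every base covered by a dictionary k-mer."""
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in dictionary:
            for j in range(i, i + k):
                covered[j] = True
    # Covered bases get the single symbol `high`; the rest keep their score.
    return "".join(high if c else q for c, q in zip(covered, qual))


def entropy(symbols):
    """Shannon entropy, in bits per symbol, of a string."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Toy read and dictionary (hypothetical data, for illustration only).
seq = "ACGTACGA"
qual = "IHGFEDCB"
dictionary = {"ACGT", "CGTA"}

smoothed = smooth_qualities(seq, qual, dictionary, k=4)
print(smoothed)
print(entropy(qual), entropy(smoothed))
```

Fewer distinct quality symbols means lower entropy, so a general-purpose compressor needs fewer bits per score. Only the uncovered bases, where the read disagrees with the dictionary, retain their original scores.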
Availability
https://github.com/yhhshb/yalff
Publisher
Springer Science and Business Media LLC
Subject
Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology
Cited by
10 articles.
1. Parallel Lossy Compression for Large FASTQ Files;Biomedical Engineering Systems and Technologies;2023
2. ACO: lossless quality score compression based on adaptive coding order;BMC Bioinformatics;2022-06-07
3. K2Mem: Discovering Discriminative K-mers From Sequencing Data for Metagenomic Reads Classification;IEEE/ACM Transactions on Computational Biology and Bioinformatics;2022-01-01
4. Efficient k-mer Indexing with Application to Mapping-free SNP Genotyping;Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies;2022
5. Lossy Compressor Preserving Variant Calling through Extended BWT;Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies;2022