Matchtigs: minimum plain text representation of kmer sets

Published: 2021-12-17

Author:

Schmidt Sebastian, Khan Shahbaz, Alanko Jarno, Tomescu Alexandru I.

Abstract

Kmer-based methods are widely used in bioinformatics, which raises the question of what the smallest practically usable (i.e. plain text) representation of a set of kmers is. We propose a polynomial-time algorithm that computes a minimum such representation (a problem previously posed as open and potentially NP-hard), as well as an efficient near-minimum greedy heuristic. When compressing genomes of large model organisms, read sets thereof, or bacterial pangenomes, we decrease the size of the representation by up to 60% over unitigs and 27% over previous work, with only a minor runtime increase. Additionally, the number of strings is decreased by up to 97% over unitigs and 91% over previous work. Finally, we show that a small representation has advantages in downstream applications, as it speeds up queries on the popular kmer indexing tool Bifrost by 1.66x over unitigs and 1.29x over previous work.
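To make the notion of a plain text representation concrete, the following minimal sketch compares two sets of strings that contain exactly the same kmers. It is an illustration only, not the algorithm from the paper: the helper names (`kmers`, `spectrum`, `cumulative_length`) and the toy kmer set are assumptions made for this example, and reverse complements, which practical kmer tools usually treat as equivalent, are ignored for simplicity.

```python
def kmers(s, k):
    """Return the set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def spectrum(strings, k):
    """Return the union of the kmer sets of all given strings."""
    result = set()
    for s in strings:
        result |= kmers(s, k)
    return result


def cumulative_length(strings):
    """Total number of characters over all strings (the 'size')."""
    return sum(len(s) for s in strings)


k = 3
# Hypothetical toy kmer set, chosen only for illustration.
kmer_set = {"ACG", "CGT", "GTA", "TAC"}

# Representation A: one string per kmer.
one_per_kmer = sorted(kmer_set)

# Representation B: a single string whose consecutive kmers overlap by k-1.
merged = ["ACGTAC"]

# Both representations contain exactly the kmers of kmer_set.
assert spectrum(one_per_kmer, k) == kmer_set
assert spectrum(merged, k) == kmer_set

print(len(one_per_kmer), cumulative_length(one_per_kmer))
print(len(merged), cumulative_length(merged))
```

Both representations are valid under this simplified definition, but the merged one uses fewer strings and fewer characters. These are the two quantities (number of strings and total size) that the abstract reports reductions for.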

Publisher

Cold Spring Harbor Laboratory

Cited by 4 articles.

1. Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time;Algorithms for Molecular Biology;2023-07-04

2. Masked superstrings as a unified framework for textual k-mer set representations;2023-02-03

3. Extremely-fast construction and querying of compacted and colored de Bruijn graphs with GGCAT;2022-10-25

4. Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time;2022-05-19