Abstract
Kmer-based methods are widely used in bioinformatics, which raises the question of what is the smallest practically usable representation (i.e. plain text) of a set of kmers. We propose a polynomial algorithm computing a minimum such representation (which was previously posed as a potentially NP-hard open problem), as well as an efficient near-minimum greedy heuristic. When compressing genomes of large model organisms, read sets thereof or bacterial pangenomes, with only a minor runtime increase, we decrease the size of the representation by up to 60% over unitigs and 27% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 91% over previous work. Finally we show that a small representation has advantages in downstream applications, as it speeds up queries on the popular kmer indexing tool Bifrost by 1.66x over unitigs and 1.29x over previous work.
Publisher
Cold Spring Harbor Laboratory
Cited by
4 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献