The mod-minimizer: a simple and efficient sampling algorithm for long k-mers

Authors:

Ragnar Groot Koerkamp, Giulio Ermanno Pibiri

Abstract

Motivation: Given a string S, a minimizer scheme is an algorithm defined by a triple (k, w, 𝒪) that samples a subset of k-mers (k-long substrings) from S. Specifically, it samples the smallest k-mer according to the order 𝒪 from each window of w consecutive k-mers in S. Because consecutive windows can sample the same k-mer, the set of sampled k-mers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one k-mer must be sampled from every window of w consecutive k-mers. As a sampled k-mer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, because of the window guarantee, is lower bounded by 1/w. It is, however, difficult to design a sequence-agnostic algorithm with provably optimal density. In practice, the order 𝒪 is usually implemented using a pseudo-random hash function to obtain the so-called random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows.

Methods: In this work we introduce mod-sampling, a two-step sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the mod-sampling algorithm finds the position p of the smallest t-mer in a window. It then samples the k-mer at position p mod w. The lr-minimizer uses t = k − w and the mod-minimizer uses t ≡ k (mod w).

Results: These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the mod-minimizer achieves optimal density when k → ∞. Although the mod-minimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work. We provide pseudocode for a number of other methods and compare to them.
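The two steps of mod-sampling can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the use of blake2b as the pseudo-random order, leftmost tie-breaking, and the lower bound `r` used to pick a concrete t with t ≡ k (mod w) are all choices made here for the example. A window is taken to hold w consecutive k-mers, so it spans w + k − 1 characters.

```python
import hashlib
import random


def hash_order(tmer: str) -> bytes:
    # Stand-in for the pseudo-random order; blake2b is an arbitrary choice here.
    return hashlib.blake2b(tmer.encode(), digest_size=8).digest()


def mod_sample(window: str, w: int, k: int, t: int, order=hash_order) -> int:
    """Offset in [0, w) of the k-mer sampled from one window of w k-mers."""
    assert len(window) == w + k - 1 and 1 <= t <= k
    num_tmers = len(window) - t + 1  # equals w + k - t
    # Step 1: position p of the smallest t-mer; ties go to the leftmost (assumed).
    p = min(range(num_tmers), key=lambda i: (order(window[i:i + t]), i))
    # Step 2: sample the k-mer at position p mod w.
    return p % w


def lr_minimizer_t(w: int, k: int) -> int:
    assert k > w  # t = k - w must be positive
    return k - w


def mod_minimizer_t(w: int, k: int, r: int = 4) -> int:
    # Smallest t >= r with t ≡ k (mod w); r is an assumed lower bound on t.
    assert k >= r
    return r + (k - r) % w


def density(seq: str, w: int, k: int, t: int, order=hash_order) -> float:
    """Fraction of k-mer positions of seq that are sampled by some window."""
    n_windows = len(seq) - (w + k - 1) + 1
    sampled = {
        i + mod_sample(seq[i:i + w + k - 1], w, k, t, order)
        for i in range(n_windows)
    }
    return len(sampled) / (len(seq) - k + 1)


# Small demo on a random DNA string (seeded, so runs are repeatable).
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(5000))
w, k = 11, 21
print("random minimizer:", density(seq, w, k, t=k))
print("mod-minimizer:   ", density(seq, w, k, t=mod_minimizer_t(w, k)))
```

Setting t = k leaves exactly w candidate positions, so p mod w = p and the sketch reduces to an ordinary minimizer under the same order. Because every window samples one of its own k-mers, the window guarantee holds for any valid t.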
In practice, the mod-minimizer has considerably lower density than the random minimizer and other state-of-the-art methods, like closed syncmers and miniception, when k > w. We plugged the mod-minimizer into SSHash, a k-mer dictionary based on minimizers. For default parameters (w, k) = (11, 21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.

2012 ACM Subject Classification: Theory of computation → Sketching and sampling; Applied computing → Bioinformatics

Digital Object Identifier: 10.4230/LIPIcs.WABI.2024.11

Supplementary Material:
Software (C++): github.com/jermp/minimizers
Software (Rust): github.com/RagnarGrootKoerkamp/minimizers

Funding:
Ragnar Groot Koerkamp: ETH Research Grant ETH-1721-1 to Gunnar Rätsch.
Giulio Ermanno Pibiri: European Union's Horizon Europe research and innovation programme (EFRA project, Grant Agreement Number 101093026). This work was also partially supported by DAIS – Ca' Foscari University of Venice within the IRIDE program.

Publisher

Cold Spring Harbor Laboratory
