The mod-minimizer: a simple and efficient sampling algorithm for long k-mers


Ragnar Groot Koerkamp, Giulio Ermanno Pibiri


Abstract

Motivation: Given a string S, a minimizer scheme is an algorithm defined by a triple (k, w, 𝒪) that samples a subset of k-mers (k-long substrings) from a string S. Specifically, it samples the smallest k-mer according to the order 𝒪 from each window of w consecutive k-mers in S. Because consecutive windows can sample the same k-mer, the set of the sampled k-mers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one k-mer must be sampled from every window of w consecutive k-mers. As a sampled k-mer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, by respecting the window guarantee, is lower bounded by 1/w. It is however difficult to design a sequence-agnostic algorithm with provably optimal density. In practice, the order 𝒪 is usually implemented using a pseudo-random hash function to obtain the so-called random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows.

Methods: In this work we introduce mod-sampling, a two-step sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the mod-sampling algorithm finds the position p of the smallest t-mer in a window. It then samples the k-mer at position p mod w. The lr-minimizer uses t = k − w and the mod-minimizer uses t ≡ k (mod w).

Results: These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the mod-minimizer achieves optimal density when k → ∞. Although the mod-minimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work. We provide pseudocode for a number of other methods and compare to them.
In practice, the mod-minimizer has considerably lower density than the random minimizer and other state-of-the-art methods, like closed syncmers and miniception, when k > w. We plugged the mod-minimizer into SSHash, a k-mer dictionary based on minimizers. For default parameters (w, k) = (11, 21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.

2012 ACM Subject Classification: Theory of computation → Sketching and sampling; Applied computing → Bioinformatics

Digital Object Identifier: 10.4230/LIPIcs.WABI.2024.11

Supplementary Material: Software (C++): (Rust):

Groot Koerkamp: ETH Research Grant ETH-1721-1 to Gunnar Rätsch.

Giulio Ermanno Pibiri: European Union’s Horizon Europe research and innovation programme (EFRA project, Grant Agreement Number 101093026). This work was also partially supported by DAIS – Ca’ Foscari University of Venice within the IRIDE program.
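The two-step mod-sampling procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C++ or Rust implementation: the function names (`order`, `mod_sample`, `lr_t`, `mod_t`, `density`) are hypothetical, the BLAKE2b-based hash is just one possible pseudo-random order 𝒪, and the lower bound `r` used to pick a small t with t ≡ k (mod w) is an assumption made here for illustration.

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=None)
def order(tmer: str) -> int:
    """Hypothetical pseudo-random order on t-mers (a stand-in for the order O)."""
    digest = hashlib.blake2b(tmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def mod_sample(window: str, k: int, w: int, t: int) -> int:
    """Offset in [0, w) of the k-mer sampled from one window of w k-mers."""
    assert len(window) == w + k - 1 and 1 <= t <= k
    n_tmers = len(window) - t + 1  # w + k - t candidate t-mers
    # Step 1: position p of the smallest t-mer (leftmost on ties).
    p = min(range(n_tmers), key=lambda i: (order(window[i:i + t]), i))
    # Step 2: sample the k-mer at position p mod w.
    return p % w


def lr_t(k: int, w: int) -> int:
    """t for the lr-minimizer: t = k - w (only defined for k > w)."""
    assert k > w
    return k - w


def mod_t(k: int, w: int, r: int = 4) -> int:
    """A t >= r with t ≡ k (mod w); the lower bound r is an assumption."""
    assert k >= r
    return r + (k - r) % w


def density(seq: str, k: int, w: int, t: int) -> float:
    """Fraction of the k-mer positions of seq that get sampled."""
    span = w + k - 1  # characters covered by w consecutive k-mers
    sampled = {i + mod_sample(seq[i:i + span], k, w, t)
               for i in range(len(seq) - span + 1)}
    return len(sampled) / (len(seq) - k + 1)
```

Because p mod w always lies in [0, w), every window yields a valid k-mer, so the window guarantee holds by construction. Setting t = k makes the window contain exactly w candidate t-mers, so p mod w = p and the sketch reduces to a random minimizer; calling `density(seq, k, w, k)` and `density(seq, k, w, mod_t(k, w))` on the same sequence gives a rough empirical comparison of the two schemes.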


Cold Spring Harbor Laboratory







