Abstract
AbstractRead mapping (and alignment) is a fundamental problem in biological sequence analysis. For speed and computational efficiency, many popular read mappers tolerate only a few differences between the read and the corresponding part of the reference genome, which leads to reference bias: Reads with too many difference are not guaranteed to be mapped correctly or at all, because to even consider a genomic position, a sufficiently longexactmatch (seed) must exist.While pangenomes and their graph-based representations provide one way to avoid reference bias by enlarging the reference, we explore an orthogonal approach and consider stronger substitution-tolerant primitives, namelyspaced seedsor gappedk-mers. Given two integersk ≤ w, one considerskselected positions, described by amask, from each length-wwindow in a sequence. In the existing literature, masks with certainprobabilisticguarantees have been designed for small values ofk.Here, for the first time, we take a combinatorial approach from aworst-caseperspective. For any mask, using integer linear programs, we find least favorable distributions of sequence changes in two different senses: (1) minimizing the number of unchanged windows; (2) minimizing the number of positions covered by unchanged windows. Then, among all masks of a given shape (k, w), we find the set of best masks that maximize these minima. As a result, we obtain highly robust masks, even for large numbers of changes. Their advantages are illustrated in two ways: First, we provide a new challenge dataset of simulated DNA reads, on which current methods like bwa-mem2, minimap2, or strobealign struggle to find seeds, and therefore cannot produce alignments against the human t2t reference genome, whereas we are able to find the correct location from a few unique spaced seeds. Second, we use real DNA data from the highly diverse human HLA region, which we are able to map correctly based on a few exactly matching spaced seeds of well-chosen masks, without evaluating alignments.
Publisher
Cold Spring Harbor Laboratory