Design of Worst-Case-Optimal Spaced Seeds

Published: 2023-11-21

Author:

Rahmann Sven, Zentgraf Jens

Abstract

Read mapping (and alignment) is a fundamental problem in biological sequence analysis. For speed and computational efficiency, many popular read mappers tolerate only a few differences between the read and the corresponding part of the reference genome, which leads to reference bias: Reads with too many differences are not guaranteed to be mapped correctly or at all, because to even consider a genomic position, a sufficiently long exact match (seed) must exist.

While pangenomes and their graph-based representations provide one way to avoid reference bias by enlarging the reference, we explore an orthogonal approach and consider stronger substitution-tolerant primitives, namely spaced seeds or gapped k-mers. Given two integers k ≤ w, one considers k selected positions, described by a mask, from each length-w window in a sequence. In the existing literature, masks with certain probabilistic guarantees have been designed for small values of k.

Here, for the first time, we take a combinatorial approach from a worst-case perspective. For any mask, using integer linear programs, we find least favorable distributions of sequence changes in two different senses: (1) minimizing the number of unchanged windows; (2) minimizing the number of positions covered by unchanged windows. Then, among all masks of a given shape (k, w), we find the set of best masks that maximize these minima. As a result, we obtain highly robust masks, even for large numbers of changes. Their advantages are illustrated in two ways: First, we provide a new challenge dataset of simulated DNA reads, on which current methods like bwa-mem2, minimap2, or strobealign struggle to find seeds, and therefore cannot produce alignments against the human T2T reference genome, whereas we are able to find the correct location from a few unique spaced seeds. Second, we use real DNA data from the highly diverse human HLA region, which we are able to map correctly based on a few exactly matching spaced seeds of well-chosen masks, without evaluating alignments.
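
To make the notions of mask, window, and worst case concrete, the following is a minimal Python sketch, not the authors' method. It replaces the integer linear programs mentioned in the abstract with exhaustive enumeration, which is only feasible for very short sequences. The mask notation (`#` for a selected position, `_` for an ignored one), all function names, and the reading of "covered" as "lying under a selected position of some unchanged window" are assumptions made for illustration.

```python
from itertools import combinations


def selected_offsets(mask):
    """Offsets of the k selected positions in a mask string ('#' = selected)."""
    return [i for i, c in enumerate(mask) if c == "#"]


def spaced_seeds(seq, mask):
    """Return (start, gapped k-mer) for every length-w window of seq."""
    w = len(mask)
    offsets = selected_offsets(mask)
    return [
        (start, "".join(seq[start + i] for i in offsets))
        for start in range(len(seq) - w + 1)
    ]


def unchanged_windows(mask, n, changed):
    """Starts of windows in a length-n sequence whose selected positions
    avoid every changed position."""
    w = len(mask)
    offsets = selected_offsets(mask)
    changed = set(changed)
    return [
        s for s in range(n - w + 1)
        if not any(s + i in changed for i in offsets)
    ]


def worst_case(mask, n, num_changes):
    """Brute-force stand-in for the ILPs: minimize, over all placements of
    num_changes changes, (1) the number of unchanged windows and
    (2) the number of positions covered by unchanged windows.
    Assumption: a position counts as covered if it lies under a selected
    ('#') position of at least one unchanged window."""
    offsets = selected_offsets(mask)
    min_windows = min_covered = None
    for changed in combinations(range(n), num_changes):
        starts = unchanged_windows(mask, n, changed)
        covered = {s + i for s in starts for i in offsets}
        if min_windows is None or len(starts) < min_windows:
            min_windows = len(starts)
        if min_covered is None or len(covered) < min_covered:
            min_covered = len(covered)
    return min_windows, min_covered


if __name__ == "__main__":
    mask = "##_#"  # hypothetical toy mask with shape (k, w) = (3, 4)
    print(spaced_seeds("ACGTACGT", mask))
    print(worst_case(mask, n=8, num_changes=1))
```

Comparing `worst_case` across all masks of a given shape (k, w) and keeping those with the largest minima mirrors, on a toy scale, the selection of best masks described above. The enumeration grows combinatorially with the sequence length and the number of changes, so it serves only as an illustration.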

Publisher

Cold Spring Harbor Laboratory

References (23 articles)

1. Optimal spaced seeds for homologous coding regions

2. Better filtering with gapped q-grams. Fundam. Informaticae, 2003

3. Spaced seeds improve k-mer-based metagenomic classification

4. Reference flow: reducing reference bias using multiple population genomes

5. Hit integration for identifying optimal spaced seeds. BMC Bioinformatics, 2010