Effective sequence similarity detection with strobemers

Published: 2021-10-19 Issue: 11 Volume: 31 Page: 2080-2094
ISSN: 1088-9051
Container-title: Genome Research
Language: en
Short-container-title: Genome Res.

Author:

Sahlin Kristoffer

Abstract

k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step, a pairing or grouping of k-mers is performed. Such techniques produce many redundant k-mer matches owing to the size of k. Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool StrobeMap and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios.
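The two ideas in the abstract, that one substitution alters k consecutive k-mers and that a strobemer links shorter k-mers chosen by a hash function, can be illustrated with a minimal Python sketch. It is not the paper's implementation and not StrobeMap. The function names (`h`, `kmers`, `linked_strobes`), the window parameters `w_min` and `w_max`, the BLAKE2b hash, and the linking rule (pick the second k-mer in the window that minimizes the sum of the two hashes modulo 2^64) are all assumptions made for illustration, as is the choice to shrink the window near the end of the sequence.

```python
import hashlib
import random

MASK = (1 << 64) - 1  # keep hash sums in 64 bits (illustrative choice)


def h(s: str) -> int:
    """Deterministic 64-bit hash of a string; any hash function could stand in."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping k-mers of seq, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def linked_strobes(seq: str, k: int, w_min: int, w_max: int) -> list[tuple[int, int, str]]:
    """Hypothetical two-strobe seeds: (start1, start2, strobe1 + strobe2).

    For each start i, the first strobe is seq[i:i+k]. The second strobe is the
    k-mer starting in [i + w_min, i + w_max] that minimizes
    (h(strobe1) + h(candidate)) mod 2^64. The window is truncated at the
    sequence end, and positions with an empty window are skipped.
    """
    n = len(seq)
    seeds = []
    for i in range(n - k + 1):
        lo = i + w_min
        hi = min(i + w_max, n - k)
        if lo > hi:
            break  # no room left for a second strobe
        first = seq[i:i + k]
        h1 = h(first)
        j = min(range(lo, hi + 1), key=lambda p: (h1 + h(seq[p:p + k])) & MASK)
        seeds.append((i, j, first + seq[j:j + k]))
    return seeds


# A random 200-nt sequence and a copy with one substitution at position 100.
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(200))
p = 100
mutated = seq[:p] + ("A" if seq[p] != "A" else "C") + seq[p + 1:]

# Every k-mer that overlaps position p changes: exactly k of them.
k = 15
changed_kmers = sum(a != b for a, b in zip(kmers(seq, k), kmers(mutated, k)))

# Linked seeds of two 8-mers, second strobe 10 to 20 positions downstream.
seeds = linked_strobes(seq, k=8, w_min=10, w_max=20)
seeds_mut = linked_strobes(mutated, k=8, w_min=10, w_max=20)
unchanged_seeds = sum(a == b for a, b in zip(seeds, seeds_mut))
```

Here `changed_kmers` equals `k` because the substitution lies far enough from both ends that k distinct k-mer start positions cover it. How many linked seeds survive depends on the hash and the window sizes, so `unchanged_seeds` is left for the reader to inspect rather than stated.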

Funder

Swedish Research Council

Publisher

Cold Spring Harbor Laboratory

Subject

Genetics (clinical), Genetics

References: 67 articles.

1. BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers

2. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

3. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing

4. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

5. The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches

Cited by 50 articles.

1. A survey of k-mer methods and applications in bioinformatics;Computational and Structural Biotechnology Journal;2024-12

2. RawHash2: mapping raw nanopore signals using hash-based seeding and adaptive quantization;Bioinformatics;2024-07-30

3. Long-read sequencing transcriptome quantification with lr-kallisto;2024-07-19

4. Efficient Seeding for Error-Prone Sequences with SubseqHash2;2024-06-03

5. The mod-minimizer: a simple and efficient sampling algorithm for long k-mers;2024-05-31