k-nonical space: sketching with reverse complements

Published: 2024-01-27

Author:

Marçais Guillaume, Elder C.S., Kingsford Carl

Abstract

Sequences equivalent to their reverse complements (i.e., double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g., sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: the canonical representation (k-nonical space). The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate them. In particular, we show that large stretches of the genome (“sketching deserts”) are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate these effects: (1) a new procedure that adapts existing sketching methods to k-nonical space and (2) an optimization procedure to directly design new sketching methods for k-nonical space. The code used in this analysis is freely available at https://github.com/Kingsford-Group/mdsscope.
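As a rough illustration of the folding step described in the abstract, the sketch below maps each k-mer to a canonical representative and measures the longest run of unselected k-mers under a selection rule. It is not taken from the paper or from the mdsscope repository. Two things are assumptions: the canonical form is taken to be the lexicographically smaller of a k-mer and its reverse complement (a common convention, not confirmed by this abstract), and a context-free method is modeled as a predicate that looks only at the k-mer itself. The names `canonical`, `selected_positions`, and `largest_gap` are hypothetical.

```python
# Complement table for the DNA alphabet (assumes uppercase A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Fold a k-mer and its reverse complement into one representative.

    Assumption: the representative is the lexicographically smaller string.
    """
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq: str, k: int) -> list[str]:
    """Canonical form of every k-mer of seq, in order of start position."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def selected_positions(seq: str, k: int, is_selected) -> list[int]:
    """Start positions whose canonical k-mer passes a context-free rule.

    is_selected is a hypothetical predicate that sees only the canonical
    k-mer, never its neighbours in the sequence.
    """
    return [i for i, c in enumerate(canonical_kmers(seq, k)) if is_selected(c)]


def largest_gap(positions: list[int], n_kmers: int) -> int:
    """Longest run of consecutive k-mer start positions that were not selected."""
    bounds = [-1] + sorted(positions) + [n_kmers]
    return max(b - a - 1 for a, b in zip(bounds, bounds[1:]))
```

For example, `largest_gap(selected_positions(seq, k, rule), len(seq) - k + 1)` gives a crude per-sequence measure of how long a stretch a given rule skips; a large value relative to the sequence length would correspond to the kind of undersampled region the abstract calls a sketching desert.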

Publisher

Cold Spring Harbor Laboratory

References (26 articles)

1. Sheldon Axler. Linear algebra done right. Springer Nature, 2023.

2. Unavoidable sets of constant length. International Journal of Algebra and Computation, 2004.

3. Parameterized syncmer schemes improve long-read mapping. PLOS Computational Biology, 2022.

4. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences.

5. Barış Ekim, Bonnie Berger, and Yaron Orenstein. A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. In Russell Schwartz, editor, Research in Computational Molecular Biology, Lecture Notes in Computer Science, pages 37–53, Cham, 2020. Springer International Publishing.

Cited by 1 article.

1. LexicMap: efficient sequence alignment against millions of prokaryotic genomes. 2024-08-31.