Designing efficient randstrobes for sequence similarity analyses

Published: 2023-10-16

Author:

Karami Moein, Mohammadi Aryan Soltani, Martin Marcel, Ekim Barış, Shen Wei, Guo Lidong, Xu Mengyang, Pibiri Giulio Ermanno, Patro Rob, Sahlin Kristoffer

Abstract

Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis, reducing the search space by providing anchors between queries and references. However, k-mers are limited to exact matches between sequences. This has led to alternative constructs, such as spaced k-mers, that can match across substitutions. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions.

Randstrobes, the most sensitive strobemer proposed in [18], have been incorporated into several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Randstrobes are constructed by linking together k-mers in a pseudo-random fashion and depend on a hash function, a link function, and a comparator for their construction. Recently, we showed that the more random this linking appears (measured in entropy), the more efficient the seeds are for sequence similarity analysis. The level of pseudo-randomness depends on the hashing, linking, and comparison operators. However, no study has investigated the efficacy of the underlying operators to produce randstrobes.

In this study, we propose several new construction methods. One of our proposed methods is based on a Binary Search Tree (BST), which lowers the time complexity and practical runtime compared to other methods for some parametrizations. To our knowledge, we are also the first to describe and study the types of biases that occur during construction. We designed three metrics to measure the bias. Using these new evaluation metrics, we uncovered biases and limitations in previous methods and showed that our proposed methods have favorable speed and sampling uniformity compared to previously proposed methods. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. Also, we suggest combining the two versions to improve accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.
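
To make the roles of the hash function, link function, and comparator concrete, the following is a minimal sketch of order-2 randstrobe construction. It is not one of the construction methods evaluated in the paper. The choices here are illustrative assumptions: BLAKE2b as the hash, XOR as the link function, minimum as the comparator, and the names `kmer_hash`, `link_xor`, `randstrobes`, `w_min`, and `w_max` are all hypothetical. The sketch scans each window by brute force and does not implement the BST-based approach.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def link_xor(h1: int, h2: int) -> int:
    """Hypothetical link function: combine two strobe hashes with XOR."""
    return h1 ^ h2


def randstrobes(seq: str, k: int, w_min: int, w_max: int, link=link_xor):
    """Return (i, j) start positions of order-2 randstrobes in seq.

    For a first strobe starting at i, the second strobe is the k-mer starting
    in [i + w_min, i + w_max] whose linked value is smallest (comparator: min).
    """
    hashes = [kmer_hash(seq[p:p + k]) for p in range(len(seq) - k + 1)]
    seeds = []
    for i, h1 in enumerate(hashes):
        lo = i + w_min
        hi = min(i + w_max, len(hashes) - 1)
        if lo > hi:
            break  # no room left for a second strobe
        # Brute-force scan of the window: O(w_max - w_min) work per seed.
        j = min(range(lo, hi + 1), key=lambda p: link(h1, hashes[p]))
        seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    for i, j in randstrobes("ACGTACGTTGCAAGCTTAGC", k=3, w_min=4, w_max=8):
        print(i, j)
```

Swapping `link_xor` for another function (passed through the `link` parameter) changes which second strobe is selected, which is the kind of operator choice the abstract says affects pseudo-randomness.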

Publisher

Cold Spring Harbor Laboratory

References (23 articles)

1. Integer hash function. http://web.archive.org/web/20071223173210/http://www.concentric.net/~Ttwang/tech/inthash.htm. Accessed: 2023-07-20.

2. No hash function is perfect, but some are useful. https://github.com/wangyi-fudan/wyhash. Accessed: 2023-07-20.

3. xxHash - extremely fast hash algorithm. https://xxhash.com/. Accessed: 2023-07-20.

4. Technology dictates algorithms: recent developments in read alignment

5. Human Genome Assembly in 100 Minutes