Seeding with minimized subsequence

Published:2023-06-01 Issue:Supplement_1 Volume:39 Page:i232-i241
ISSN:1367-4803
Container-title:Bioinformatics
language:en

Author:

Li Xiang¹, Shi Qian¹, Chen Ke¹, Shao Mingfu¹,²

Affiliation:

1. Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA

2. Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA

Abstract

Motivation: Modern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates, as kmers cannot tolerate errors.

Results: We propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order over all length-k strings. Finding the smallest subsequence of a string by enumeration is impractical, as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and that the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for long-reads analysis.

Availability and implementation: SubseqHash is freely available at https://github.com/Shao-Group/subseqhash.
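The definition in the abstract (map a length-n string to its smallest length-k subsequence under a given order) can be illustrated with a minimal sketch. This is not the authors' algorithm and does not use the ABC order: it enumerates every length-k subsequence, which is exactly the exponential approach the abstract calls impractical, and it is only usable on toy inputs. The function names, the `key` parameter, and the SHA-256-based default order are all assumptions made for illustration.

```python
import hashlib
from itertools import combinations


def sha256_order(seed: str) -> bytes:
    # Hypothetical total order over length-k strings: compare SHA-256 digests.
    # This is a stand-in for "a given order", NOT the paper's ABC order.
    return hashlib.sha256(seed.encode()).digest()


def minimized_subsequence(s: str, k: int, key=sha256_order) -> str:
    """Return the smallest length-k subsequence of s under the order `key`.

    Brute force over all C(n, k) index sets, so only suitable for tiny n.
    """
    if not 0 < k < len(s):
        raise ValueError("k must satisfy 0 < k < len(s)")
    # A set removes duplicate strings produced by different index choices.
    candidates = {
        "".join(s[i] for i in idx) for idx in combinations(range(len(s)), k)
    }
    # Ties under `key` (possible for custom orders) are broken lexicographically.
    return min(candidates, key=lambda t: (key(t), t))


def is_subsequence(t: str, s: str) -> bool:
    # True if t can be obtained from s by deleting characters.
    it = iter(s)
    return all(c in it for c in t)


if __name__ == "__main__":
    read = "ACGTACGTTA"
    seed = minimized_subsequence(read, 4)
    print(seed, is_subsequence(seed, read))
```

Because the seed is a subsequence rather than a substring, two strings that differ by a substitution or indel can still map to the same seed whenever the minimizing characters are unaffected, which is the error tolerance the abstract contrasts with kmers.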

Funder

National Science Foundation

National Institutes of Health

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics,Computational Theory and Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Statistics and Probability

Link

https://academic.oup.com/bioinformatics/article-pdf/39/Supplement_1/i232/50741374/btad218.pdf

Reference 40 articles.

1. Chaining algorithms for multiple genome comparison; Abouelhoda; J Discrete Algorithms, 2005

2. Basic local alignment search tool; Altschul; J Mol Biol, 1990

3. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs; Altschul; Nucleic Acids Res, 1997

4. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads; Bankevich; Nat Biotechnol, 2022

5. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing; Berlin; Nat Biotechnol, 2015

Cited by 2 articles.

1. Learning locality-sensitive bucketing functions; Bioinformatics; 2024-06-28

2. Efficient Seeding for Error-Prone Sequences with SubseqHash2; 2024-06-03