Data structures based on k-mers for querying large collections of sequencing datasets

Published: 2019-12-06

Author:

Marchet Camille, Boucher Christina, Puglisi Simon J, Medvedev Paul, Salson Mikaël, Chikhi Rayan

Abstract

High-throughput sequencing datasets are usually deposited in public repositories, e.g. the European Nucleotide Archive, to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not support online sequence searches, yet such a feature would be highly useful to investigators. Towards this goal, several computational approaches have been introduced in the last few years to index and query large collections of datasets. Here we present an accessible survey of these approaches, which are generally based on representing datasets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
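
As a rough illustration of the idea named in the abstract (datasets represented as sets of k-mers, then queried), here is a minimal Python sketch. It is not a method from the survey: the function names, the toy datasets, and the 0.8 threshold on the fraction of shared k-mers are all assumptions chosen for illustration, and plain in-memory sets stand in for the compact index structures the survey covers.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(datasets, k):
    """Map each dataset name to the union of the k-mers of its sequences."""
    index = {}
    for name, sequences in datasets.items():
        dataset_kmers = set()
        for s in sequences:
            dataset_kmers |= kmers(s, k)
        index[name] = dataset_kmers
    return index


def query(index, seq, k, threshold=0.8):
    """Return names of datasets holding at least `threshold` of the query's k-mers.

    The threshold value is a hypothetical default, not taken from the source.
    """
    q = kmers(seq, k)
    if not q:  # query shorter than k: nothing to match
        return []
    return sorted(
        name for name, ks in index.items()
        if len(q & ks) / len(q) >= threshold
    )


# Toy collection (made-up sequences, for illustration only).
datasets = {
    "sample_A": ["ACGTACGTGA", "TTGACCA"],
    "sample_B": ["GGGTTTAAAC"],
}
index = build_index(datasets, k=4)
hits = query(index, "ACGTACG", k=4)
print(hits)
```

Exact sets like these grow linearly with the number of distinct k-mers, which is why collections at the scale described in the abstract call for more compact representations.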

Publisher

Cold Spring Harbor Laboratory

References: 67 articles.

1. Almeida, A., Nayfach, S., Boland, M., Strozzi, F., Beracochea, M., Shi, Z. J., Pollard, K. S., Sakharova, E., Parks, D. H., Hugenholtz, P., et al. (2020). A unified catalog of 204,938 reference genomes from the human gut microbiome. Nature Biotechnology, pages 1–10.

2. Almodaresi, F., Pandey, P., Ferdman, M., Johnson, R., and Patro, R. (2019). An efficient, scalable and exact representation of high-dimensional color information enabled via de Bruijn graph search. In International Conference on Research in Computational Molecular Biology, pages 1–18. Springer.

3. Almodaresi, F., Pandey, P., and Patro, R. (2017). Rainbowfish: A succinct colored de Bruijn graph representation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

4. A space and time-efficient index for the compacted colored de Bruijn graph;Bioinformatics,2018

5. Don’t thrash: How to cache your hash on flash;PVLDB,2012

Cited by 9 articles.

1. General-purpose GPU hashing data structures and their application in accelerated genomics;Journal of Parallel and Distributed Computing;2022-05

2. Scalable Text Index Construction;Lecture Notes in Computer Science;2022

3. Disk compression of k-mer sets;Algorithms for Molecular Biology;2021-06-21

4. Kmerator Suite: design of specific k-mer signatures and automatic metadata discovery in large RNA-Seq datasets;2021-05-22

5. Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences;2020-10-08