Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes

Published: 2023-02-24

Author:

Alanko Jarno N., Vuohtoniemi Jaakko, Mäklin Tommi, Puglisi Simon J.

Abstract

Motivation: Huge data sets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these data sets, indexing data structures that are both scalable and provide rapid query throughput are paramount.

Results: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 hours. The resulting index takes 142 gigabytes. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11 thousand genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets.

Availability and implementation: Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license.

Contact: jarno.alanko@helsinki.fi

Supplementary information: Supplementary data are available at Bioinformatics online.

Publisher

Cold Spring Harbor Laboratory

References: 24 articles.

1. Achtman, M. et al. (2020). Genomic diversity of Salmonella enterica - the UoWUCC 10k genomes project. Wellcome Open Research, 5.

2. Alanko, J. N. et al. (2022). Succinct k-mer sets using subset rank queries on the spectral Burrows-Wheeler transform. bioRxiv.

3. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences; PLoS Biology, 2021

4. Bowe, A. et al. (2012). Succinct de Bruijn graphs. In International workshop on algorithms in bioinformatics, pages 225–235. Springer.

5. Near-optimal probabilistic RNA-seq quantification

Cited by 6 articles.

1. Compression algorithm for colored de Bruijn graphs;Algorithms for Molecular Biology;2024-05-26

2. Pan-genome de Bruijn graph using the bidirectional FM-index;BMC Bioinformatics;2023-10-26

3. kmindex and ORA: indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets;2023-06-04

4. Compression algorithm for colored de Bruijn graphs;2023-05-14

5. Fulgor: A fast and compact k-mer index for large-scale matching and color queries;2023-05-11