Abstract
Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as one cannot efficiently search them for any sequence(s) of interest. We present kmindex, an innovative approach that can index thousands of highly complex metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. We demonstrate the scalability of kmindex by successfully indexing 1,393 complex marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server “Ocean Read Atlas” (ORA) at https://ocean-read-atlas.mio.osupytheas.fr/, which enables real-time queries on the Tara Oceans dataset. The open-source kmindex software is available at https://github.com/tlemane/kmindex.
Publisher
Cold Spring Harbor Laboratory