Back to sequences: find the origin of k-mers


Baire Anthony, Peterlongo Pierre


Abstract

The vast majority of bioinformatics tools dedicated to processing raw sequencing data rely heavily on the concept of k-mers. This reduces data redundancy (and thus memory pressure), discards sequencing errors, and provides fixed-size objects that can be easily manipulated and compared with one another. A drawback is that the link between each k-mer and the original set of sequences it belongs to is generally lost. Given the volume of data considered in this context, recovering this association is costly. In this work, we present "back_to_sequences", a simple tool designed to index a set of k-mers of interest and to stream a set of sequences, extracting those that contain at least one of the indexed k-mers. In addition, the number of occurrences of k-mers in the sequences is provided. Our results show that back_to_sequences streams 200 short reads per millisecond, making it possible to search k-mers in hundreds of millions of reads in a matter of a few
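
The index-then-stream idea described in the abstract can be illustrated with a minimal Python sketch. This is not the back_to_sequences implementation: the function names (`kmers`, `index_kmers`, `filter_sequences`) are hypothetical, the index is a plain dictionary, and matching is exact and forward-strand only, which is an assumption made here for brevity.

```python
from typing import Dict, Iterable, Iterator, Tuple


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of seq, from left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def index_kmers(kmers_of_interest: Iterable[str]) -> Dict[str, int]:
    """Index the k-mers of interest, each mapped to an occurrence counter."""
    return {kmer: 0 for kmer in kmers_of_interest}


def filter_sequences(
    sequences: Iterable[str], index: Dict[str, int], k: int
) -> Iterator[Tuple[str, int]]:
    """Stream sequences one at a time.

    Yield (sequence, hits) for each sequence containing at least one
    indexed k-mer, and update the per-k-mer counters in index.
    """
    for seq in sequences:
        hits = 0
        for kmer in kmers(seq, k):
            if kmer in index:
                index[kmer] += 1
                hits += 1
        if hits > 0:
            yield seq, hits


# Toy data (hypothetical): two k-mers of interest with k = 4, three reads.
index = index_kmers(["ACGT", "TTTT"])
reads = ["GGACGTAA", "CCCCCCCC", "ATTTTTAC"]
kept = list(filter_sequences(reads, index, 4))
```

In this toy run, `kept` holds the first and third reads, and `index` ends up holding the number of times each indexed k-mer was seen across all reads. Because the reads are consumed as a stream, only the k-mer index needs to stay in memory. For a sense of scale, the reported rate of 200 reads per millisecond corresponds to 200,000 reads per second, so 100 million reads would take about 500 seconds at that rate.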


Cold Spring Harbor Laboratory







