SAKE: Strobemer-assisted k-mer extraction

Published:2023-11-29 Issue:11 Volume:18 Page:e0294415
ISSN:1932-6203
Container-title:PLOS ONE
language:en
Short-container-title:PLoS ONE

Author:

Leinonen Miika, Salmela Leena

Abstract

K-mer-based analysis plays an important role in many bioinformatics applications, such as de novo assembly, sequencing error correction, and genotyping. To take full advantage of such methods, the k-mer content of a read set must be captured as accurately as possible. Often the use of long k-mers is preferred because they can be uniquely associated with a specific genomic region. Unfortunately, it is not possible to reliably extract long k-mers in high error rate reads with standard exact k-mer counting methods. We propose SAKE, a method to extract long k-mers from high error rate reads by utilizing strobemers and consensus k-mer generation through partial order alignment. Our experiments show that on simulated data with up to 6% error rate, SAKE can extract 97-mers with over 90% recall. Conversely, the recall of DSK, an exact k-mer counter, drops to less than 20%. Furthermore, the precision of SAKE remains similar to DSK. On real bacterial data, SAKE retrieves 97-mers with a recall of over 90% and slightly lower precision than DSK, while the recall of DSK already drops to 50%. We show that SAKE can extract more k-mers from uncorrected high error rate reads compared to exact k-mer counting. However, exact k-mer counters run on corrected reads can extract slightly more k-mers than SAKE run on uncorrected reads.
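The abstract's claim that exact counting cannot reliably recover long k-mers from high error rate reads can be illustrated with a short calculation. The sketch below is not taken from SAKE or DSK. It assumes, as a simplification, that sequencing errors hit each base independently with the same probability, and the function names `error_free_fraction` and `count_kmers` are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter


def error_free_fraction(k: int, error_rate: float) -> float:
    """Probability that k consecutive bases are all correct.

    Assumes independent, uniform per-base errors (a simplifying assumption).
    """
    return (1.0 - error_rate) ** k


def count_kmers(reads, k: int) -> Counter:
    """Naive exact k-mer counting: every length-k window is counted as-is."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    # Compare a short and a long k-mer at the 6% error rate from the abstract.
    for k in (21, 97):
        print(k, error_free_fraction(k, 0.06))
```

Under this simplified model, roughly 27% of 21-mer windows in a read with a 6% error rate are error-free, but only about 0.25% of 97-mer windows are. An exact counter such as the naive `count_kmers` above only ever sees a correct long k-mer in that small fraction of windows, which is consistent with the low recall the abstract reports for exact counting on uncorrected reads.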

Funder

Academy of Finland

Helsinki University Library

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary
