Minimizer-space de Bruijn graphs

Published: 2021-06-10

Author:

Ekim Barış, Berger Bonnie, Chikhi Rayan

Abstract

DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where the minimizers rather than DNA nucleotides are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers, which are k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three use cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low-memory and highly contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we newly allow a graphical representation of a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.
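The projection into minimizer space described in the abstract can be illustrated with a small sketch. This is not the rust-mdbg implementation: the hash function, the density-threshold selection rule, and the names `lmer_hash`, `minimizer_tokens`, `k_min_mers`, `l`, and `density` are all assumptions chosen for illustration. The sketch only shows the general idea of turning a DNA string into an ordered list of minimizer tokens and then enumerating k-mers over those tokens.

```python
import hashlib


def lmer_hash(lmer: str) -> int:
    """Hash an l-mer to a 64-bit integer (hash choice is illustrative only)."""
    digest = hashlib.blake2b(lmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_tokens(seq: str, l: int, density: float) -> list[str]:
    """Project a DNA sequence onto an ordered list of minimizer tokens.

    Assumption: an l-mer is selected when its hash falls below
    density * 2**64, so roughly a `density` fraction of l-mers is kept.
    """
    threshold = density * 2**64
    tokens = []
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        if lmer_hash(lmer) < threshold:
            tokens.append(lmer)
    return tokens


def k_min_mers(tokens: list[str], k: int) -> list[tuple[str, ...]]:
    """Enumerate k-min-mers: windows of k consecutive minimizer tokens."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


# Hypothetical usage on a toy read; real reads are far longer.
read = "ACGTTGCATGCCGATAGCTAGGCTTACGGATCCA"
tokens = minimizer_tokens(read, l=4, density=0.3)
print(tokens)
print(k_min_mers(tokens, k=3))
```

Each k-min-mer would act as a node of a de Bruijn graph whose alphabet is minimizer tokens rather than nucleotides, which is why the graph can be much smaller than a nucleotide-space one.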

Publisher

Cold Spring Harbor Laboratory

References (54 articles)

1. Batu, T., Ergun, F., Şahinalp, C.: Oblivious string embeddings and edit distance approximations. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 792–801. SODA '06, Society for Industrial and Applied Mathematics, USA (2006)

2. High-Throughput Gene Mapping in Caenorhabditis elegans

3. Computational solutions for omics data

4. The role of polygenic risk and susceptibility genes in breast cancer over the course of life

Cited by 3 articles.

1. BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis;2022-11-25

2. Efficient minimizer orders for large values of k using minimum decycling sets;2022-10-21

3. Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances;2022-01-12