Abstract
Sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose Counting de Bruijn graphs (Counting DBGs), a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting DBGs index k-mer abundances from 2,652 human RNA-Seq samples in representations over 8-fold smaller than those of state-of-the-art bioinformatics tools, while being faster to construct and query. Furthermore, Counting DBGs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-Seq and 57% smaller for PacBio HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI’s SRA (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.4-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools employing de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
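To make the central notion concrete, the following toy sketch illustrates what it means to supplement each node-label relation with attributes. It is not the compressed representation developed in the paper: it uses a plain, uncompressed Python dictionary, and the names `build_counting_index` and `query_counts` are hypothetical, introduced here only for illustration. Each k-mer (graph node) maps to the labels (samples) it occurs in, and each such node-label pair carries a count and a list of positions.

```python
from collections import defaultdict


def build_counting_index(samples, k):
    """Toy, uncompressed stand-in for a counting annotation.

    samples: dict mapping a label (e.g. a sample name) to its sequence.
    Returns a dict: k-mer -> {label: (count, [positions])}.
    Hypothetical helper for illustration only.
    """
    index = defaultdict(dict)
    for label, sequence in samples.items():
        # Slide a window of length k over the sequence.
        for pos in range(len(sequence) - k + 1):
            kmer = sequence[pos:pos + k]
            count, positions = index[kmer].get(label, (0, []))
            # Attach both attributes to the (k-mer, label) relation.
            index[kmer][label] = (count + 1, positions + [pos])
    return dict(index)


def query_counts(index, sequence, k):
    """For each k-mer of the query, return its per-label counts.

    K-mers absent from the index yield an empty dict.
    """
    result = []
    for pos in range(len(sequence) - k + 1):
        kmer = sequence[pos:pos + k]
        hits = index.get(kmer, {})
        result.append({label: count for label, (count, _) in hits.items()})
    return result
```

Dropping the attributes and keeping only the set of labels per k-mer gives back an ordinary annotated (colored) de Bruijn graph, which is the sense in which the counting variant generalizes it.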
Publisher
Cold Spring Harbor Laboratory