Abstract
Motivation: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets.

Results: We used REINDEER to index the abundances of sequences within 2,585 human RNA-seq experiments in 45 hours using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of 4 billion distinct k-mers across 2,585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph (DBG) of each dataset, then conceptually merges those DBGs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances.

Availability: https://github.com/kamimrcht/REINDEER

Contact: camille.marchet@univ-lille.fr
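The kind of query described in the Results (per-dataset abundance and presence/absence of k-mers) can be illustrated with a toy sketch. The Python below is not REINDEER's implementation: it skips de Bruijn graph compaction and monotigs entirely, and simply stores one count per dataset for every canonical k-mer. The function names (`build_index`, `query`) and the sample reads are hypothetical, chosen only for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq, k):
    """Yield the canonical form of every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield canonical(seq[i:i + k])


def build_index(datasets, k):
    """Map each canonical k-mer to a list of counts, one entry per dataset."""
    index = {}
    for d, reads in enumerate(datasets):
        counts = Counter(km for read in reads for km in kmers(read, k))
        for km, count in counts.items():
            index.setdefault(km, [0] * len(datasets))[d] = count
    return index


def query(index, seq, k, n_datasets):
    """For each dataset, return (fraction of query k-mers present, their counts)."""
    query_kmers = list(kmers(seq, k))
    results = []
    for d in range(n_datasets):
        hits = [index[km][d] for km in query_kmers
                if km in index and index[km][d] > 0]
        results.append((len(hits) / len(query_kmers), hits))
    return results


# Hypothetical toy input: two "datasets", each a list of reads.
datasets = [["ACGTAC"], ["ACGTAC", "ACGTAC", "TTTTT"]]
index = build_index(datasets, k=3)
answer = query(index, "AAAC", k=3, n_datasets=len(datasets))
```

A plain dictionary like this stores a full count vector for every k-mer, which would not scale to billions of k-mers. The abstract's monotigs, which group k-mers of similar abundances, are what address that redundancy in the actual method.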
Publisher
Cold Spring Harbor Laboratory