MetaGraph: Indexing and Analysing Nucleotide Archives at Petabase-scale

Published: 2020-10-02

Author:

Karasikov Mikhail, Mustafa Harun, Danciu Daniel, Zimmermann Marc, Barber Christopher, Rätsch Gunnar, Kahles André

Abstract

The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offer different practically relevant trade-offs between the space taken by the index and its query performance. MetaGraph provides a flexible methodological framework allowing for index construction to be scaled from consumer laptops to distribution onto a cloud compute cluster for processing terabases to petabases of input data. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework's scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI's Sequence Read Archive, representing a total input of more than three petabases.

Besides demonstrating the utility of MetaGraph indexes on key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including those over 450,000 microbial WGS records, more than 110,000 fungi WGS records, and more than 20,000 whole metagenome sequencing records. A subset of these indexes is made available online for interactive queries. All indexes created from public data comprising in total more than 1 million records are available for download or usage in the cloud.

As an example of our indexes' integrative analysis capabilities, we introduce the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set. We apply this technique to differentially assemble contigs to identify pathogenic agents transfected via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and use them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA-editing in GTEx and TCGA.
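The idea behind differential assembly, as the abstract describes it, can be illustrated with a toy sketch at the level of plain k-mer sets. This is not MetaGraph's implementation or API: the function names (`kmers`, `sample_kmers`, `differential_kmers`) are hypothetical, the choice of "present in at least one foreground sample, absent from every background sample" is an assumption about how the sets are combined, and the sketch stops at k-mers rather than assembling them into contigs.

```python
from typing import Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_kmers(reads: Iterable[str], k: int) -> Set[str]:
    """Collect the k-mers of every read in one sample."""
    result: Set[str] = set()
    for read in reads:
        result |= kmers(read, k)
    return result


def differential_kmers(
    foreground: List[List[str]],
    background: List[List[str]],
    k: int,
) -> Set[str]:
    """k-mers seen in any foreground sample and in no background sample.

    Assumption: foreground samples are combined by union; a stricter
    variant could require presence in every foreground sample instead.
    """
    fg: Set[str] = set()
    for sample in foreground:
        fg |= sample_kmers(sample, k)
    bg: Set[str] = set()
    for sample in background:
        bg |= sample_kmers(sample, k)
    return fg - bg


if __name__ == "__main__":
    foreground = [["ACGTAC"], ["CGTACG"]]  # two toy foreground samples
    background = [["ACGT"]]                # one toy background sample
    print(sorted(differential_kmers(foreground, background, 4)))
    # → ['CGTA', 'GTAC', 'TACG']
```

In a real index the k-mer sets would not be materialised in memory like this; the sketch only shows the set relationship that the foreground and background labels define.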

Publisher

Cold Spring Harbor Laboratory

References: 62 articles.

1. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database

2. Bahar Alipanahi, Alan Kuhnle, Simon J Puglisi, Leena Salmela, and Christina Boucher. Succinct Dynamic de Bruijn Graphs. Bioinformatics, 05 2020. btaa546.

3. Bahar Alipanahi, Martin D Muggli, Musa Jundi, Noelle R Noyes, and Christina Boucher. Metagenome SNP calling via read colored de Bruijn graphs. Bioinformatics, 2020.

4. Alexandre Almeida, Stephen Nayfach, Miguel Boland, Francesco Strozzi, Martin Beracochea, Zhou Jason Shi, Katherine S Pollard, Ekaterina Sakharova, Donovan H Parks, Philip Hugenholtz, et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nature Biotechnology, pages 1–10, 2020.

5. Fatemeh Almodaresi, Prashant Pandey, Michael Ferdman, Rob Johnson, and Rob Patro. An efficient, scalable and exact representation of high-dimensional color information enabled via de Bruijn graph search. In International Conference on Research in Computational Molecular Biology, pages 1–18. Springer, 2019.

Cited by 36 articles.

1. Genomic Diversity as a Key Conservation Criterion: Proof‐of‐Concept From Mammalian Whole‐Genome Resequencing Data;Evolutionary Applications;2024-09

2. LexicMap: efficient sequence alignment against millions of prokaryotic genomes;2024-08-31

3. Insertions and Deletions: Computational Methods, Evolutionary Dynamics, and Biological Applications;Molecular Biology and Evolution;2024-08-22

4. Constrained enumeration of k-mers from a collection of references with metadata;2024-05-31

5. Indexing and searching petabase-scale nucleotide resources;Nature Methods;2024-05-16