Sherlock: an open-source data platform to store, analyze and integrate Big Data for computational biologists

Published:2022-08-10 Volume:10 Page:409
ISSN:2046-1402
Container-title:F1000Research
language:en
Short-container-title:F1000Res

Author:

Bohar Balazs, Fazekas David, Madgwick Matthew, Csabai Luca, Olbei Marton, Korcsmáros Tamás, Szalay-Beko Mate

Abstract

In the era of Big Data, data collection underpins biological research more than ever before, and in many cases it can be as time-consuming as the analysis itself. It requires downloading multiple public databases with differing data structures and, in general, spending days preparing the data before any biological question can be answered. Here, we introduce Sherlock, an open-source, cloud-based big data platform (https://earlham-sherlock.github.io/) that addresses this problem. Sherlock fills this gap by giving computational biologists a way to store, convert, query, share and generate biological data, streamlining bioinformatics data management. The platform offers a simple interface to big data technologies such as Docker and PrestoDB, and is designed to let users analyze, process, query and extract information from extremely large and complex data sets. Sherlock can also handle differently structured data (interaction, localization, or genomic sequence) from several sources and convert them to a common, optimized storage format such as Optimized Row Columnar (ORC). This format lets Sherlock execute distributed analytical queries on extremely large data files quickly and efficiently, and makes it easier to share datasets between teams. The Sherlock platform is freely available on GitHub and contains dedicated loader scripts for structured data sources, covering genomics, interaction and expression databases. With these loader scripts, users can quickly create and work with specific file formats, such as JavaScript Object Notation (JSON) or ORC. For computational biology and large-scale bioinformatics projects, Sherlock provides an open-source platform that supports data management, analytics, integration and collaboration through modern big data technologies.
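
The loader-script idea in the abstract (turning a structured source file into JSON) can be illustrated with a minimal Python sketch. This is not Sherlock's actual loader code: the function name `tsv_to_json_lines`, the column names `source`, `target` and `score`, and the sample rows (including the score values) are all assumptions made for illustration. It only shows the general shape of such a step, converting a tab-separated interaction table into one JSON object per line.

```python
import csv
import io
import json


def tsv_to_json_lines(tsv_text):
    """Convert tab-separated interaction rows into JSON Lines text.

    Hypothetical helper: the column names below are assumed for this
    sketch and are not taken from the Sherlock code base.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        record = {
            "source": row["source"],
            "target": row["target"],
            # Store the score as a number rather than a string.
            "score": float(row["score"]),
        }
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines)


# Illustrative input only: the scores are invented, not real measurements.
SAMPLE_TSV = (
    "source\ttarget\tscore\n"
    "P04637\tQ00987\t0.98\n"
    "P38398\tQ92560\t0.75\n"
)

json_lines = tsv_to_json_lines(SAMPLE_TSV)
print(json_lines)
```

Converting the resulting JSON records to ORC and querying them with PrestoDB would be separate steps that this sketch does not cover.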

Funder

Biotechnology and Biological Sciences Research Council

Quadram Institute Bioscience

Earlham Institute

Publisher

F1000 Research Ltd

Subject

General Pharmacology, Toxicology and Pharmaceutics; General Immunology and Microbiology; General Biochemistry, Genetics and Molecular Biology; General Medicine

Link

https://f1000research.com/articles/10-409/v2/pdf
