Big Data for Beginners

Published: 2023-08-18 Volume: 7
ISSN:2535-0897
Container-title:Biodiversity Information Science and Standards
Short-container-title:BISS

Author:

Huybrechts Pieter

Abstract

With the increasing number of datasets being published and made available through global aggregators, such as the Global Biodiversity Information Facility (GBIF), new opportunities have opened to answer research questions that previously could not be considered. Techniques for large-scale data integration offer benefits for the biodiversity research community (Heberling et al. 2021, Kays et al. 2020), profiting from the great and continuing efforts in data mobilisation and standardisation (such as Darwin Core, Wieczorek et al. 2012). These benefits include integrating several large data sources and enriching existing occurrence data with other information. Several commonly encountered barriers to large-scale use of biodiversity occurrence data exist, however. These include the lack of facilities for local storage of large and rapidly changing datasets, the computational power required for processing, unfamiliarity with existing toolsets, and insufficient resources to maintain big data infrastructure. These challenges are well documented in the context of high-throughput genomics (Marx 2013), and more recently in occurrence-based biodiversity research (for example Thessen et al. 2018). While these hurdles and bottlenecks are very real, several of them have solutions with a low cost of entry. The aim of this presentation is to encourage the community to explore ambitious queries, to combine and examine all available data in its totality, and to break down specific technical barriers, by providing a practical overview for researchers to maximise the power of large-scale data processing in their work. Big data processing may seem daunting, but tools accessible to users without a background in big data are available both for local workstations and for cloud computing services, allowing scalable data processing at low cost; examples include Databricks Community Edition and Apache Arrow.
Using these resources, researchers can incorporate larger datasets into existing protocols and, by doing so, uncover patterns and insights that would otherwise be impossible to obtain from smaller subsets of the ever-expanding, complex body of biodiversity occurrence data.

Publisher

Pensoft Publishers

Subject

General Engineering

Link

https://biss.pensoft.net/article/111301/download/pdf/

References (5 articles)

1. Data integration enables global biodiversity synthesis

2. Born‐digital biodiversity data: Millions and billions

3. The big challenges of big data

4. 20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration

5. Darwin Core: An Evolving Community-Developed Biodiversity Data Standard