Efficiently Processing and Storing Library Linked Data using Apache Spark and Parquet

Published:2018-09-26 Issue:3 Volume:37 Page:29-49
ISSN:2163-5226
Container-title:Information Technology and Libraries
Short-container-title:ITAL

Author:

Sharma Kumar, Marjit Ujjal, Biswas Utpal

Abstract

Resource Description Framework (RDF) is a commonly used data model in the Semantic Web environment. Libraries and various other communities have been using the RDF data model to store valuable data after it is extracted from traditional storage systems. However, because of the large volume of the data, processing and storing it is becoming increasingly difficult for traditional data-management tools. This challenge demands a scalable, distributed system that can manage data in parallel. In this article, a distributed solution is proposed for efficiently processing and storing the large volume of library linked data held in traditional storage systems. Apache Spark is used for parallel processing of large data sets, and a column-oriented schema is proposed for storing RDF data. The storage system is built on top of the Hadoop Distributed File System (HDFS) and uses the Apache Parquet format to store data in compressed form. The experimental evaluation showed that storage requirements were reduced significantly compared to Jena TDB, Sesame, and the RDF/XML and N-Triples file formats. SPARQL queries are processed using Spark SQL to query the compressed data. The experimental evaluation showed a good query response time, which decreases significantly as the number of worker nodes increases.
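
The abstract does not spell out the proposed column-oriented schema, so the following is only a minimal sketch under stated assumptions: a plain subject/predicate/object layout, written in pure Python rather than Spark or Parquet. The sample triples, prefixes, and function names are hypothetical and do not come from the article. The sketch shows why a columnar layout can suit RDF: values within one column repeat often, so dictionary encoding (one of the encodings Parquet supports) can replace them with small integer codes, and a simple triple pattern can be answered by scanning the columns.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical sample triples (illustrative only, not from the article).
TRIPLES: List[Triple] = [
    ("ex:book1", "dc:title", "Linked Data Basics"),
    ("ex:book1", "dc:creator", "ex:author1"),
    ("ex:book2", "dc:title", "Spark in Practice"),
    ("ex:book2", "dc:creator", "ex:author1"),
]


def to_columns(triples: List[Triple]) -> Dict[str, List[str]]:
    """Turn row-oriented triples into three parallel columns."""
    return {
        "subject": [t[0] for t in triples],
        "predicate": [t[1] for t in triples],
        "object": [t[2] for t in triples],
    }


def dictionary_encode(column: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with an integer code.

    Returns the dictionary (distinct values in first-seen order)
    and the list of codes, one per row.
    """
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def match_predicate(columns: Dict[str, List[str]], predicate: str) -> List[Tuple[str, str]]:
    """Rough analogue of: SELECT subject, object WHERE predicate = <predicate>."""
    rows = zip(columns["subject"], columns["predicate"], columns["object"])
    return [(s, o) for s, p, o in rows if p == predicate]


columns = to_columns(TRIPLES)
pred_dict, pred_codes = dictionary_encode(columns["predicate"])
titles = match_predicate(columns, "dc:title")
```

Here the four-row predicate column is reduced to a two-entry dictionary plus one small integer per row. In a real deployment the same idea would be delegated to Parquet's own encodings and to Spark SQL rather than hand-written as above.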

Publisher

Boston College University Libraries

Subject

Library and Information Sciences, Information Systems

Link

https://ejournals.bc.edu/index.php/ital/article/download/10177/pdf

Cited by 6 articles.

1. Advancing TBM Performance: Integrating Shield Friction Analysis and Machine Learning in Geotechnical Engineering;Geotechnics;2024-02-14

2. Ontology-Based Semantic Modeling of Coal Mine Roof Caving Accidents;Processes;2023-03-31

3. Linked data for libraries: Creating a global knowledge space, a systematic literature review;Journal of Information Science;2022-05-01

4. DPISCAN: Distributed and parallel architecture with indexing for structural clustering of massive dynamic graphs;International Journal of Data Science and Analytics;2022-01-12

5. Big data analysis and forensics;International Journal of Electronic Security and Digital Forensics;2022