Scalable Methods for Measuring the Connectivity and Quality of Large Numbers of Linked Datasets

Author:

Mountantonakis Michalis¹, Tzitzikas Yannis¹

Affiliation:

1. Institute of Computer Science, FORTH-ICS, Greece & Computer Science Department, University of Crete, Greece

Abstract

Although the ultimate objective of Linked Data is linking and integration, it is not currently evident how connected the current Linked Open Data (LOD) cloud is. In this article, we focus on methods, supported by special indexes and algorithms, for performing measurements related to the connectivity of more than two datasets. Such measurements are useful in various tasks, including (a) Dataset Discovery and Selection; (b) Object Coreference, i.e., obtaining complete information about a set of entities, including provenance information; (c) Data Quality Assessment and Improvement, i.e., assessing the connectivity between any set of datasets and monitoring their evolution over time, as well as estimating data veracity; (d) Dataset Visualizations; and various other tasks. Since it would be prohibitively expensive to perform all these measurements in a naïve way, we introduce indexes (and their construction algorithms) that can speed up such tasks. In brief, we introduce (i) a namespace-based prefix index; (ii) a sameAs catalog for computing the symmetric and transitive closure of the owl:sameAs relationships encountered in the datasets; (iii) a semantics-aware element index (that exploits the aforementioned indexes); and, finally, (iv) two lattice-based incremental algorithms for speeding up the computation of the intersection of URIs of any set of datasets. To enhance scalability, we propose parallel index construction algorithms and parallel lattice-based incremental algorithms, evaluate the achieved speedup using either a single machine or a cluster of machines, and provide insights regarding the factors that affect efficiency. Finally, we report measurements about the connectivity of the (billion-triples-sized) LOD cloud that had not been carried out before.
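To make the idea behind the sameAs catalog concrete, the following is a minimal sketch of how the symmetric and transitive closure of owl:sameAs pairs can be grouped into equivalence classes. It uses a generic union-find (disjoint-set) structure, which is an assumption made for illustration and not necessarily the construction algorithm described in the article; the function name `same_as_classes` and the example URIs are hypothetical.

```python
from collections import defaultdict


def same_as_classes(pairs):
    """Group URIs into equivalence classes induced by sameAs pairs.

    `pairs` is an iterable of (uri_a, uri_b) tuples. The result is a list
    of frozensets, one per class of the symmetric and transitive closure.
    This is an illustrative union-find sketch, not the article's algorithm.
    """
    parent = {}

    def find(uri):
        # Register unseen URIs as their own representative.
        parent.setdefault(uri, uri)
        root = uri
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[uri] != root:
            parent[uri], uri = root, parent[uri]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merging the two classes covers symmetry and transitivity.
            parent[root_b] = root_a

    classes = defaultdict(set)
    for uri in list(parent):
        classes[find(uri)].add(uri)
    return [frozenset(members) for members in classes.values()]


if __name__ == "__main__":
    # Hypothetical URIs from three datasets.
    example = [
        ("dbpedia:Heraklion", "wikidata:Q160544"),
        ("geonames:261745", "wikidata:Q160544"),
        ("dbpedia:Crete", "wikidata:Q34374"),
    ]
    for cls in same_as_classes(example):
        print(sorted(cls))
```

Once every URI is mapped to a class representative, URIs from different datasets that denote the same entity can be counted as a single common element when measuring connectivity.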

Funder

the General Secretariat for Research and Technology (GSRT) and the Hellenic Foundation for Research and Innovation

European Union's Horizon 2020 research BlueBRIDGE project

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems and Management, Information Systems

Cited by 19 articles.

1. Open dataset discovery using context-enhanced similarity search;Knowledge and Information Systems;2022-09-04

2. Modular framework for similarity-based dataset discovery using external knowledge;Data Technologies and Applications;2022-02-15

3. LODChain: Strengthen the Connectivity of Your RDF Dataset to the Rest LOD Cloud;The Semantic Web – ISWC 2022;2022

4. How Your Cultural Dataset is Connected to the Rest Linked Open Data?;Trandisciplinary Multispectral Modelling and Cooperation for the Preservation of Cultural Heritage;2022

5. Large scale services for connecting and integrating hundreds of linked datasets;ACM SIGWEB Newsletter;2021-09
