Wikidata subsetting: Approaches, tools, and evaluation

Author:

Hosseini Beghaeiraveri Seyed Amir (1), Labra Gayo Jose Emilio (2), Waagmeester Andra (3), Ammar Ammar (4), Gonzalez Carolina (5), Slenter Denise (4), Ul-Hasan Sabah (5, 6), Willighagen Egon (4), McNeill Fiona (7), Gray Alasdair J.G. (1)

Affiliation:

1. School of Mathematical and Computer Science, Heriot-Watt University, Edinburgh, UK

2. University of Oviedo, Oviedo, Spain

3. Micelio, Belgium

4. Dept of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, Netherlands

5. The Scripps Research Institute, US

6. Hologic Inc, US

7. School of Informatics, The University of Edinburgh, UK

Abstract

Wikidata is a massive Knowledge Graph (KG), including more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The large volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100 GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers only require a limited range of data from Wikidata, focused on a particular topic for their use case. Subsetting is the process of defining and extracting the required data range from the KG; this process has received increasing attention in recent years. Several approaches and specific tools have been developed for subsetting, but they have not yet been evaluated. In this paper, we survey the available subsetting approaches, introducing their general strengths and weaknesses, and evaluate four practical tools specific to Wikidata subsetting (WDSub, KGTK, WDumper, and WDF) in terms of execution performance, extraction accuracy, and flexibility in defining the subsets. Results show that all four tools achieve at least 99.96% accuracy in extracting the defined items and at least 99.25% in extracting statements. The fastest tool in extraction is WDF, while the most flexible tool is WDSub. During the experiments, multiple subset use cases were defined and the extracted subsets were analyzed, yielding valuable information about the variety and quality of Wikidata that could not otherwise be obtained through the public Wikidata SPARQL endpoint.
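To make the idea of subsetting concrete, the following is a minimal sketch of one simple filter-based approach: keep only the entities whose "instance of" (P31) value belongs to a chosen set of classes. It is not the method of WDSub, KGTK, WDumper, or WDF. It assumes the input follows the layout of the Wikidata JSON dump (a JSON array with one entity per line); the function names `instance_of_ids` and `subset_by_class` and the sample records are hypothetical and trimmed for illustration.

```python
import json
from typing import Iterable, Iterator, Set


def instance_of_ids(entity: dict) -> Set[str]:
    """Collect the QIDs that an entity is declared an instance of (P31)."""
    ids = set()
    for statement in entity.get("claims", {}).get("P31", []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "novalue" / "somevalue" snaks
        value = snak.get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "id" in value:
            ids.add(value["id"])
    return ids


def subset_by_class(lines: Iterable[str], wanted: Set[str]) -> Iterator[dict]:
    """Yield entities having at least one P31 value in `wanted`.

    Assumes one entity per line, wrapped in "[" and "]" lines and
    separated by trailing commas, as in the Wikidata JSON dump.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # array brackets and blank lines carry no entity
        entity = json.loads(line)
        if instance_of_ids(entity) & wanted:
            yield entity


# Illustrative, heavily trimmed sample records (not real dump content).
sample_dump = [
    "[",
    '{"id": "Q42", "claims": {"P31": [{"mainsnak": {"snaktype": "value", '
    '"datavalue": {"value": {"id": "Q5"}}}}]}},',
    '{"id": "Q64", "claims": {"P31": [{"mainsnak": {"snaktype": "value", '
    '"datavalue": {"value": {"id": "Q515"}}}}]}}',
    "]",
]

# Keep only entities that are instances of Q5.
subset_ids = [e["id"] for e in subset_by_class(sample_dump, {"Q5"})]
print(subset_ids)
```

Because the function consumes lines lazily, the same sketch could in principle be pointed at a decompressed dump stream without loading the whole file into memory; a practical tool would additionally need to handle subclasses, qualifiers, and references.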

Publisher

IOS Press

Subject

Computer Networks and Communications, Computer Science Applications, Information Systems

