Affiliation:
1. School of Mathematical and Computer Science, Heriot-Watt University, Edinburgh, Currie EH14 4AS, UK
2. School of Informatics, The University of Edinburgh, Edinburgh, EH8 9AB, UK
Abstract
Wikidata is a collaborative multi-purpose Knowledge Graph (KG) with the unique feature of adding provenance data to the statements of items as a reference. More than 73% of Wikidata statements have provenance metadata; however, few studies exist on the referencing quality in this KG, focusing only on the relevancy and trustworthiness of external sources. While there are existing frameworks to assess the quality of Linked Data, and in some aspects their metrics investigate provenance, there are none focused on reference quality. We define a comprehensive referencing quality assessment framework based on Linked Data quality dimensions, such as completeness and understandability. We implement the objective metrics of the assessment framework as the Referencing Quality Scoring System – RQSS. The system provides quantified scores by which the referencing quality can be analyzed and compared. RQSS scripts can also be reused to monitor the referencing quality regularly. Due to the scale of Wikidata, we have used well-defined subsets to evaluate the quality of references in Wikidata using RQSS. We evaluate RQSS over three topical subsets: Gene Wiki, Music, and Ships, corresponding to three Wikidata WikiProjects, along with four random subsets of various sizes. The evaluation shows that RQSS is practical and provides valuable information, which can be used by Wikidata contributors and project holders to identify the quality gaps. Based on RQSS, the average referencing quality in Wikidata subsets is 0.58 out of 1. Random subsets (representative of Wikidata) have higher overall scores than topical subsets by 0.05, with Gene Wiki having the highest scores amongst topical subsets. Regarding referencing quality dimensions, all subsets have high scores in accuracy, availability, security, and understandability, but have weaker scores in completeness, verifiability, objectivity, and versatility. Although RQSS is developed based on the Wikidata RDF model, its referencing quality assessment framework can be applied to KGs in general.