Affiliations:
1. Department of Information Systems and Operations Management, Vienna University of Economics and Business, Austria
2. Department of Computer Science, Federal University of Juiz de Fora, Brazil
3. Vienna University of Technology, Austria
4. Complexity Science Hub Vienna, Austria
Abstract
In this paper, we examine the role of constraints in maintaining data integrity in knowledge graphs, with a specific focus on Wikidata, one of the most extensive collaboratively maintained open data knowledge graphs on the Web. The World Wide Web Consortium (W3C) recommends the Shapes Constraint Language (SHACL) as the constraint language for validating knowledge graphs; it comes in two levels of expressivity, SHACL-Core and SHACL-SPARQL. Despite the availability of SHACL, Wikidata currently represents its property constraints through its own RDF data model, which relies on Wikidata’s specific reification mechanism based on authoritative namespaces, and on partially ambiguous natural language definitions. We investigate whether and how the semantics of Wikidata property constraints can be formalized using SHACL-Core, using SHACL-SPARQL, and directly as SPARQL queries. While the expressivity of SHACL-Core turns out to be insufficient for expressing all Wikidata property constraint types, we present SPARQL queries that identify violations for all 32 current Wikidata constraint types. We compare the semantics of this unambiguous SPARQL formalization with Wikidata’s violation reporting system and discuss the limitations of evaluating the queries via Wikidata’s public SPARQL query endpoint, which stem from its current scalability. On the one hand, our study sheds light on the unique characteristics of constraints defined by the Wikidata community, with the aim of improving the quality and accuracy of data in this collaborative knowledge graph. On the other hand, as a byproduct, our formalization extends existing benchmarks for both SHACL and SPARQL with a challenging, large-scale real-world use case.