Affiliation:
1. Polytechnique Montréal, Canada
Abstract
The cultural world offers a staggering amount of rich and varied metadata on cultural heritage, accumulated by governmental, academic, and commercial players. However, the variety of institutions involved means that the data are stored in as many complex and often incompatible models and standards, which limits their availability and explorability by the general public.
The adoption of Linked Open Data technologies allows a strong interlinking of these various databases as well as external connections with existing knowledge bases. However, as they often contain references to the same entities, the delicate issue of entity alignment becomes the central challenge, especially in the absence or scarcity of unique global identifiers.
To tackle this issue, we explored two approaches, one based on a set of heuristic rules and one based on masked language models (MLMs). We compare these two approaches, as well as different variations of MLMs, including some models trained on a different language, and various levels of data cleaning and labeling. Our results show that heuristics are a solid approach, but also that MLM-based entity alignment performs better, is robust to the data format, and does not require any form of data preprocessing, which was not the case for the heuristic approach in our experiments.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Computer Science Applications, Information Systems, Conservation