Unsupervised Graph-Based Entity Resolution for Complex Entities

Author:

Kirielle Nishadi 1, Christen Peter 1, Ranbaduge Thilina 1

Affiliation:

1. School of Computing, The Australian National University, Canberra, ACT, Australia

Abstract

Entity resolution (ER) is the process of linking records that refer to the same entity. Traditionally, this process compares attribute values of records to calculate similarities and then classifies pairs of records as referring to the same entity or not based on these similarities. Recently developed graph-based ER approaches combine relationships between records with attribute similarities to improve linkage quality. Most of these approaches only consider databases containing basic entities that have static attribute values and static relationships, such as publications in bibliographic databases. In contrast, temporal record linkage addresses the problem where attribute values of entities can change over time. However, neither existing graph-based ER nor temporal record linkage can achieve high linkage quality on databases with complex entities, where an entity (such as a person) can change its attribute values over time while having different relationships with other entities at different points in time. In this article, we propose an unsupervised graph-based ER framework that is aimed at linking records of complex entities. Our framework provides five key contributions. First, we propagate positive evidence encountered when linking records to use in subsequent links by propagating attribute values that have changed. Second, we employ negative evidence by applying temporal and link constraints to restrict which candidate record pairs to consider for linking. Third, we leverage the ambiguity of attribute values to disambiguate similar records that nonetheless belong to different entities. Fourth, we adaptively exploit the structure of relationships to link records that have different relationships. Fifth, using graph measures, we refine matched clusters of records by removing likely wrong links between records.
We conduct extensive experiments on seven real-world datasets from different domains showing that on average our unsupervised graph-based ER framework can improve precision by up to 25% and recall by up to 29% compared to several state-of-the-art ER techniques.
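
As a rough illustration of two ideas mentioned in the abstract, the sketch below combines the traditional step (compare attribute values, then classify pairs by a similarity threshold) with a simple temporal constraint used as negative evidence to discard impossible candidate pairs. It is not the authors' framework: the toy records, the field names, the rule in `violates_temporal_constraint`, and the 0.8 threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: each one is a single event that mentions a person.
records = [
    {"id": "r1", "name": "mary smith", "event": "birth", "year": 1860},
    {"id": "r2", "name": "mary smyth", "event": "marriage", "year": 1884},
    {"id": "r3", "name": "mary smith", "event": "death", "year": 1850},
]


def name_similarity(a, b):
    """Approximate string similarity of the two name values, in [0, 1]."""
    return SequenceMatcher(None, a["name"], b["name"]).ratio()


def violates_temporal_constraint(a, b):
    """Assumed rule: no event of a person can precede their birth
    or follow their death."""
    for x, y in ((a, b), (b, a)):
        if x["event"] == "birth" and y["year"] < x["year"]:
            return True
        if x["event"] == "death" and y["year"] > x["year"]:
            return True
    return False


def candidate_links(recs, threshold=0.8):
    """Return id pairs that pass the constraint and the similarity threshold."""
    links = []
    for a, b in combinations(recs, 2):
        # Negative evidence: skip pairs that cannot be the same person.
        if violates_temporal_constraint(a, b):
            continue
        # Traditional step: classify the pair by attribute similarity.
        if name_similarity(a, b) >= threshold:
            links.append((a["id"], b["id"]))
    return links


print(candidate_links(records))
```

With this toy data, `r1` and `r3` share an identical name but are never compared on similarity, because a death in 1850 cannot belong to a person born in 1860. Only the pair `r1`, `r2` survives.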

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Cited by 5 articles.

1. Enhancing entity resolution with multichannel BERT: a comprehensive approach;Third International Conference on Algorithms, Microchips, and Network Applications (AMNA 2024);2024-06-08

2. Better entity matching with transformers through ensembles;Knowledge-Based Systems;2024-06

3. A comprehensive survey of fake news in social networks: Attributes, features, and detection approaches;Journal of King Saud University - Computer and Information Sciences;2023-06

4. Training Data Selection for Record Linkage Classification;Symmetry;2023-05-10

5. Exploring the use of topological data analysis to automatically detect data quality faults;Frontiers in Big Data;2022-12-05
