Affiliation:
1. Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
2. University of Chinese Academy of Sciences
3. State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences
4. Brandeis University
Abstract
Entity resolution (ER) aims to identify data records referring to the same real-world entity. Most existing ER approaches rely on the assumption that the entity records to be resolved are homogeneous, i.e., their attributes are aligned. Unfortunately, entities in real-world datasets are often heterogeneous, usually coming from different sources and being represented using different attributes. Furthermore, the entities’ attribute values may be redundant, noisy, missing, misplaced, or misspelled—we refer to it as the dirty data problem. To resolve the above problems, this paper proposes an end-to-end hierarchical matching network (HierMatcher) for entity resolution, which can jointly match entities in three levels—token, attribute, and entity. At the token level, a cross-attribute token alignment and comparison layer is designed to adaptively compare heterogeneous entities. At the attribute level, an attribute-aware attention mechanism is proposed to denoise dirty attribute values. Finally, the entity level matching layer effectively aggregates all matching evidence for the final ER decisions. Experimental results show that our method significantly outperforms previous ER methods on homogeneous, heterogeneous and dirty datasets.
Publisher
International Joint Conferences on Artificial Intelligence Organization
Cited by
24 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献