Affiliations:
1. The Hong Kong University of Science and Technology
2. The University of Tokyo
3. University of Toronto & Vector Institute
4. The Hong Kong Polytechnic University
Abstract
Entity resolution (ER) approaches typically consist of a blocker and a matcher. The two components share the same goal but play different, cooperative roles: the blocker first quickly removes obvious non-matches, and the matcher then determines whether the remaining pairs refer to the same real-world entity. Despite the state-of-the-art performance achieved by deep learning methods on ER, these techniques often rely on large amounts of labeled data for training, which can be difficult or costly to obtain. There is therefore a need for effective ER systems in low-resource settings. In this work, we propose an end-to-end iterative Co-learning framework for ER that jointly trains the blocker and the matcher by exploiting their cooperative relationship. In particular, we let the blocker and the matcher share their learned knowledge with each other via iteratively updated pseudo labels, which broaden the supervision signal. To mitigate the impact of noise in the pseudo labels, we develop optimization techniques for three aspects: label generation, label selection, and model training. Through extensive experiments on benchmark datasets, we demonstrate that our framework outperforms baselines by 9.13--51.55% on average. Furthermore, our analysis confirms that the framework achieves mutual benefits between the blocker and the matcher.
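To make the co-learning idea in the abstract concrete, the sketch below shows one plausible shape of the iterative loop: the blocker and the matcher each pseudo-label the unlabeled candidate pairs, only confident labels are kept (the label-selection step), and each model is retrained on the other's selected pseudo labels together with the original labeled data. This is a minimal illustration under assumed interfaces, not the authors' implementation; the `fit`/`predict_proba` methods, the confidence `threshold`, and the data layout are all hypothetical.

```python
# Hypothetical sketch of an iterative blocker/matcher co-learning loop.
# `blocker` and `matcher` are assumed to expose scikit-learn-style
# fit(pairs_with_labels) and predict_proba(pairs) -> list[float] methods.

def co_learn(blocker, matcher, labeled_pairs, unlabeled_pairs,
             rounds=5, threshold=0.9):
    """Jointly train a blocker and a matcher via exchanged pseudo labels."""
    # Warm start both models on the small labeled set.
    blocker.fit(labeled_pairs)
    matcher.fit(labeled_pairs)

    for _ in range(rounds):
        # Label generation: each model scores the unlabeled candidate pairs.
        b_scores = blocker.predict_proba(unlabeled_pairs)
        m_scores = matcher.predict_proba(unlabeled_pairs)

        # Label selection: keep only confident predictions (probability near
        # 0 or 1) to limit the noise passed to the other model.
        pseudo_for_matcher = [
            (pair, score >= 0.5)
            for pair, score in zip(unlabeled_pairs, b_scores)
            if score >= threshold or score <= 1 - threshold
        ]
        pseudo_for_blocker = [
            (pair, score >= 0.5)
            for pair, score in zip(unlabeled_pairs, m_scores)
            if score >= threshold or score <= 1 - threshold
        ]

        # Model training: each model learns from the other's pseudo labels
        # together with the original labeled data, broadening supervision.
        matcher.fit(labeled_pairs + pseudo_for_matcher)
        blocker.fit(labeled_pairs + pseudo_for_blocker)

    return blocker, matcher
```

In this reading, the mutual benefit arises because the blocker's cheap, high-recall judgments expand the matcher's training signal while the matcher's precise decisions refine the blocker's, with the confidence threshold guarding against reinforcing each other's mistakes.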
Publisher
Association for Computing Machinery (ACM)