Deep indexed active learning for matching heterogeneous entity representations

Published:2021-09 Issue:1 Volume:15 Page:31-45
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Jain Arjit¹, Sarawagi Sunita¹, Sen Prithviraj²

Affiliation:

1. IIT Bombay

2. IBM Research

Abstract

Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real-world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active learning is a promising approach for ER in low-resource settings. However, the search space for finding informative samples for the user to label grows quadratically for instance-pair tasks, making active learning hard to scale. Previous works in this setting rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss important regions of the product space, leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall, and running time.

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3485450.3485455

References: 72 articles.

1. To Index or Not to Index: Optimizing Exact Maximum Inner Product Search

2. A Fast Linkage Detection Scheme for Multi-Source Information Integration

3. On active learning of record matching packages

4. David Arthur and Sergei Vassilvitskii. 2007. K-Means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (New Orleans, Louisiana) (SODA '07). Society for Industrial and Applied Mathematics, USA, 1027-1035.

5. Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. 2020. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=ryghZJBKPS

Cited by 13 articles.

1. Enhancing Entity Resolution with a hybrid Active Machine Learning framework: Strategies for optimal learning in sparse datasets;Information Systems;2024-11

2. Graph Deep Active Learning Framework for Data Deduplication;Big Data Mining and Analytics;2024-09

3. Open benchmark for filtering techniques in entity resolution;The VLDB Journal;2024-07-09

4. A Critical Re-evaluation of Record Linkage Benchmarks for Learning-Based Matching Algorithms;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13

5. ERABQS: entity resolution based on active machine learning and balancing query strategy;Journal of Intelligent Information Systems;2024-03-26