Affiliation:
1. Yahoo Research and Facebook Inc.
2. Yahoo Research and Microsoft Research Lab India
3. Yahoo Research and Stanford University
4. Yahoo Research and Google Inc.
Abstract
In entity matching, a fundamental issue in training a classifier to label pairs of entities as duplicates or nonduplicates is selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0-1 loss) of the classifier, which is an unsuitable metric for entity matching because of class imbalance (i.e., many more nonduplicate pairs than duplicate pairs). To address this, a recent paper [Arasu et al. 2010] proposes to maximize the recall of the classifier under the constraint that its precision be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst case.
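To make the class-imbalance point concrete, the following minimal sketch (not taken from the paper) compares 0-1 loss and recall for a trivial classifier that labels every pair a nonduplicate. The data set size, the duplicate ratio, and all function names are assumptions chosen purely for illustration.

```python
def confusion_counts(labels, predictions):
    """Return (tp, fp, fn, tn) for boolean labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    return tp, fp, fn, tn


def zero_one_loss(labels, predictions):
    """Fraction of pairs that are misclassified."""
    _, fp, fn, _ = confusion_counts(labels, predictions)
    return (fp + fn) / len(labels)


def recall(labels, predictions):
    """Fraction of true duplicates that the classifier finds."""
    tp, _, fn, _ = confusion_counts(labels, predictions)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical, heavily imbalanced data: 10 duplicates among 1000 pairs.
labels = [True] * 10 + [False] * 990

# A trivial classifier that never reports a duplicate.
all_negative = [False] * len(labels)

loss = zero_one_loss(labels, all_negative)
rec = recall(labels, all_negative)
print(f"0-1 loss: {loss:.3f}, recall: {rec:.3f}")
```

Under these assumed numbers the trivial classifier misclassifies only 1% of the pairs yet finds none of the duplicates, which is why a low 0-1 loss alone says little about matching quality.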
Our main result is an active learning algorithm that approximately maximizes the recall of the classifier while respecting a precision constraint, with provably sublinear label complexity (under certain distributional assumptions). Our algorithm uses as a black box any active learning module that minimizes 0-1 loss. We show that the label complexity of our algorithm is at most log n times the label complexity of the black box, and we also bound the difference between the recall of the classifier learned by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.
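The objective itself (maximize recall subject to a precision constraint) can be illustrated with a brute-force sketch. This is not the algorithm from the paper: it assumes every pair already has a label and a similarity score, so it uses all n labels rather than a sublinear number, and it restricts attention to simple score-threshold classifiers. The function name `best_threshold`, the scores, and the precision threshold are hypothetical.

```python
def best_threshold(scored_pairs, min_precision):
    """Pick the score threshold with the highest recall among those
    whose precision is at least min_precision.

    scored_pairs: list of (score, is_duplicate) tuples, all labeled.
    Returns (threshold, precision, recall), or None if no threshold
    satisfies the precision constraint.
    """
    total_duplicates = sum(1 for _, is_dup in scored_pairs if is_dup)
    ranked = sorted(scored_pairs, key=lambda sp: sp[0], reverse=True)

    best = None
    true_positives = 0
    for i, (score, is_dup) in enumerate(ranked, start=1):
        if is_dup:
            true_positives += 1
        # Evaluate only at the end of a group of tied scores, since a
        # threshold cannot separate pairs that share the same score.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        precision = true_positives / i
        rec = true_positives / total_duplicates if total_duplicates else 0.0
        if precision >= min_precision and (best is None or rec > best[2]):
            best = (score, precision, rec)
    return best


# Hypothetical similarity scores with ground-truth duplicate labels.
pairs = [
    (0.95, True),
    (0.90, True),
    (0.80, False),
    (0.70, True),
    (0.40, False),
    (0.20, False),
]

result = best_threshold(pairs, min_precision=0.75)
print(result)  # (0.7, 0.75, 1.0)
```

With the assumed data, predicting "duplicate" for every pair scoring at least 0.70 recovers all three duplicates at precision 0.75. Tightening the constraint to 0.9 forces the threshold up to 0.90 and the recall down to 2/3, which shows the trade-off that the precision constraint controls.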
Publisher
Association for Computing Machinery (ACM)
Cited by
11 articles.
1. Efficient and robust active learning methods for interactive database exploration;The VLDB Journal;2023-11-16
2. Selective data acquisition in the wild for model charging;Proceedings of the VLDB Endowment;2022-03
3. New Algorithms for Monotone Classification;Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems;2021-06-20
4. RPT;Proceedings of the VLDB Endowment;2021-04
5. The Four Generations of Entity Resolution;Synthesis Lectures on Data Management;2021-03-15