Dual-objective fine-tuning of BERT for entity matching

Published:2021-06 Issue:10 Volume:14 Page:1913-1921
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Peeters Ralph¹, Bizer Christian¹

Affiliation:

1. University of Mannheim, Mannheim, Germany

Abstract

An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in their respective domains. For data integration, this means that shared identifiers are often available for a subset of the entity descriptions to be integrated but not for the others. The challenge in these settings is to learn a matcher for entity descriptions without identifiers, using the entity descriptions that contain identifiers as training data. The task can be approached by learning a binary classifier that distinguishes pairs of entity descriptions for the same real-world entity from pairs describing different entities. It can also be modeled as a multi-class classification problem by learning classifiers that identify descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase the matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, provided that enough training data is available for both objectives. To gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products.
Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better than RNN-based models at focusing on relevant word classes.
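
The dual objective described in the abstract can be illustrated with a minimal sketch of a combined training loss for one pair of entity descriptions. This is not the authors' implementation: the unweighted sum of the three terms, the function names, and the plain-Python logit inputs are all assumptions made for illustration. In an actual system the logits would come from classification heads on top of BERT.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against class index `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def binary_cross_entropy(logit, label):
    """Binary cross-entropy on a single logit; label is 1 (match) or 0 (non-match)."""
    # softplus(z) = log(1 + exp(z)), computed in a numerically stable way
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit

def dual_objective_loss(match_logit, id_logits_left, id_logits_right,
                        is_match, id_left, id_right):
    """Hypothetical combined loss for one training pair.

    Assumption: the binary match loss and the two entity-identifier
    losses (one per description in the pair) are simply added.
    """
    return (binary_cross_entropy(match_logit, is_match)
            + softmax_cross_entropy(id_logits_left, id_left)
            + softmax_cross_entropy(id_logits_right, id_right))

# Example: a matching pair where both descriptions refer to entity 2
# out of three known entity identifiers (made-up logits).
loss = dual_objective_loss(
    match_logit=2.0,
    id_logits_left=[0.1, 0.2, 3.0],
    id_logits_right=[0.0, 0.5, 2.5],
    is_match=1, id_left=2, id_right=2,
)
```

Because the identifier terms require a label for every description in the pair, a loss of this shape can only be trained on descriptions that carry identifiers, which is consistent with the abstract's caveat that enough training data must be available for both objectives.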

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3467861.3467878

Cited by 37 articles.

1. Enhancing Entity Resolution with a hybrid Active Machine Learning framework: Strategies for optimal learning in sparse datasets;Information Systems;2024-11

2. Construction of Knowledge Graphs: Current State and Challenges;Information;2024-08-22

3. LRER: A Low-Resource Entity Resolution Framework with Hybrid Information;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30

4. Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13

5. A Critical Re-evaluation of Record Linkage Benchmarks for Learning-Based Matching Algorithms;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13