Deep entity matching with pre-trained language models

Published:2020-09 Issue:1 Volume:14 Page:50-60
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Li Yuliang¹, Li Jinfeng¹, Suhara Yoshihiko¹, Doan AnHai², Tan Wang-Chiew¹

Affiliation:

1. Megagon Labs

2. University of Wisconsin Madison

Abstract

We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer-based language models. We cast EM as a sequence-pair classification problem and fine-tune such models for it with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves matching quality and outperforms the previous state of the art (SOTA) by up to 29% in F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. First, Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Second, Ditto summarizes strings that are too long so that only the essential information is retained and used for EM. Third, Ditto adapts a SOTA data augmentation technique for text to EM, augmenting the training data with (difficult) examples. This way, Ditto is forced to learn "harder" to improve the model's matching capability. These optimizations further boost the performance of Ditto by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the amount of labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
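To make "casting EM as a sequence-pair classification problem" concrete, the following is a minimal sketch of how two structured records might be flattened into the pair of strings that a sequence-pair classifier takes as input. The `[COL]`/`[VAL]` tags follow the serialization scheme described in the Ditto paper; the function names `serialize` and `to_sequence_pair` and the sample records are hypothetical and chosen only for illustration. The sketch stops before the model: in practice the two strings would be passed to a pre-trained model's tokenizer as a text pair, and a binary classification head would predict match or no-match.

```python
from typing import Dict, Tuple


def serialize(record: Dict[str, str]) -> str:
    """Flatten one record into a single string, tagging each attribute name and value."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in record.items())


def to_sequence_pair(left: Dict[str, str], right: Dict[str, str]) -> Tuple[str, str]:
    """Build the (text_a, text_b) pair that a sequence-pair classifier would consume."""
    return serialize(left), serialize(right)


# Hypothetical product records, invented for illustration only.
left = {"title": "instant immersion spanish deluxe 2.0", "price": "49.99"}
right = {"title": "instant immers spanish dlux 2", "price": "36.11"}

text_a, text_b = to_sequence_pair(left, right)
print(text_a)
print(text_b)
```

Keeping serialization separate from the model means the same pair of strings can be fed to BERT, DistilBERT, or RoBERTa without changing the data preparation step.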

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3421424.3421431

Cited by 167 articles.

1. Dual data mapping with fine-tuned large language models and asset administration shells toward interoperable knowledge representation;Robotics and Computer-Integrated Manufacturing;2025-02

2. Enhancing Multi-field B2B Cloud Solution Matching via Contrastive Pre-training;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24

3. OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24

4. Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24

5. Extended ProMap datasets for product mapping;Electronic Commerce Research;2024-08-22