Deep learning based approach to unstructured record linkage

Published:2021-10-18 Issue:6 Volume:17 Page:607-621
ISSN:1744-0084
Container-title:International Journal of Web Information Systems
language:en
Short-container-title:IJWIS

Author:

Jurek-Loughrey Anna

Abstract

Purpose: In the world of big data, data integration technology is crucial for maximising the capability of data-driven decision-making. Integrating data from multiple sources drastically expands the power of information and allows us to address questions that are impossible to answer using a single data source. Record Linkage (RL) is the task of identifying and linking records from multiple sources that describe the same real-world object (e.g. a person), and it plays a crucial role in the data integration process. RL is challenging, as it is uncommon for different data sources to share a unique identifier. Hence, the records must be matched by comparing their corresponding values. Most existing RL techniques assume that records across different data sources are structured and represented by the same scheme (i.e. set of attributes). Given the increasing number of heterogeneous data sources, those assumptions are rather unrealistic. The purpose of this paper is to propose a novel RL model for unstructured data.

Design/methodology/approach: In the previous work (Jurek-Loughrey, 2020), the authors proposed a novel approach to linking unstructured data based on the application of the Siamese Multilayer Perceptron model. It was demonstrated that the method performed on par with other approaches that make constraining assumptions regarding the data. This paper expands the previous work, originally presented at iiWAS2020 [16], by exploring new architectures of the Siamese Neural Network, which improve the generalisation of the RL model and make it less sensitive to parameter selection.

Findings: The experimental results confirm that the new Autoencoder-based architecture of the Siamese Neural Network obtains better results than the Siamese Multilayer Perceptron model proposed in (Jurek et al., 2020). Better results were achieved on three out of four data sets. Furthermore, it has been demonstrated that the second proposed (hybrid) architecture, based on integrating the Siamese Autoencoder with a Multilayer Perceptron model, makes the model more stable with respect to parameter selection.

Originality/value: To address the problem of unstructured RL, this paper presents a new deep learning based approach that improves the generalisation of the Siamese Multilayer Perceptron model and makes it less sensitive to parameter selection.
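The abstract does not give architectural details, so the following is only a minimal sketch of the general Siamese idea it refers to: two unstructured records pass through the same encoder (shared weights), and the resulting embeddings are compared. Everything concrete here is an assumption made for illustration and not the paper's method: the character n-gram hashing in `featurise`, the single `tanh` layer, the cosine comparison, the class name `SiameseMLP` and the example records. The weights are random and untrained; a real model would learn them from labelled matching and non-matching pairs.

```python
import math
import random
import re


def char_ngrams(text, n=3):
    """Split a free-text record into lowercase character n-grams."""
    cleaned = re.sub(r"\s+", " ", text.lower().strip())
    return [cleaned[i:i + n] for i in range(max(len(cleaned) - n + 1, 1))]


def featurise(text, dim):
    """Hash n-grams into a fixed-length count vector (hypothetical featurisation)."""
    vec = [0.0] * dim
    for gram in char_ngrams(text):
        # Deterministic hash; Python's built-in hash() is salted per process.
        idx = sum(ord(c) * 31 ** k for k, c in enumerate(gram)) % dim
        vec[idx] += 1.0
    return vec


class SiameseMLP:
    """One-layer encoder applied to both records with the same weights."""

    def __init__(self, in_dim=64, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        self.in_dim = in_dim
        # Untrained random weights, for illustration only.
        self.weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                        for _ in range(hidden_dim)]
        self.bias = [0.0] * hidden_dim

    def encode(self, vec):
        """Map a feature vector to an embedding through a tanh layer."""
        return [math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
                for row, b in zip(self.weights, self.bias)]

    def similarity(self, record_a, record_b):
        """Cosine similarity between the embeddings of two raw text records."""
        emb_a = self.encode(featurise(record_a, self.in_dim))
        emb_b = self.encode(featurise(record_b, self.in_dim))
        dot = sum(x * y for x, y in zip(emb_a, emb_b))
        norm = (math.sqrt(sum(x * x for x in emb_a))
                * math.sqrt(sum(y * y for y in emb_b)))
        return dot / norm if norm else 0.0


if __name__ == "__main__":
    model = SiameseMLP()
    # Made-up records with no shared schema or identifier.
    first = "John Smith, 12 High Street, London"
    second = "Smith J. lives at 12 High St London"
    print(model.similarity(first, second))
```

Because both records go through the same `encode` function, the score is symmetric and a record compared with itself scores 1.0 (up to rounding). With untrained weights the score for different records carries no meaning; it only shows where a learned similarity would be read off.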

Publisher

Emerald

Subject

Computer Networks and Communications, Information Systems

References: 33 articles.

1. Adaptive name matching in information integration;IEEE Intelligent Systems,2003

2. Signature verification using a ‘siamese’ time delay neural network;In: Advances in Neural Information Processing Systems,1994

3. A survey of indexing techniques for scalable record linkage and deduplication;IEEE Transactions on Knowledge and Data Engineering,2012

4. A comparison of string metrics for matching names and records;In: KDD Workshop on Data Cleaning and Object Consolidation,2003

Cited by 4 articles.

1. Towards Semantic Layer for Enhancing Blocking Entity Resolution Accuracy in Big Data;2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA);2024-02-01

2. Deep Learning Application in Continuous Authentication;Lecture Notes in Electrical Engineering;2024

3. Augmenting clinical trial economic analysis by linking cancer trial data to administrative data: current landscape and future opportunities;BMJ Open;2023-08

4. A Probabilistic Digital Twin for Leak Localization in Water Distribution Networks Using Generative Deep Learning;Sensors;2023-07-05