Abstract
AbstractNowadays, data integration must often manage noisy data, also containing attribute values written in natural language such as product descriptions or book reviews. In the data integration process, Entity Linkage has the role of identifying records that contain information referring to the same object. Modern Entity Linkage methods, in order to reduce the dimension of the problem, partition the initial search space into “blocks” of records that can be considered similar according to some metrics, comparing then only the records belonging to the same block and thus greatly reducing the overall complexity of the algorithm. In this paper, we propose two automatic blocking strategies that, differently from the traditional methods, aim at capturing the semantic properties of data by means of recent deep learning frameworks. Both methods, in a first phase, exploit recent research on tuple and sentence embeddings to transform the database records into real-valued vectors; in a second phase, to arrange the tuples inside the blocks, one of them adopts approximate nearest neighbourhood algorithms, while the other one uses dimensionality reduction techniques combined with clustering algorithms. We train our blocking models on an external, independent corpus, and then, we directly apply them to new datasets in an unsupervised fashion. Our choice is motivated by the fact that, in most data integration scenarios, no training data are actually available. We tested our systems on six popular datasets and compared their performances against five traditional blocking algorithms. The test results demonstrated that our deep-learning-based blocking solutions outperform standard blocking algorithms, especially on textual and noisy data.
Publisher
Springer Science and Business Media LLC
Subject
Computer Science Applications,Computational Mechanics
Reference36 articles.
1. Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional space. In: International conference on database theory
2. Aizawa A, Oyama K (2005) A fast linkage detection scheme for multi-source information integration. In: WIRI
3. Baxter R, Christen P, Churches T (2003) A comparison of fast blocking methods for record linkage. In: KDD
4. Bellman RE (2015) Adaptive control processes: a guided tour. Princeton University Press, Princeton
5. Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
Cited by
23 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献