Affiliation:
1. Department of Science and Technology, University of Naples “Parthenope”, Naples, Italy
Abstract
Record linkage aims to identify records from multiple data sources that refer to the same real-world entity. It is a well-known data quality process, studied since the second half of the last century, with an established pipeline and a rich literature of case studies mainly covering the census, administrative, and health domains. This paper proposes a method to recognize matching records from real municipalities and banks through multiple similarity criteria and a neural network classifier. Starting from a labeled subset of the available data, several similarity measures are first combined and weighted to build a feature vector; a Multi-Layer Perceptron (MLP) network is then trained and tested to find matching pairs. For validation, seven real datasets were used (three from banks and four from municipalities), purposely chosen from the same geographical area to increase the probability of matches. Training involved only two municipalities, while testing involved all sources (municipalities vs. municipalities, banks vs. banks, and municipalities vs. banks). The proposed method achieved high precision and recall, clearly outperforming threshold-based competitors.
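The two steps described in the abstract (building a weighted similarity feature vector for a record pair, then training an MLP on labeled pairs) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the field names (`name`, `address`, `birth_date`), the three similarity measures, the uniform weights, the toy records, and the tiny one-hidden-layer network trained by plain stochastic gradient descent.

```python
import math
import random
from difflib import SequenceMatcher

def string_sim(a, b):
    # Character-level similarity in [0, 1] (difflib ratio, case-insensitive).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a, b):
    # Token-set overlap in [0, 1]; two empty strings count as identical.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def features(r1, r2, weights=(1.0, 1.0, 1.0)):
    # Hypothetical fields and weights: one weighted similarity per field.
    raw = (
        string_sim(r1["name"], r2["name"]),
        jaccard(r1["address"], r2["address"]),
        1.0 if r1["birth_date"] == r2["birth_date"] else 0.0,
    )
    return [w * s for w, s in zip(weights, raw)]

def forward(model, x):
    # One tanh hidden layer followed by a sigmoid output unit.
    W1, b1, W2, b2 = model
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    z = sum(w * hi for w, hi in zip(W2, h)) + b2
    return h, 1.0 / (1.0 + math.exp(-z))

def train_mlp(X, y, hidden=4, epochs=500, lr=0.5, seed=0):
    # Plain SGD on binary cross-entropy; weights are updated in place.
    rng = random.Random(seed)
    d = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            h, p = forward((W1, b1, W2, b2), x)
            g = p - t  # gradient of the loss w.r.t. the output pre-activation
            for j in range(hidden):
                gh = g * W2[j] * (1.0 - h[j] ** 2)  # backprop through tanh
                W2[j] -= lr * g * h[j]
                for i in range(d):
                    W1[j][i] -= lr * gh * x[i]
                b1[j] -= lr * gh
            b2 -= lr * g
    return W1, b1, W2, b2

def match_probability(model, r1, r2):
    return forward(model, features(r1, r2))[1]

# Toy records, invented for illustration only.
a = {"name": "Mario Rossi", "address": "Via Roma 10 Napoli", "birth_date": "1980-05-01"}
b = {"name": "Rossi Mario", "address": "via Roma 10 Napoli", "birth_date": "1980-05-01"}
c = {"name": "Lucia Bianchi", "address": "Corso Umberto 3 Salerno", "birth_date": "1975-11-23"}
d = {"name": "Mario Rosi", "address": "Via Roma 10", "birth_date": "1980-05-01"}

labeled = [(a, b, 1), (a, d, 1), (a, c, 0), (b, c, 0), (d, c, 0)]
model = train_mlp([features(r1, r2) for r1, r2, _ in labeled],
                  [label for _, _, label in labeled])
```

After training, `match_probability(model, a, b)` can be compared against a decision cut-off such as 0.5 to label a pair as matching. A threshold-based baseline of the kind the abstract compares against would instead apply a fixed cut-off directly to the similarity scores, without a learned classifier.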