Affiliation:
1. International Hellenic University, Greece
2. Hellenic Open University, Greece
Abstract
Huge volumes of streaming data are collected nowadays from a variety of sources, such as sensors, mobile devices, or even raw log files. The unprecedented rate at which these data are generated and collected calls for novel record linkage methods that identify matching record pairs, i.e., pairs that refer to the same real-world entity. To this end, blocking methods are used to reduce the number of candidate record pairs while still maintaining high levels of accuracy. This paper introduces ExpBlock, a randomized record linkage structure, which guarantees that both the most frequently accessed and the most recently used blocks remain in main memory and, additionally, that the records within a block are renewed on a rolling basis. Specifically, the probability that inactive blocks and older records remain in main memory decays, in order to make room for more promising blocks and fresher records, respectively. We implement these features using random choices instead of cumbersome sorting data structures, favouring simplicity of implementation and efficiency. We show, through experimental evaluation, that ExpBlock scales efficiently to data streams, providing accurate results in a timely fashion.
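The decaying-retention idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `DecayingBlockStore`, the parameters `max_blocks`, `max_records`, and `decay`, the exponential form of the decay, and the two-letter blocking keys are all assumptions made for illustration. The sketch tracks recency only (not access frequency) and picks eviction victims by random sampling rather than by keeping blocks or records sorted.

```python
import math
import random


class DecayingBlockStore:
    """Toy block store with randomized, age-biased eviction (hypothetical sketch)."""

    def __init__(self, max_blocks, max_records, decay=0.5, seed=0):
        self.max_blocks = max_blocks    # assumed >= 1
        self.max_records = max_records  # assumed >= 1
        self.decay = decay              # assumed decay rate per time unit
        self.rng = random.Random(seed)
        self.blocks = {}       # blocking key -> list of (timestamp, record)
        self.last_access = {}  # blocking key -> time of the last insert

    def _survives(self, age):
        # Retention probability decays exponentially with age (an assumption).
        return self.rng.random() < math.exp(-self.decay * age)

    def _pick_victim(self, ages, tries=32):
        # Sample candidates uniformly at random and evict the first one that
        # fails its survival draw; no sorting by age is needed.
        for _ in range(tries):
            i = self.rng.randrange(len(ages))
            if not self._survives(ages[i]):
                return i
        # Fallback so the loop always terminates: evict uniformly at random.
        return self.rng.randrange(len(ages))

    def insert(self, key, record, now):
        # Make room for a new block by evicting a (probably inactive) one.
        if key not in self.blocks and len(self.blocks) >= self.max_blocks:
            keys = list(self.blocks)
            ages = [now - self.last_access[k] for k in keys]
            victim = keys[self._pick_victim(ages)]
            del self.blocks[victim]
            del self.last_access[victim]
        block = self.blocks.setdefault(key, [])
        # Renew a full block by dropping a (probably old) record.
        if len(block) >= self.max_records:
            ages = [now - ts for ts, _ in block]
            block.pop(self._pick_victim(ages))
        block.append((now, record))
        self.last_access[key] = now


# Usage: stream a few records, using the list position as the timestamp.
store = DecayingBlockStore(max_blocks=3, max_records=2)
stream = [("sm", "smith"), ("jo", "jones"), ("sm", "smyth"),
          ("br", "brown"), ("ta", "taylor"), ("sm", "smithe")]
for t, (key, rec) in enumerate(stream):
    store.insert(key, rec, now=t)
```

Because a block or record that was touched recently has an age near zero, its survival probability is close to one, so evictions tend to fall on idle blocks and stale records while memory use stays bounded by `max_blocks * max_records`.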
Publisher
Association for Computing Machinery (ACM)
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development
Cited by
2 articles.