Author:
Hagan Nicholas Kofi Akortia, Talburt John R., Anderson Kris E., Hagan Deasia
Abstract
Traditional data curation processes typically depend on human intervention. As data volume and variety grow exponentially, organizations are striving to increase the efficiency of their data processes by automating manual steps and making them as unsupervised as possible. An additional challenge is making these unsupervised processes scalable enough to meet the demands of increased data volume. This paper describes the parallelization of an unsupervised entity resolution (ER) process. ER is a component of many data curation processes because it clusters records from multiple data sources that refer to the same real-world entity, such as the same customer, patient, or product. The ability to scale ER processes is particularly important because the computational effort of ER increases quadratically with data volume. The Data Washing Machine (DWM) is a previously proposed unsupervised ER system that clusters references from diverse data sources. This work addresses the single-threaded design of the DWM by adopting the parallelism of Hadoop MapReduce. Although the DWM is unsupervised, the proposed parallelization method can be applied both to supervised systems, where matching rules are created by experts, and to unsupervised systems, where expert intervention is not required. The DWM uses an entropy measure to self-evaluate the quality of record clustering. The current single-threaded implementations of the DWM in Python and Java do not scale beyond a few thousand records and rely on large, shared memory. The objective of this research is to address the two major shortcomings of the current DWM design, namely its creation and use of shared memory and its lack of scalability, by leveraging Hadoop MapReduce. We propose the Hadoop Data Washing Machine (HDWM), a MapReduce implementation of the legacy DWM. The scalability of the proposed system is demonstrated using publicly available ER datasets. Based on the results of our experiment, we conclude that HDWM can cluster from thousands to millions of equivalent references using multiple computational nodes with independent RAM and CPU cores.
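The general idea of distributing ER comparisons in a map/shuffle/reduce pattern can be illustrated with a minimal, single-process sketch. It is not the HDWM implementation: the token-based blocking keys, the Jaccard similarity, the 0.5 threshold, the sample records, and every function name below are assumptions chosen for illustration, and the DWM's entropy-based self-evaluation is not modeled. The point it shows is that records are compared only within a block that shares a key, so each block can be processed by an independent worker without shared memory.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(records):
    """Emit (blocking_key, record_id) pairs; here every token is a key (assumed scheme)."""
    for rec_id, text in records.items():
        for token in set(text.lower().split()):
            yield token, rec_id


def shuffle(pairs):
    """Group record ids by blocking key, standing in for the MapReduce shuffle."""
    blocks = defaultdict(set)
    for key, rec_id in pairs:
        blocks[key].add(rec_id)
    return blocks


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings (illustrative comparator)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def reduce_phase(blocks, records, threshold=0.5):
    """Compare records only within a block; return the matching id pairs."""
    links = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            if jaccard(records[a], records[b]) >= threshold:
                links.add((a, b))
    return links


def cluster(records, links):
    """Form clusters as the transitive closure of links, using union-find."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for r in records:
        groups[find(r)].add(r)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sample references, not taken from the paper's datasets.
records = {
    "r1": "John Smith 12 Oak Street",
    "r2": "Smith John 12 Oak St",
    "r3": "Mary Jones 4 Elm Road",
}
links = reduce_phase(shuffle(map_phase(records)), records)
clusters = cluster(records, links)
```

In this sketch `r1` and `r2` share four of six distinct tokens and are linked, while `r3` never lands in a block with either of them and is never compared. On a real cluster, the map and reduce functions would run on separate nodes and the final clustering step would itself need a distributed formulation.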