SparkDWM: a scalable design of a Data Washing Machine using Apache Spark

Published: 2024-09-09
Volume: 7
ISSN: 2624-909X
Container-title: Frontiers in Big Data
Short-container-title: Front. Big Data

Author:

Hagan Nicholas Kofi Akortia, Talburt John R.

Abstract

Data volume has been one of the fastest-growing assets of most real-world applications. This growth increases the rate of human errors such as duplicated records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution (ER) is an ETL process that aims to resolve data inconsistencies by ensuring that entities refer to the same real-world objects. One of the main challenges of most traditional Entity Resolution systems is scaling to meet rising data needs. This research aims to refactor a working proof-of-concept entity resolution system, the Data Washing Machine, to be highly scalable using the Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Datasets, and we improve the Data Washing Machine design to use intrinsic metadata information from references. Using 18 synthetically generated datasets, we demonstrate that our system achieves the same results as the legacy Data Washing Machine. We also test the scalability of our system using a variety of real-world benchmark ER datasets ranging in size from a few thousand to millions of records. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that our system can find more clusters when given optimal starting parameters for clustering.

Publisher

Frontiers Media SA

References (33 articles)

1. A scalable, hybrid entity resolution process for unstandardized entity references; Al Sarkhi; J. Comp. Sci. Colleg., 2020

2. An analysis of the effect of stop words on the performance of the matrix comparator for entity resolution; Al Sarkhi; J. Comp. Sci. Colleg.

3. Estimating the parameters for linking unstandardized references with the matrix comparator; Al Sarkhi; J. Inform. Technol. Manag.

4. Optimal starting parameters for unsupervised data clustering and cleaning in the data washing machine; Anderson; Proceedings of the Future Technologies Conference (FTC) 2023, Volume 2, 2023

5. A Spark-based workflow for probabilistic record linkage of healthcare data; Pita; EDBT/ICDT Workshops, 2015