An Efficient and Scalable Algorithm to Mine Functional Dependencies from Distributed Big Data

Published:2022-05-19 Issue:10 Volume:22 Page:3856
ISSN:1424-8220
Container-title:Sensors
language:en
Short-container-title:Sensors

Author:

Wu Wanqing, Mao Wenyu

Abstract

A crucial step in improving data quality is discovering semantic relationships between data. Functional dependencies are rules that describe semantic relationships between data in relational databases, and they have recently been applied to improve data quality. However, traditional functional dependency discovery algorithms, when applied to distributed data, may produce errors and fail to scale to large datasets. To solve these problems, we propose a novel distributed functional dependency discovery algorithm based on Apache Spark that can effectively discover functional dependencies in large-scale data. The basic idea is to redistribute the data so that functional dependencies can be discovered in parallel on multiple nodes. The algorithm uses sampling to quickly remove invalid functional dependencies and a greedy task assignment strategy to balance the load. In addition, a prefix tree stores intermediate results during validation to avoid recomputing equivalence classes. Experimental results on real and synthetic datasets show that the proposed algorithm is more efficient than existing methods while ensuring accuracy.
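As a rough illustration of the ideas named in the abstract, the following single-machine Python sketch shows (1) checking whether a functional dependency holds by grouping rows on the left-hand side, (2) pruning candidates that are already violated on a sample, and (3) a greedy largest-first assignment of tasks to workers. It is not the authors' algorithm: it does not use Apache Spark, data redistribution, or a prefix tree, and the function names (`holds`, `prune_with_sample`, `greedy_assign`) and the per-task cost values are assumptions made for this example.

```python
from collections import defaultdict
import heapq


def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on rows.

    rows: list of dicts; lhs: tuple of column names; rhs: one column name.
    Rows that agree on every lhs column must also agree on rhs.
    """
    seen = {}  # lhs value combination -> the rhs value first observed for it
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False  # two rows agree on lhs but differ on rhs
    return True


def prune_with_sample(sample, candidates):
    """Keep only candidates (lhs, rhs) that are not violated on the sample.

    A dependency violated on a subset of the rows is also violated on the
    full data, so dropping it is safe. Survivors still need full validation.
    """
    return [(lhs, rhs) for lhs, rhs in candidates if holds(sample, lhs, rhs)]


def greedy_assign(task_costs, n_workers):
    """Assign tasks to workers: largest task first, to the least-loaded worker.

    task_costs: dict mapping a task name to an assumed cost estimate.
    Returns a dict mapping worker index to its list of task names.
    """
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker index)
    heapq.heapify(heap)
    assignment = defaultdict(list)
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + cost, worker))
    return dict(assignment)
```

For example, on a table where two rows share a `zip` value and the same `city`, `holds(rows, ("zip",), "city")` returns `True`, while a column whose equal values map to different cities fails the check. In a distributed setting, the validation step would run per partition after the data has been redistributed so that rows with the same left-hand-side values land on the same node.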

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/22/10/3856/pdf

References: 45 articles.

1. Data Science and its Relationship to Big Data and Data-Driven Decision Making

2. Inference and missing data

3. Dirty Data: The Effects of Screening Respondents Who Provide Low-Quality Data in Survey Research

4. Do Donors Discount Low-Quality Accounting Information?

5. Machine learning: Trends, perspectives, and prospects

Cited by 1 article.

1. A trajectory data warehouse solution for workforce management decision-making; Data Science and Management; 2023-06