Abstract
Purpose
Data quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violation of business rules. Because of the huge amount of data, manual cleaning alone is infeasible. Methods are therefore required that automatically detect dirty data and then repair and clean it. The purpose of this work is to extend the density-based data cleaning approach using conditional functional dependencies to achieve better data repair.
Design/methodology/approach
A set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm. The algorithm repairs inconsistent data using this set.
Findings
This new approach was evaluated through experiments on real-world as well as synthetic datasets. The repair quality was determined using the F-measure. The results showed that the quality and scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.
Originality/value
Conditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved in terms of repairing inconsistent data by using conditional functional dependencies.
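To make the two technical notions in the abstract concrete, the following is a minimal Python sketch of (a) detecting tuples that violate a single conditional functional dependency and (b) computing an F-measure for repair quality. It is not the paper's density-based algorithm. The function names, the `_` wildcard convention, the sample attributes (`cc`, `ac`, `zip`, `city`, `street`), and the precision/recall definitions are all assumptions made for illustration.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed convention: "_" in a pattern matches any value


def matches(value, pattern_value):
    """True if a cell value satisfies one entry of the pattern tuple."""
    return pattern_value == WILDCARD or value == pattern_value


def cfd_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows violating the CFD (lhs -> rhs, pattern).

    rows    : list of dicts (attribute name -> value)
    lhs     : list of left-hand-side attribute names
    rhs     : single right-hand-side attribute name
    pattern : dict giving a constant or WILDCARD for each lhs/rhs attribute
    """
    violating = set()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # The dependency only applies to rows matching the LHS pattern.
        if not all(matches(row[a], pattern[a]) for a in lhs):
            continue
        if pattern[rhs] != WILDCARD:
            # Constant RHS: a single row can violate on its own.
            if row[rhs] != pattern[rhs]:
                violating.add(i)
        else:
            # Variable RHS: rows agreeing on the LHS must agree on the RHS.
            groups[tuple(row[a] for a in lhs)].append(i)
    for members in groups.values():
        if len({rows[i][rhs] for i in members}) > 1:
            violating.update(members)
    return sorted(violating)


def f_measure(correct_repairs, total_repairs, total_errors):
    """Harmonic mean of precision and recall (assumed repair-quality metric).

    precision = correct_repairs / total_repairs
    recall    = correct_repairs / total_errors
    """
    if correct_repairs == 0 or total_repairs == 0 or total_errors == 0:
        return 0.0
    precision = correct_repairs / total_repairs
    recall = correct_repairs / total_errors
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample data, invented for this sketch.
rows = [
    {"cc": "44", "ac": "131", "zip": "EH4", "city": "EDI", "street": "Mayfield"},
    {"cc": "44", "ac": "131", "zip": "EH4", "city": "LDN", "street": "Crichton"},
    {"cc": "01", "ac": "908", "zip": "07974", "city": "MH", "street": "Mtn Ave"},
]

# Constant CFD: country code 44 and area code 131 imply city EDI.
constant_cfd = cfd_violations(
    rows, ["cc", "ac"], "city", {"cc": "44", "ac": "131", "city": "EDI"}
)

# Variable CFD: within country code 44, zip determines street.
variable_cfd = cfd_violations(
    rows, ["cc", "zip"], "street", {"cc": "44", "zip": "_", "street": "_"}
)
```

Here `constant_cfd` flags only the second row, whose city contradicts the constant in the pattern, while `variable_cfd` flags the first two rows, which share a zip code but disagree on the street. A repair algorithm would then decide which cell values to change; that step is outside this sketch.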
Subject
Library and Information Sciences, Information Systems
Cited by
3 articles.
1. Product Automatic Design Process Detection System based on Genetic Algorithm;2023 International Conference on Integrated Intelligence and Communication Systems (ICIICS);2023-11-24
2. Knowledge Expansion Algorithm of Heterogeneous Network Big Data Based on Improved K-means Algorithm;2022 International Conference on Knowledge Engineering and Communication Systems (ICKES);2022-12-28
3. AI-Based Heterogenous Large-Scale English Translation Strategy;Mobile Information Systems;2022-02-09