Clustering Heterogeneous Data Values for Data Quality Analysis

Published:2023-08-22 Issue:3 Volume:15 Page:1-33
ISSN:1936-1955
Container-title:Journal of Data and Information Quality
language:en
Short-container-title:J. Data and Information Quality

Author:

Wenz Viola¹, Kesper Arno¹, Taentzer Gabriele¹

Affiliation:

1. Philipps-Universität Marburg, Germany

Abstract

Data is of high quality if it is fit for its intended purpose. Data heterogeneity can be a major quality problem, as quality aspects such as understandability and consistency can be compromised. Heterogeneity of data values is particularly common when data is manually entered by different people using inadequate control rules. In this case, syntactic and semantic heterogeneity often go hand in hand. Heterogeneity of data values may be a direct result of problems in the acquisition process, quality problems of the underlying data model, or possibly erroneous data transformations. For example, in the cultural heritage domain, it is common to analyze data fields by manually searching lists of data values sorted alphabetically or by number of occurrences. Additionally, search functions such as regular expression matching are used to detect specific patterns. However, this requires a priori knowledge and technical skills that domain experts often do not have. Since such datasets often contain thousands of values, the entire process is very time-consuming. Outliers or subtle differences between values that may be critical to data quality can be easily overlooked. To improve this process of analyzing the quality of data values, we propose a bottom-up human-in-the-loop approach that clusters values of a data field according to syntactic similarity. The clustering is intended to help domain experts explore the heterogeneity of values in a data field and can be configured by domain experts according to their domain knowledge. The overview of the syntactic diversity of the data values gives an impression of the rules and practices of data acquisition as well as their violations. From this, experts can infer potential quality issues with the data acquisition process and system, as well as the data model and data transformations. We outline a proof-of-concept implementation of the approach. 
Our evaluation found that clustering adds value to data quality analysis, especially for detecting quality problems in data models.
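
The idea of grouping the values of a data field by syntactic similarity can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not specify their similarity measure. The sketch assumes a very simple notion of similarity: each value is abstracted to a coarse pattern (runs of letters become "a", runs of digits become "9", other characters are kept), and values sharing a pattern form a cluster. The function names and the sample date values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one character to a coarse syntactic class (assumed scheme)."""
    if ch.isalpha():
        return "a"   # any letter
    if ch.isdigit():
        return "9"   # any digit
    if ch.isspace():
        return " "   # any whitespace
    return ch        # punctuation and symbols are kept as-is


def syntactic_pattern(value: str) -> str:
    """Abstract a value to a pattern by collapsing runs of the same class."""
    classes = [char_class(ch) for ch in value.strip()]
    return "".join(key for key, _ in groupby(classes))


def cluster_by_pattern(values):
    """Group values by pattern, largest clusters first."""
    clusters = defaultdict(list)
    for value in values:
        clusters[syntactic_pattern(value)].append(value)
    return sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)


# Made-up values for a hypothetical "date" field.
sample_values = [
    "1850", "1862", "ca. 1850", "ca. 1900",
    "1850-1860", "um 1850", "1850?", "1901",
]

clusters = cluster_by_pattern(sample_values)
for pattern, members in clusters:
    print(f"{pattern!r}: {members}")
```

In this sketch, large clusters hint at the prevailing entry convention, while small clusters such as the single value matching "9?" point to deviations that a domain expert may want to inspect. How coarse the abstraction should be is the kind of setting an expert would adjust according to domain knowledge.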

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems and Management, Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3603710

Cited by 4 articles.

1. Integration Approaches for Heterogeneous Big Data: A Survey;Cybernetics and Information Technologies;2024-03-01

2. AI-Powered Data Governance: A Cutting-Edge Method for Ensuring Data Quality for Machine Learning Applications;2024 Second International Conference on Emerging Trends in Information Technology and Engineering (ICETITE);2024-02-22

3. Putting Sense into Incomplete Heterogeneous Data with Hypergraph Clustering Analysis;Lecture Notes in Computer Science;2024

4. Current Challenges of Big Data Quality Management in Big Data Governance: A Literature Review;Lecture Notes on Data Engineering and Communications Technologies;2024