A Survey of Data Quality Requirements That Matter in ML Development Pipelines

Published:2023-06-22 Issue:2 Volume:15 Page:1-39
ISSN:1936-1955
Container-title:Journal of Data and Information Quality
language:en
Short-container-title:J. Data and Information Quality

Author:

Priestley Maria¹, O’Donnell Fionntán², Simperl Elena¹

Affiliation:

1. King’s College London

2. Open Data Institute

Abstract

The fitness of the systems in which Machine Learning (ML) is used depends greatly on good-quality data. Specifications on what makes a good-quality dataset have traditionally been defined by the needs of the data users—typically analysts and engineers. Our article critically examines the extent to which established data quality frameworks are applicable to contemporary use cases in ML. Using a review of recent literature at the intersection of ML, data management, and human-computer interaction, we find that the classical “fitness-for-use” view of data quality can benefit from a more stage-specific approach that is sensitive to where in the ML lifecycle the data are encountered. This helps practitioners to plan their data quality tasks in a manner that meets the needs of the stakeholders who will encounter the dataset, whether it be data subjects, software developers or organisations. We therefore propose a new treatment of traditional data quality criteria by structuring them according to two dimensions: (1) the stage of the ML lifecycle where the use case occurs vs. (2) the main categories of data quality that can be pursued (intrinsic, contextual, representational and accessibility). To illustrate how this works in practice, we contribute a temporal mapping of the various data quality requirements that are important at different stages of the ML data pipeline. We also share some implications for data practitioners and organisations that wish to enhance their data management routines in preparation for ML.

Funder

European Union’s Horizon 2020 research and innovation programme under the projects EUHubs4Data and MediaFutures

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems and Management, Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3592616
