Shapes and frictions of synthetic data

Author:

Dietmar Offenhuber (1)

Affiliation:

1. Art+Design, Northeastern University, Boston, MA, USA

Abstract

Synthetic data are computer-generated data that mimic and substitute empirical observations without directly corresponding to real-world phenomena. Widely used in privacy protection, machine learning, and simulation, synthetic data are an emerging topic only just beginning to be explored in the social sciences and critical data studies. However, recent developments, such as the use of synthetic data in the US Census and American Community Survey, make a reflection on the nature and implications of synthetic data urgent. While earlier work focused mostly on training data for machine-learning models, this paper presents a broad typology of synthetic data and discusses their frictions. The main argument presented is that the traditional representational model of data as symbolic references to corresponding physical or conceptual objects is insufficient for understanding and critically engaging with issues and implications of synthetic data. The paper discusses an alternative relational model, which defines data not through an object of reference but based on “who uses them, how and for which purposes”. The relational model is more productive for capturing the fact that synthetic data are defined through their purpose; their performance in a particular situation (such as training a machine learning model); and a context-dependent operationalization of evidence. The post-representational anything-goes epistemology of synthetic data can be productively challenged through a forensic approach that foregrounds the outliers, artifacts, and gaps in datasets as meaningful information.

Publisher

SAGE Publications

