From Human to Data to Dataset: Mapping the Traceability of Human Subjects in Computer Vision Datasets

Author:

Morgan Klaus Scheuerman (1), Katy Weathington (1), Tarun Mugunthan (2), Emily Denton (3), Casey Fiesler (1)

Affiliation:

1. University of Colorado Boulder, Boulder, CO, USA

2. University of California, Berkeley, Berkeley, CA, USA

3. Google, New York, NY, USA

Abstract

Computer vision is a "data hungry" field. Researchers and practitioners who work on human-centric computer vision, like facial recognition, emphasize the necessity of vast amounts of data for more robust and accurate models. Humans are seen as a data resource that can be converted into datasets. The necessity of data has led to a proliferation of gathering data from easily available sources, including "public" data from the web. Yet the use of public data has significant ethical implications for the human subjects in datasets. We bridge academic conversations on the ethics of using publicly obtained data with concerns about privacy and agency associated with computer vision applications. Specifically, we examine how practices of dataset construction from public data (not only from websites, but also from public settings and public records) make it extremely difficult for human subjects to trace their images as they are collected, converted into datasets, distributed for use, and, in some cases, retracted. We discuss two interconnected barriers that current data practices present to providing an ethics of traceability for human subjects: awareness and control. We conclude with key intervention points for enabling traceability for data subjects. We also offer suggestions for an improved ethics of traceability to enable both awareness and control for individual subjects in dataset curation practices.

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Human-Computer Interaction, Social Sciences (miscellaneous)

