All WARC and no playback: The materialities of data-centered web archives research

Author:

Emily Maemura (1)

Affiliation:

1. School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, USA

Abstract

This paper examines the Web ARChive (WARC) file format, revealing how the format has come to play a central role in the development and standardization of interoperable tools and methods for the international web archiving community. In the context of emerging big data approaches, I consider the sociotechnical relationships between material construction of data and information infrastructures for collecting and research. Analysis is inspired by Star and Griesemer's historical case of the Museum of Vertebrate Zoology which reveals how boundary objects and methods standardization are used to enroll actors in the work of collecting for natural history. I extend these concepts by pairing them with frameworks for studying digital materiality and the representational qualities of data artifacts. Through examples drawn from fieldwork observations studying two data-centered research projects, I consider how the materiality of the WARC format influences research methods and approaches to data extraction, selection, and transformation. Findings identify three modalities researchers use to configure WARC data for researcher needs: using indexes to support search queries, constructing derivative formats designed for certain types of analysis, and generating custom-designed datasets tailored for specific research purposes. Findings additionally reveal similarities in how these distinct methods approach automated data extraction by relying upon the WARC's standardized metadata elements. By interrogating whose information needs are being met and taken into account in the design of the WARC's underlying information representation, I reveal effects on the emerging field of web history, and consider alternative approaches to knowledge production with archived web data.

Funder

Social Sciences and Humanities Research Council of Canada

Publisher

SAGE Publications

Subject

Library and Information Sciences, Information Systems and Management, Computer Science Applications, Communication, Information Systems

Cited by 3 articles.