Why web sites are lost (and how they're sometimes found)

Authors:

Frank McCown¹, Catherine C. Marshall², Michael L. Nelson³

Affiliations:

1. Harding University, Searcy, AR

2. Microsoft Research, Silicon Valley

3. Old Dominion University

Abstract

The web is in constant flux: new pages and Web sites appear daily, and old pages and sites disappear almost as quickly. One study estimates that about two percent of the Web disappears from its current location every week [2]. Although Web users have become accustomed to seeing the infamous "404 Not Found" page, they are more taken aback when they own, are responsible for, or have come to rely on the missing material. Web archivists like those at the Internet Archive have responded to the Web's transience by archiving as much of it as possible, hoping to preserve snapshots of the Web for future generations [3]. Search engines have also responded by offering pages that have been cached as a result of the indexing process. These straightforward archiving and caching efforts have been used by the public in unintended ways: individuals and organizations have used them to restore their own lost Web sites [5].

To automate recovering lost Web sites, we created a Web-repository crawler named Warrick that restores lost resources from the holdings of four Web repositories: Internet Archive, Google, Live Search (now Bing), and Yahoo [6]; we refer to these Web repositories collectively as the Web Infrastructure (WI). We call this after-loss recovery Lazy Preservation (see the sidebar for more information). Warrick can only recover what is accessible to the WI, namely the crawlable Web. Numerous resources cannot be found in the WI: password-protected content, pages without incoming links or protected by the robots exclusion protocol, and content hidden behind Flash or JavaScript interfaces. Most importantly, WI crawlers do not have access to the server-side components (for example, scripts, configuration files, and databases) of a Web site.

Nevertheless, upon Warrick's public release in 2005, we received many inquiries about its usage and collected a handful of anecdotes about the Web sites individuals and organizations had lost and wanted to recover. Were these Web sites representative? What types of Web resources were people losing? Given the inherent limitations of the WI, were Warrick users recovering enough material to reconstruct the site? Were these losses changing their behavior, or was the availability of cached material reinforcing a "lazy" approach to preservation?

We constructed an online survey to explore these questions and conducted a set of in-depth interviews with survey respondents to clarify the results. Potential participants were solicited by us or the Internet Archive, or they found a link to the survey from the Warrick Web site. A total of 52 participants completed the survey regarding 55 lost Web sites, and seven of the participants allowed us to follow up with telephone or instant messaging interviews. Participants were divided into two groups:

1. Personal loss: those who had lost (and tried to recover) a Web site that they had personally created, maintained, or owned (34 participants who lost 37 Web sites).

2. Third party: those who had recovered someone else's lost Web site (18 participants who recovered 18 Web sites).
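
As an illustration of the kind of Web-repository lookup described above, the following Python sketch queries a single member of the WI, the Internet Archive, through its public CDX API: it lists archived captures of a page and fetches one of them. This is a sketch only, not Warrick's actual implementation; the target URL and function names are hypothetical.

    # A minimal sketch of a Web-repository lookup against one member of
    # the WI: the Internet Archive's public CDX API. Illustrative only;
    # the target URL below is hypothetical.
    import json
    import urllib.parse
    import urllib.request

    CDX_API = "https://web.archive.org/cdx/search/cdx"

    def list_snapshots(url, limit=5):
        """Return (timestamp, original_url) pairs for archived captures of `url`."""
        query = urllib.parse.urlencode({"url": url, "output": "json", "limit": limit})
        with urllib.request.urlopen(f"{CDX_API}?{query}") as resp:
            rows = json.load(resp)
        if not rows:
            return []
        header, *captures = rows  # the first row names the CDX fields
        ts, orig = header.index("timestamp"), header.index("original")
        return [(row[ts], row[orig]) for row in captures]

    def fetch_snapshot(timestamp, original_url):
        """Fetch one archived copy through the Wayback Machine replay URL."""
        replay = f"https://web.archive.org/web/{timestamp}/{original_url}"
        with urllib.request.urlopen(replay) as resp:
            return resp.read()

    # List a few captures of a (hypothetical) lost page and recover the first one.
    snapshots = list_snapshots("example.com/index.html")
    if snapshots:
        page = fetch_snapshot(*snapshots[0])
        print(f"Recovered {len(page)} bytes from capture {snapshots[0][0]}")

A full reconstruction tool like Warrick repeats such lookups across all four repositories and for every URL of the lost site.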

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

References (11 articles)

1. Cox, L. P., Murray, C. D., and Noble, B. D. Pastiche: Making backup cheap and easy. SIGOPS Operating Systems Review 36, SI (2002), 285--298. DOI: 10.1145/844128.844155.

2. A large-scale study of the evolution of web pages

3. Preserving the Internet

Cited by 13 articles.

1. Got 404s? Crawling and Analyzing an Institution's Web Domain. Linking Theory and Practice of Digital Libraries, 2022.

2. Decolonizing Tactics as Collective Resilience: Identity Work of AAPI Communities on Reddit. Proceedings of the ACM on Human-Computer Interaction, May 2020.

3. On Identifying the Bounds of an Internet Resource. Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval, March 2016.

4. The impact of JavaScript on archivability. International Journal on Digital Libraries, January 2015.

5. Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Legal Information Management, June 2014.
