Affiliation:
1. Old Dominion University, Norfolk, USA
Abstract
It has long been suspected that web archives and search engines favor Western and English language webpages. In this article, we quantitatively explore how well indexed and archived Arabic language webpages are as compared to those from other languages. We began by sampling 15,092 unique URIs from three different website directories: DMOZ (multilingual), Raddadi, and Star28 (the last two primarily Arabic language). Using language identification tools, we eliminated pages not in the Arabic language (e.g., English-language versions of Aljazeera pages) and culled the collection to 7,976 Arabic language webpages. We then used these 7,976 pages and crawled the live web and web archives to produce a collection of 300,646 Arabic language pages. We compared the analysis of Arabic language pages with that of English, Danish, and Korean language pages. First, for each language, we sampled unique URIs from DMOZ; then, using language identification tools, we kept only pages in the desired language. Finally, we crawled the archived and live web to collect a larger sample of pages in English, Danish, or Korean. In total for the four languages, we analyzed over 500,000 webpages. We discovered: (1) English has a higher archiving rate than Arabic, with 72.04% archived. However, Arabic has a higher archiving rate than Danish and Korean, with 53.36% of Arabic URIs archived, followed by Danish and Korean with 35.89% and 32.81% archived, respectively. (2) Most Arabic and English language pages are located in the United States; only 14.84% of the Arabic URIs had an Arabic country code top-level domain (e.g., sa) and only 10.53% had a GeoIP in an Arabic country. Most Danish-language pages were located in Denmark, and most Korean-language pages were located in South Korea. (3) The presence of a webpage in a directory positively impacts indexing and presence in the DMOZ directory, specifically, positively impacts archiving in all four languages. In this work, we show that web archives and search engines favor English pages. However, it is not universally true for all Western-language webpages because, in this work, we show that Arabic webpages have a higher archival rate than Danish language webpages.
Funder
University of Hail, Hail, Saudi Arabia
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications,General Business, Management and Accounting,Information Systems
Reference55 articles.
1. Central Intelligence Agency (ed.). 2015. The World Factbook 2014-15. Government Printing Office Washington DC. Central Intelligence Agency (ed.). 2015. The World Factbook 2014-15. Government Printing Office Washington DC.
2. How much of the web is archived?
3. MemGator - A Portable Concurrent Memento Aggregator
4. Estimating the size of Arabic indexed web content;Alarifi Abdulrahman;Scientific Research and Essays,2012
Cited by
4 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献