Abstract
It has long been suspected that web archives and search engines favor Western and English language webpages. In this article, we quantitatively explore how well indexed and archived Arabic language webpages are as compared to those from other languages. We began by sampling 15,092 unique URIs from three different website directories: DMOZ (multilingual), Raddadi, and Star28 (the last two primarily Arabic language). Using language identification tools, we eliminated pages not in the Arabic language (e.g., English-language versions of Aljazeera pages) and culled the collection to 7,976 Arabic language webpages. We then used these 7,976 pages and crawled the live web and web archives to produce a collection of 300,646 Arabic language pages. We compared the analysis of Arabic language pages with that of English, Danish, and Korean language pages. First, for each language, we sampled unique URIs from DMOZ; then, using language identification tools, we kept only pages in the desired language. Finally, we crawled the archived and live web to collect a larger sample of pages in English, Danish, or Korean. In total for the four languages, we analyzed over 500,000 webpages. We discovered: (1) English has a higher archiving rate than Arabic, with 72.04% archived. However, Arabic has a higher archiving rate than Danish and Korean, with 53.36% of Arabic URIs archived, followed by Danish and Korean with 35.89% and 32.81% archived, respectively. (2) Most Arabic and English language pages are located in the United States; only 14.84% of the Arabic URIs had an Arabic country code top-level domain (e.g., .sa) and only 10.53% had a GeoIP in an Arabic country. Most Danish-language pages were located in Denmark, and most Korean-language pages were located in South Korea. (3) The presence of a webpage in a directory positively impacts indexing, and presence in the DMOZ directory, specifically, positively impacts archiving in all four languages.
In this work, we show that web archives and search engines favor English pages. However, this is not universally true for all Western-language webpages: we also show that Arabic webpages have a higher archival rate than Danish language webpages.
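As a rough illustration of the country code top-level domain measurement in finding (2), the following minimal sketch checks whether a URI's hostname ends in an Arabic ccTLD and reports the share of a sample that does. The ccTLD set is an illustrative subset chosen here, the helper names `has_arabic_cctld` and `percent` are hypothetical, and the sample URIs are placeholders; none of these come from the study itself.

```python
from urllib.parse import urlsplit

# Assumption: an illustrative subset of ccTLDs for Arabic-speaking countries.
# The abstract does not give the list the study actually used.
ARABIC_CCTLDS = {
    "sa", "eg", "ae", "jo", "lb", "sy", "iq", "kw", "qa",
    "bh", "om", "ye", "ps", "ly", "tn", "dz", "ma", "sd",
}


def has_arabic_cctld(uri: str) -> bool:
    """Return True if the URI's hostname ends in one of the listed ccTLDs."""
    host = urlsplit(uri).hostname or ""
    # Take the last dot-separated label of the hostname as the TLD.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld in ARABIC_CCTLDS


def percent(part: int, total: int) -> float:
    """Percentage of part in total, rounded to two decimals (0.0 if total is 0)."""
    return round(100.0 * part / total, 2) if total else 0.0


# Placeholder URIs for demonstration only, not data from the study.
sample_uris = [
    "http://www.example.sa/",
    "http://example.com/ar/",
    "https://news.example.eg/page",
    "http://example.org/",
]

matches = sum(has_arabic_cctld(u) for u in sample_uris)
print(f"{percent(matches, len(sample_uris))}% of sample URIs use an Arabic ccTLD")
```

The same `percent` helper could express an archiving rate, given a count of URIs for which at least one archived copy was found; how the study determined that count is not described in the abstract.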
Cite
Alkwai, L. M., Nelson, M. L., & Weigle, M. C. (2017). Comparing the archival rate of Arabic, English, Danish, and Korean language web pages. ACM Transactions on Information Systems, 36(1). https://doi.org/10.1145/3041656