On the Characteristics of Language Tags on the Web

Joel Sommers

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2018) 10771 LNCS 18-30

DOI: 10.1007/978-3-319-76481-8_2

Abstract

The Internet is a global phenomenon. To support broad use of Internet applications such as the World Wide Web, character encodings have been developed for many scripts of the world’s languages, and there are standard mechanisms for indicating that content is in a particular language and/or tailored to a particular region. In this paper we study the empirical characteristics of language tags used in HTTP transactions and in web pages to indicate the language of the content and possibly the script, region, and other information. To support our analysis, we develop a new algorithm to infer the value of a missing language tag for elements used to link to alternative language content. We analyze the top-level page for websites in the Alexa Top 1 Million from six geographic vantage points. We find that one third of all pages do not include any language tags, that half of the remaining sites are tagged with English (en), and that about 10K sites have malformed tags. We observe that 80K sites are multilingual, and that hundreds of sites offer content in tens of languages. Besides malformed tags, we find numerous instances of correctly formed but likely erroneous language tags by using a Naïve Bayes-based language detection library and comparing its output with a given page’s language tag(s). Lastly, we comment on differences in language tags observed for the same site from different geographic vantage points or when using different client language preferences via the HTTP Accept-Language header.
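As a rough illustration of the kind of measurement the abstract describes, the sketch below collects the `lang` attribute of the `html` element and the `hreflang` values of alternate links from a page, then flags tags that are not well formed. It is not the paper's method: the class and function names (`LangTagCollector`, `is_well_formed`) are hypothetical, the regular expression is a deliberately simplified approximation of BCP 47 syntax, and the paper's algorithm for inferring missing tags is not reproduced here.

```python
import re
from html.parser import HTMLParser

# Simplified well-formedness check loosely based on BCP 47 syntax:
# language[-script][-region][-variant...]. Extensions, private-use
# subtags, and grandfathered tags are deliberately not handled
# (an assumption made to keep the sketch short).
TAG_RE = re.compile(
    r"^[a-z]{2,3}"                              # primary language subtag
    r"(-[a-z]{4})?"                             # optional script subtag
    r"(-([a-z]{2}|[0-9]{3}))?"                  # optional region subtag
    r"(-([a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*$",   # optional variant subtags
    re.IGNORECASE,
)


def is_well_formed(tag: str) -> bool:
    """Return True if the tag matches the simplified pattern above."""
    return bool(TAG_RE.match(tag.strip()))


class LangTagCollector(HTMLParser):
    """Hypothetical helper: records language tags found in a page."""

    def __init__(self):
        super().__init__()
        self.page_lang = None   # value of <html lang="...">, if present
        self.alternates = []    # (hreflang, href) for rel="alternate" links

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "html" and self.page_lang is None:
            self.page_lang = attr.get("lang")
        elif tag == "link":
            rel = (attr.get("rel") or "").lower().split()
            if "alternate" in rel:
                self.alternates.append((attr.get("hreflang"), attr.get("href")))


# Example page with one malformed hreflang value (underscore instead of hyphen).
sample = (
    '<html lang="en-US"><head>'
    '<link rel="alternate" hreflang="fr" href="/fr/">'
    '<link rel="alternate" hreflang="en_GB" href="/uk/">'
    "</head></html>"
)
collector = LangTagCollector()
collector.feed(sample)
malformed = [t for t, _ in collector.alternates if t and not is_well_formed(t)]
```

A tag that passes this check can still be wrong for the page's actual content, which is why the abstract pairs syntactic checks with a language detection library.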

Citation (APA)

Sommers, J. (2018). On the Characteristics of Language Tags on the Web. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10771 LNCS, pp. 18–30). Springer Verlag. https://doi.org/10.1007/978-3-319-76481-8_2