Language identification on the web: Extending the dictionary method

Abstract

Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include language identification on very short texts, correctly handling texts of unknown language and texts composed of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to allow us to efficiently and automatically segment the input text into blocks of individual languages, in case of multiple-language documents. © Springer-Verlag Berlin Heidelberg 2009.
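
For readers unfamiliar with the character n-gram profile baseline that the abstract refers to, the following is a minimal sketch of that general idea using rank-ordered profiles and an "out-of-place" distance. It is not the word-relevance method proposed in the paper, whose details the abstract does not give. The function names, the profile size of 300, the use of 1- to 3-grams, and the toy training sentences are all assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams (n = 1..n_max) to its rank."""
    text = " ".join(text.lower().split())  # lowercase and collapse whitespace
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_k])}


def out_of_place(doc_profile, lang_profile, penalty):
    """Sum of rank differences; n-grams missing from the language profile cost `penalty`."""
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += penalty
    return total


def identify(text, lang_profiles, top_k=300):
    """Return the label of the language profile closest to the text's profile."""
    doc = ngram_profile(text, top_k=top_k)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang], top_k))


# Toy training data (hypothetical; real profiles need far more text per language).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die "
          "katze sitzt auf der matte mit den anderen tieren",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
guess = identify("the dog and the cat", profiles)
print(guess)
```

Note that `identify` always returns one of the known labels. A classifier of this shape therefore cannot report "unknown language" and assigns a single label to a mixed-language text, which are two of the limitations the abstract names.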

Cite

Řehůřek, R., & Kolkus, M. (2009). Language identification on the web: Extending the dictionary method. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5449 LNCS, pp. 357–368). https://doi.org/10.1007/978-3-642-00382-0_29
