Language identification on the web: Extending the dictionary method

Radim Řehůřek; Milan Kolkus

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2009), Vol. 5449 LNCS, pp. 357-368

DOI: 10.1007/978-3-642-00382-0_29

58 Citations

43 Readers

Abstract

Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include language identification on very short texts, correctly handling texts of unknown language and texts comprised of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to allow us to efficiently and automatically segment the input text into blocks of individual languages, in case of multiple-language documents. © Springer-Verlag Berlin Heidelberg 2009.
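
As a rough illustration of the general idea of dictionary-based identification with an explicit "unknown" outcome, a minimal sketch might look like the following. It is not the authors' method: the relevance values, the `RELEVANCE` table, the `identify_language` function, and the `threshold` parameter are all hypothetical and chosen only to show how per-word scores could be averaged and compared against a cutoff.

```python
# Hypothetical toy relevance scores per language (assumed values, not from the paper).
# A real system would estimate such scores from large corpora.
RELEVANCE = {
    "en": {"the": 1.0, "and": 0.8, "is": 0.7, "of": 0.9},
    "de": {"der": 1.0, "und": 0.9, "ist": 0.8, "das": 0.9},
    "cs": {"je": 0.8, "se": 0.7, "na": 0.6, "že": 1.0},
}


def identify_language(text, models=RELEVANCE, threshold=0.5):
    """Return the best-scoring language code, or "unknown".

    The score of a language is the mean relevance of the input words
    under that language's dictionary; words missing from the dictionary
    contribute zero. The threshold is an assumed cutoff that lets the
    function reject text matching no known language.
    """
    words = text.lower().split()
    if not words:
        return "unknown"
    scores = {
        lang: sum(model.get(word, 0.0) for word in words) / len(words)
        for lang, model in models.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"


if __name__ == "__main__":
    for sample in ["the cat is of the house", "der Hund ist das", "xyzzy plugh"]:
        print(repr(sample), "->", identify_language(sample))
```

Averaging over the number of words keeps scores comparable between short and long inputs, and the cutoff gives one simple way to avoid forcing a known-language label onto unfamiliar text.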

Cite

Řehůřek, R., & Kolkus, M. (2009). Language identification on the web: Extending the dictionary method. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5449 LNCS, pp. 357–368). https://doi.org/10.1007/978-3-642-00382-0_29
