Abstract
Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions. We demonstrate the effectiveness of our method over synthetic data, as well as real-world multilingual documents collected from the web.
Citation
Lui, M., Lau, J. H., & Baldwin, T. (2014). Automatic Detection and Language Identification of Multilingual Documents. Transactions of the Association for Computational Linguistics, 2, 27–40. https://doi.org/10.1162/tacl_a_00163