Automatic Detection and Language Identification of Multilingual Documents

Marco Lui; Jey Han Lau; Timothy Baldwin

Journal Article, Open Access

Transactions of the Association for Computational Linguistics (2014), vol. 2, pp. 27–40

DOI: 10.1162/tacl_a_00163

Abstract

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions. We demonstrate the effectiveness of our method over synthetic data, as well as real-world multilingual documents collected from the web.

Citation (APA)

Lui, M., Lau, J. H., & Baldwin, T. (2014). Automatic Detection and Language Identification of Multilingual Documents. Transactions of the Association for Computational Linguistics, 2, 27–40. https://doi.org/10.1162/tacl_a_00163
