Automatic Detection and Language Identification of Multilingual Documents

  • Lui M
  • Lau J
  • Baldwin T

Abstract

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions. We demonstrate the effectiveness of our method over synthetic data, as well as real-world multilingual documents collected from the web.
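To make the three outputs named in the abstract concrete (a multilingual flag, the set of languages, and their relative proportions), here is a toy sketch. It is not the authors' method, which the abstract does not describe. It simply labels fixed-size token windows by stopword overlap and aggregates the labels. The stopword lists, the window size, the threshold, and all function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword lists (illustration only).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "fr": {"le", "la", "et", "est", "les", "des", "un", "une"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ein", "eine"},
}


def guess_language(tokens):
    """Return the language whose stopwords match the most tokens, or None."""
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def language_proportions(text, window=5):
    """Label fixed-size token windows; return each language's share of labeled tokens."""
    tokens = text.lower().split()
    counts = Counter()
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        lang = guess_language(chunk)
        if lang is not None:  # windows with no stopword evidence are skipped
            counts[lang] += len(chunk)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


def is_multilingual(proportions, threshold=0.1):
    """Flag a document when more than one language reaches the (assumed) threshold."""
    return sum(p >= threshold for p in proportions.values()) > 1


if __name__ == "__main__":
    doc = ("the cat is in the house and it is big "
           "le chat est dans la maison et les enfants des voisins")
    props = language_proportions(doc)
    print(props, is_multilingual(props))
```

A windowed classifier like this assumes each window is monolingual and cannot separate languages mixed within a window, so it should be read only as an illustration of the task's inputs and outputs.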

Cite

Lui, M., Lau, J. H., & Baldwin, T. (2014). Automatic Detection and Language Identification of Multilingual Documents. Transactions of the Association for Computational Linguistics, 2, 27–40. https://doi.org/10.1162/tacl_a_00163
