Language identification for user generated content in social media

Abstract

In this paper we address the problem of language identification (LID) for user-generated content in social media communication (SMC). Existing LID solutions are very accurate on standard languages and well-formed text, but comparable accuracy has not yet been reached for non-standard text such as SMC. To help resolve this problem, we present a language-independent LID solution for non-standard language use that combines linguistic tools (morphological analyzers) and statistical models (language models) in a hybrid approach to identify the standard and non-standard languages present in SMC texts. Our solution also handles code switching between standard languages and dialect, the normalization of SMC-specific expressions and dialect, and the spelling correction of out-of-vocabulary (OOV) words.
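
The abstract does not specify which tools or models the authors use, so the following is only an illustrative sketch of the general hybrid idea, not the paper's method. It assumes a toy word list per language as a stand-in for a morphological analyzer, and a character-trigram model with add-one smoothing as a stand-in for the language model. All names (`CharNgramModel`, `identify`, `WORDS`) and the sample vocabulary are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Character n-grams of a word, padded with start/end markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


class CharNgramModel:
    """Toy character n-gram model with add-one smoothing (an assumption,
    standing in for whatever language model a real system would use)."""

    def __init__(self, words, n=3):
        self.n = n
        self.counts = Counter(g for w in words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def log_prob(self, word):
        """Average smoothed log-probability of the word's n-grams."""
        grams = char_ngrams(word, self.n)
        denom = self.total + self.vocab
        return sum(math.log((self.counts[g] + 1) / denom) for g in grams) / len(grams)


def identify(tokens, lexicons, models):
    """Label each token with a language.

    Step 1 (linguistic): accept the language whose lexicon alone knows the token.
    Step 2 (statistical): otherwise let the n-gram models choose, restricted to
    the matching languages if the token is ambiguous, or all languages if OOV.
    """
    labels = []
    for tok in tokens:
        key = tok.lower()
        hits = [lang for lang, lex in lexicons.items() if key in lex]
        if len(hits) == 1:
            labels.append(hits[0])
        else:
            candidates = hits or list(models)
            labels.append(max(candidates, key=lambda lang: models[lang].log_prob(key)))
    return labels


# Hypothetical toy data; a real system would use full analyzers and corpora.
WORDS = {
    "en": ["the", "this", "that", "there", "house", "weather"],
    "fr": ["le", "la", "maison", "bonjour", "merci", "saison"],
}
LEXICONS = {lang: set(ws) for lang, ws in WORDS.items()}
MODELS = {lang: CharNgramModel(ws) for lang, ws in WORDS.items()}

print(identify(["the", "maison", "raison"], LEXICONS, MODELS))
```

Labeling token by token is what lets a sketch like this represent code switching inside a single message: "raison" is in neither toy lexicon, so the decision falls to the trigram models, which favor French because of the trigrams it shares with "maison" and "saison". Normalization and spelling correction, which the abstract also mentions, are not covered here.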

Cite

Zarnoufi, R., Jaafar, H., & Abik, M. (2019). Language identification for user generated content in social media. In Smart Innovation, Systems and Technologies (Vol. 111, pp. 672–678). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-03577-8_73
