Language identification for user generated content in social media

Randa Zarnoufi; Hamid Jaafar; Mounia Abik

Conference Proceedings

Smart Innovation, Systems and Technologies (2019) 111 672-678

DOI: 10.1007/978-3-030-03577-8_73

Abstract

In this paper we address the problem of Language Identification (LID) of user-generated content in Social Media Communication (SMC). Existing LID solutions are very accurate on standard languages and conventional texts, but comparable accuracy has not yet been reached for non-standard language use such as SMC. To help resolve this problem, we present a language-independent LID solution for non-standard language use that combines linguistic tools (morphological analyzers) and statistical models (language models) in a hybrid approach to identify the standard and non-standard languages present in SMC texts. Our solution also handles code switching between standard languages and dialects, the normalization of SMC-specific expressions and dialect, and the spelling correction of out-of-vocabulary (OOV) words.
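As a rough illustration of the hybrid idea described in the abstract, the following minimal sketch tags each token by first consulting a per-language word list and, when that is inconclusive, falling back to a character-bigram language model. It is not the authors' implementation: the tiny `LEXICONS` word lists merely stand in for real morphological analyzers, and all names (`identify_token`, `tag_text`, `code_switch_points`) and the add-one smoothing are assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical toy word lists standing in for real morphological analyzers.
LEXICONS = {
    "en": {"the", "weather", "is", "nice", "today", "very", "good"},
    "fr": {"le", "temps", "est", "beau", "aujourd'hui", "très", "bon"},
}

def char_bigrams(word):
    """Return the character bigrams of a word, with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train_char_lm(words):
    """Count character bigrams over a word list (a very small language model)."""
    counts = Counter(bg for w in words for bg in char_bigrams(w))
    return counts, sum(counts.values())

MODELS = {lang: train_char_lm(words) for lang, words in LEXICONS.items()}
# Number of distinct bigrams seen in any language, plus one slot for unseen ones.
VOCAB = len({bg for counts, _ in MODELS.values() for bg in counts}) + 1

def lm_score(word, lang):
    """Log-probability of a word under a language's bigram model (add-one smoothing)."""
    counts, total = MODELS[lang]
    return sum(math.log((counts[bg] + 1) / (total + VOCAB)) for bg in char_bigrams(word))

def identify_token(token):
    """Return (language, evidence source) for a single token."""
    token = token.lower()
    hits = [lang for lang, lexicon in LEXICONS.items() if token in lexicon]
    if len(hits) == 1:
        return hits[0], "lexicon"           # unambiguous dictionary evidence
    candidates = hits or list(LEXICONS)     # ambiguous or out-of-vocabulary token
    best = max(candidates, key=lambda lang: lm_score(token, lang))
    return best, "language-model"

def tag_text(text):
    """Tag every whitespace-separated token as (token, language, source)."""
    return [(tok, *identify_token(tok)) for tok in text.split()]

def code_switch_points(tags):
    """Indices where the language differs from that of the previous token."""
    return [i for i in range(1, len(tags)) if tags[i][1] != tags[i - 1][1]]
```

For example, `code_switch_points(tag_text("the weather est beau"))` flags the position where the toy tagger switches from English to French. A realistic system would replace the word lists with full morphological analysis and train the language models on large corpora, and would add the normalization and spelling-correction steps that the abstract mentions but this sketch omits.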

Citation (APA)

Zarnoufi, R., Jaafar, H., & Abik, M. (2019). Language identification for user generated content in social media. In Smart Innovation, Systems and Technologies (Vol. 111, pp. 672–678). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-03577-8_73
