In this paper, we address the problem of Language Identification (LID) of user-generated content in Social Media Communication (SMC). Existing LID solutions are very accurate on standard languages and well-formed texts. However, for non-standard language use, as found in SMC, this level of accuracy is still out of reach. To help resolve this problem, we present a language-independent LID solution for non-standard use of language, in which we combine linguistic tools (morphological analyzers) and statistical models (language models) in a hybrid approach to identify the standard and non-standard languages present in SMC texts. Our solution also handles the code-switching phenomenon between standard languages and dialects, the normalization of SMC-specific expressions and dialect, and finally the spelling correction of out-of-vocabulary (OOV) words.
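The hybrid idea described above can be illustrated with a minimal word-level sketch. It is not the authors' implementation: the abstract does not name the tools used, so a plain lexicon lookup stands in for a morphological analyzer, and an add-one-smoothed character-bigram frequency model stands in for a language model. The names `LEXICONS`, `MODELS`, `CharBigramModel`, `identify_word`, and `identify_text`, as well as the toy English and French word lists, are hypothetical.

```python
import math
from collections import Counter


def char_bigrams(word):
    """Return the character bigrams of a word, with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class CharBigramModel:
    """Add-one-smoothed character-bigram frequency model (stand-in for a language model)."""

    def __init__(self, words):
        self.counts = Counter(g for w in words for g in char_bigrams(w))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 reserves mass for unseen bigrams

    def log_prob(self, word):
        return sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in char_bigrams(word)
        )


# Toy lexicons (assumption): a real system would query a morphological analyzer.
LEXICONS = {
    "en": {"the", "weather", "is", "nice", "today"},
    "fr": {"le", "temps", "est", "beau", "maintenant"},
}
MODELS = {lang: CharBigramModel(words) for lang, words in LEXICONS.items()}


def identify_word(word, lexicons=LEXICONS, models=MODELS):
    """Use the lexicon when it is decisive, otherwise fall back to the statistical model."""
    hits = [lang for lang, lex in lexicons.items() if word.lower() in lex]
    if len(hits) == 1:
        return hits[0]
    # Ambiguous or out-of-vocabulary: score with the character model.
    candidates = hits or list(models)
    return max(candidates, key=lambda lang: models[lang].log_prob(word))


def identify_text(text, lexicons=LEXICONS, models=MODELS):
    """Tag each whitespace-separated token, so code-switched text gets mixed labels."""
    return [(tok, identify_word(tok, lexicons, models)) for tok in text.split()]


if __name__ == "__main__":
    print(identify_text("the temps is beau"))
```

Tagging at the word level, rather than assigning one label per message, is what lets a sketch like this surface code switching; normalization and spelling correction of OOV words would be separate steps that this example does not attempt.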
CITATION STYLE
Zarnoufi, R., Jaafar, H., & Abik, M. (2019). Language identification for user generated content in social media. In Smart Innovation, Systems and Technologies (Vol. 111, pp. 672–678). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-03577-8_73