Mining multilingual and multiscript Twitter data: Unleashing the language and script barrier

Abstract

Micro-blogging sites like Twitter have become opinion hubs where views on diverse topics are expressed. Interpreting, comprehending and analysing this emotion-rich information can yield many valuable insights. The task is relatively straightforward when the tweets are in English. Lately, however, the growing use of native languages for communication has posed a major challenge for social media mining. Matters become more complicated when people use the Roman script to write non-English languages. India, a country with a diverse collection of scripts and languages, faces this problem acutely. We have developed a system that automatically identifies and classifies native-language tweets, irrespective of the script used. By converting all tweets to English, we eliminate the 'script vs language' problem. The approach we formulated consists of script identification, language analysis, and clustered mining. Considering English and the top two Indian languages, we found that the proposed framework gives better precision than the prevailing approaches.
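The script identification step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `identify_script` and the majority-vote rule are assumptions made for illustration, and the sample strings assume Hindi (Devanagari) and Bengali, which the abstract does not name. The sketch relies only on Python's standard `unicodedata` module, taking the first word of each letter's Unicode character name as a rough proxy for its script.

```python
import unicodedata
from collections import Counter


def identify_script(text: str) -> str:
    """Return the dominant script of `text` (hypothetical helper).

    The first word of a letter's Unicode name (e.g. "LATIN",
    "DEVANAGARI", "BENGALI") is used as an approximate script label.
    """
    counts = Counter()
    for ch in text:
        # Only letters vote; digits, punctuation and combining marks are skipped.
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    if not counts:
        return "UNKNOWN"
    # Majority vote over the letters seen (an assumed rule, not from the paper).
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ["good morning", "नमस्ते", "আমি", "aap kaise hain"]:
        print(sample, "->", identify_script(sample))
```

A Romanised tweet such as "aap kaise hain" is labelled as Latin script here, which is exactly the 'script vs language' ambiguity the abstract describes: script detection alone cannot tell it apart from English, so a separate language analysis step is still needed.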

Sarkar, B., Sinhababu, N., Roy, M., Pramanik, P. K. D., & Choudhury, P. (2020). Mining multilingual and multiscript Twitter data: Unleashing the language and script barrier. International Journal of Business Intelligence and Data Mining, 16(1), 1–19. https://doi.org/10.1504/IJBIDM.2020.103847
