Mining multilingual and multiscript Twitter data: Unleashing the language and script barrier

Bidhan Sarkar; Nilanjan Sinhababu; Manob Roy; Pijush Kanti Dutta Pramanik; Prasenjit Choudhury

Journal Article (Open Access)

International Journal of Business Intelligence and Data Mining (2020) 16(1) 1-19

DOI: 10.1504/IJBIDM.2020.103847

14 Citations

15 Readers

Abstract

Micro-blogging sites like Twitter have become opinion hubs where views on diverse topics are expressed. Interpreting, comprehending and analysing this emotion-rich information can yield valuable insights. The task is comparatively simple when the tweets are in English. Lately, however, the growing use of native languages for communication has posed a major challenge for social media mining. Matters become more complicated when people write non-English languages in Roman script. India, a country with a diverse collection of scripts and languages, faces this problem acutely. We have developed a system that automatically identifies and classifies native-language tweets, irrespective of the script used. By converting all tweets to English, we eliminate the 'script vs language' problem. The proposed approach consists of script identification, language analysis, and clustered mining. Considering English and the top two Indian languages, we found that the proposed framework gives better precision than the prevailing approaches.
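To illustrate the script identification step named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that the dominant script of a tweet can be estimated from Unicode character names; the function name `dominant_script` and the choice of Devanagari and Bengali as example scripts are illustrative assumptions, since the abstract does not name the languages studied.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters in `text`.

    Hypothetical helper: for the scripts used here, the first word of a
    character's Unicode name (e.g. "DEVANAGARI LETTER NA") is its script.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


print(dominant_script("namaste"))   # Roman-script text
print(dominant_script("नमस्ते"))      # Devanagari-script text
print(dominant_script("নমস্কার"))    # Bengali-script text
```

A romanised Hindi or Bengali tweet is reported as Latin script by this sketch, which is the 'script vs language' problem the abstract describes: the script alone does not determine the language, so a separate language analysis step is still required.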

Citation (APA)

Sarkar, B., Sinhababu, N., Roy, M., Pramanik, P. K. D., & Choudhury, P. (2020). Mining multilingual and multiscript Twitter data: Unleashing the language and script barrier. International Journal of Business Intelligence and Data Mining, 16(1), 1–19. https://doi.org/10.1504/IJBIDM.2020.103847
