Micro-blogging sites like Twitter have become opinion hubs where views on diverse topics are expressed. Interpreting, comprehending and analysing this emotion-rich information can unearth many valuable insights. The task is relatively straightforward when the tweets are in English. Lately, however, the increasing use of native languages for communication has posed a great challenge for social media mining. Matters become more complicated when people use the Roman script to write non-English languages. India, a country with a diverse collection of scripts and languages, is affected by this problem particularly severely. We have developed a system that automatically identifies and classifies native-language tweets, irrespective of the script used. By converting all tweets to English, we eliminate the 'script vs language' problem. The proposed approach consists of three stages: script identification, language analysis, and clustered mining. Considering English and the top two Indian languages, we found that the proposed framework gives better precision than the prevailing approaches.
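To make the script identification stage concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the two Indian languages are written in Devanagari and Bengali script (the abstract does not name them) and classifies a tweet by the Unicode block that most of its letters fall into. The names `SCRIPT_RANGES`, `char_script`, and `identify_script` are hypothetical.

```python
from collections import Counter

# Unicode block ranges. The choice of these two scripts is an assumption;
# the abstract only says "the top two Indian languages".
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
}


def char_script(ch):
    """Return the script name for one character, or None if it is not a letter we track."""
    code = ord(ch)
    for name, (low, high) in SCRIPT_RANGES.items():
        if low <= code <= high:
            return name
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None  # digits, punctuation, whitespace, emoji, other scripts


def identify_script(text):
    """Return the most frequent script in the text, or "Unknown" if none is found."""
    counts = Counter(s for s in map(char_script, text) if s is not None)
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for tweet in ["नमस्ते दुनिया", "আমি ভালো আছি", "kal milte hain"]:
        print(tweet, "->", identify_script(tweet))
```

A "Latin" result only identifies the script, not the language: the tweet may be English or a romanised Indian language, which is why a separate language analysis stage is still needed after this step.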
CITATION STYLE
Sarkar, B., Sinhababu, N., Roy, M., Pramanik, P. K. D., & Choudhury, P. (2020). Mining multilingual and multiscript Twitter data: Unleashing the language and script barrier. International Journal of Business Intelligence and Data Mining, 16(1), 1–19. https://doi.org/10.1504/IJBIDM.2020.103847