Collecting and annotating Indian social media code-mixed corpora

Anupam Jamatia; Björn Gambäck; Amitava Das

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2018) 9624 LNCS 406-417

DOI: 10.1007/978-3-319-75487-1_32

Abstract

The pervasiveness of social media in the present digital era has empowered the ‘netizens’ to be more creative and interactive, and to generate content using free language forms that often are closer to spoken language and hence show phenomena previously mainly analysed in speech. One such phenomenon is code-mixing, which occurs when multilingual persons switch freely between the languages they have in common. Code-mixing presents many new challenges for language processing and the paper discusses some of them, taking as a starting point the problems of collecting and annotating three corpora of code-mixed Indian social media text: one corpus with English-Bengali Twitter messages and two corpora containing English-Hindi Twitter and Facebook messages, respectively. We present statistics of these corpora, discuss part-of-speech tagging of the corpora using both a coarse-grained and a fine-grained tag set, and compare their complexity to several other code-mixed corpora based on a Code-Mixing Index.

Cite

Jamatia, A., Gambäck, B., & Das, A. (2018). Collecting and annotating Indian social media code-mixed corpora. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9624 LNCS, pp. 406–417). Springer Verlag. https://doi.org/10.1007/978-3-319-75487-1_32
