HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text

Abstract

Text generation is a highly active area of research in the computational linguistics community. Evaluating generated text is challenging, and multiple theories and metrics have been proposed over the years. Unfortunately, for code-mixed languages, in which words and phrases from multiple languages are mixed within a single utterance of text or speech, text generation and evaluation remain relatively understudied because high-quality resources are scarce. To address this challenge, we present a corpus (HinGE) for Hinglish, a widely popular code-mixed language that mixes Hindi and English. For each pair of parallel Hindi-English sentences, HinGE contains Hinglish sentences generated by humans as well as by two rule-based algorithms. In addition, we demonstrate the inefficacy of widely used evaluation metrics on code-mixed data. The HinGE dataset will facilitate the progress of natural language generation research in code-mixed languages.

Cite

Srivastava, V., & Singh, M. (2021). HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text. In Eval4NLP 2021 - Evaluation and Comparison of NLP Systems, Proceedings of the 2nd Workshop (pp. 200–208). Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-056-4_020
