IndoNLI: A Natural Language Inference Dataset for Indonesian

Rahmad Mahendra; Alham Fikri Aji; Samuel Louvan; Fahrurrozi Rahman; Clara Vania

Conference Proceedings (Open Access)

EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings (2021) 10511-10527

DOI: 10.18653/v1/2021.emnlp-main.821

Abstract

We present IndoNLI, the first human-elicited NLI dataset for Indonesian. We adapt the data collection protocol for MNLI and collect ∼18K sentence pairs annotated by crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, and temporal and spatial reasoning. Experimental results show that XLM-R outperforms other pretrained models on our data. The best performance on the expert-annotated data is still far below human performance (13.4% accuracy gap), suggesting that this test set is especially challenging. Furthermore, our analysis shows that our expert-annotated data is more diverse and contains fewer annotation artifacts than the crowd-annotated data. We hope this dataset can help accelerate progress in Indonesian NLP research.

Citation (APA)

Mahendra, R., Aji, A. F., Louvan, S., Rahman, F., & Vania, C. (2021). IndoNLI: A Natural Language Inference Dataset for Indonesian. In EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 10511–10527). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.emnlp-main.821
