Thai Nested Named Entity Recognition Corpus

Weerayut Buaphet; Can Udomcharoenchaikit; Peerat Limkonchotiwat; Attapol T. Rutherford; Sarana Nutanong

Conference Proceedings (Open Access)

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2022) 1473-1486

DOI: 10.18653/v1/2022.findings-acl.116

Abstract

This paper presents the first Thai Nested Named Entity Recognition (N-NER) dataset. Thai N-NER consists of 264,798 mentions, 104 classes, and a maximum depth of 8 layers, obtained from a total of 4,894 documents drawn from news articles and restaurant reviews. To the best of our knowledge, our work presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes. To understand the new challenges our proposed dataset brings to the field, we conduct an experimental study on (i) cutting-edge N-NER models with state-of-the-art accuracy in English and (ii) baseline methods based on well-known language model architectures. From the experimental results, we obtain two key findings. First, all models produce poor F1 scores in the tail region of the class distribution. Second, these models provide little or no performance improvement over the baseline methods on our Thai dataset. These findings suggest that further investigation is required to make a multilingual N-NER solution that works well across different languages. The dataset and code are available at: github.com/vistec-AI/Thai-NNER.git.
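To make the notion of nesting depth in the abstract concrete, the following is a minimal sketch of how the layer of each mention could be computed from span offsets. It is an illustration only, not the authors' code or the corpus format: the `(start, end, label)` tuple layout, the function name `nesting_depths`, and the labels `role`, `org`, and `country` are all hypothetical. It assumes mentions are properly nested (no crossing spans) and that no two mentions share identical boundaries.

```python
from typing import List, Tuple

# Hypothetical span layout: (start, end, label), token offsets, end exclusive.
Span = Tuple[int, int, str]


def nesting_depths(spans: List[Span]) -> List[int]:
    """Return the layer of each span: 1 + the number of other spans containing it."""
    depths = []
    for i, (start, end, _) in enumerate(spans):
        containers = sum(
            1
            for j, (other_start, other_end, _) in enumerate(spans)
            if j != i
            and other_start <= start
            and end <= other_end
            and (other_start, other_end) != (start, end)
        )
        depths.append(1 + containers)
    return depths


# Illustrative English tokens and labels (not taken from the corpus).
tokens = ["Bank", "of", "Thailand", "Governor"]
mentions = [
    (0, 4, "role"),     # "Bank of Thailand Governor"
    (0, 3, "org"),      # "Bank of Thailand"
    (2, 3, "country"),  # "Thailand"
]

depths = nesting_depths(mentions)
max_depth = max(depths)
```

Under these assumptions, the maximum value over all mentions in a document corresponds to the kind of "maximum depth" figure reported for the dataset, though the corpus itself may define layers differently.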

Cite

Buaphet, W., Udomcharoenchaikit, C., Limkonchotiwat, P., Rutherford, A. T., & Nutanong, S. (2022). Thai Nested Named Entity Recognition Corpus. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 1473–1486). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.findings-acl.116
