Dataset Enhancement and Multilingual Transfer for Named Entity Recognition in the Indonesian Language

Siti Oryza Khairunnisa; Zhousi Chen; Mamoru Komachi

Journal Article, Open Access

ACM Transactions on Asian and Low-Resource Language Information Processing (2023) 22(6)

DOI: 10.1145/3592854

Abstract

Named entity recognition in the Indonesian language has developed significantly in recent years. However, it still lacks standardized, publicly available corpora; a small dataset is available but suffers from inconsistent annotations. Therefore, we re-annotated the dataset to improve its consistency and benefit the community. Our re-annotation led to better training results with an effective baseline model consisting of bidirectional long short-term memory and conditional random fields. To make full use of the limited available data, we exploited better contextualization and transferred external knowledge through monolingual and multilingual pre-trained language models, such as IndoBERT and XLM-RoBERTa. In addition to the general improvement from the language models, we observed that the monolingual model is more sensitive, while the multilingual ones show advantages in rich morphological knowledge. We also applied cross-lingual transfer learning to utilize high-resource corpora in other languages. We adopted English, Spanish, Dutch, and German as the source languages for the target Indonesian language and found that Dutch plays a special role in the data transfer method due to morphological similarity attributable to historical reasons.

Citation (APA)

Khairunnisa, S. O., Chen, Z., & Komachi, M. (2023). Dataset Enhancement and Multilingual Transfer for Named Entity Recognition in the Indonesian Language. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(6). https://doi.org/10.1145/3592854
