Named entity recognition in the Indonesian language has developed significantly in recent years. However, the task still lacks standardized, publicly available corpora; a small dataset is available but suffers from inconsistent annotations. Therefore, we re-annotated the dataset to improve its consistency and benefit the community. Our re-annotation led to better training results from an effective baseline model consisting of bidirectional long short-term memory and conditional random fields. To make full use of the limited available data, we applied better contextualization and transferred external knowledge by exploiting monolingual and multilingual pre-trained language models, such as IndoBERT and XLM-RoBERTa. In addition to the general improvement from the language models, we observed that the monolingual model is more sensitive, while the multilingual ones show advantages in rich morphological knowledge. We also applied cross-lingual transfer learning to exploit high-resource corpora in other languages. We adopted English, Spanish, Dutch, and German as the source languages for the target Indonesian language and found that Dutch plays a special role in the data transfer method due to morphological similarity attributable to historical reasons.
Khairunnisa, S. O., Chen, Z., & Komachi, M. (2023). Dataset Enhancement and Multilingual Transfer for Named Entity Recognition in the Indonesian Language. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(6). https://doi.org/10.1145/3592854