This article describes the language identification system used by the SUKI team in the 2022 Nuanced Arabic Dialect Identification (NADI) shared task. In addition to the system description, we give some details of the dialect identification experiments we conducted while preparing our submissions. In the end, we submitted only one official run. We used a Naive Bayes-based language identifier with character n-grams of lengths one to four, for which we implemented a new version that automatically optimizes its parameters. We also experimented with clustering the training data according to different topics. With macro F1 scores of 0.1963 on test set A and 0.1058 on test set B, we placed 18th out of the 19 competing teams.
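As a rough illustration of the kind of classifier the abstract names, the following is a minimal sketch of a Naive Bayes identifier over character n-grams of lengths one to four. It is a generic multinomial formulation with additive smoothing, not the SUKI team's implementation; the class name `CharNgramNB`, the smoothing parameter `alpha`, and the toy training strings and labels are all assumptions made for this example. The automatic parameter optimization and topic clustering mentioned in the abstract are not shown.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=4):
    """Yield every character n-gram of length n_min..n_max, shortest first."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


class CharNgramNB:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n_min=1, n_max=4, alpha=1.0):
        self.n_min = n_min
        self.n_max = n_max
        self.alpha = alpha  # additive smoothing constant (an assumption)

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.doc_counts = Counter(labels)    # label -> number of documents
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text, self.n_min, self.n_max))
        # Shared vocabulary and per-label totals for the smoothed estimates.
        self.vocab = set().union(*self.counts.values())
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / n_docs)
            denom = self.totals[label] + self.alpha * vocab_size
            for gram in char_ngrams(text, self.n_min, self.n_max):
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data: placeholder strings and labels, not real dialect samples.
train_texts = ["aaaa abab", "abba aaab", "zzzz zyzy", "yzzy zzzy"]
train_labels = ["A", "A", "B", "B"]
model = CharNgramNB().fit(train_texts, train_labels)
prediction_a = model.predict("aab")
prediction_b = model.predict("zzy")
```

Scores are summed in log space so that long inputs do not underflow, and smoothing keeps an unseen n-gram from driving a class probability to zero. How unseen n-grams are penalized is one of the parameters a real system would need to tune.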
CITATION STYLE
Jauhiainen, T., Jauhiainen, H., & Lindén, K. (2022). Optimizing Naive Bayes for Arabic Dialect Identification. In WANLP 2022 - 7th Arabic Natural Language Processing - Proceedings of the Workshop (pp. 409–414). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.wanlp-1.40