Abstract
This paper presents several subword-modelling-based approaches to interlinear glossing for seven under-resourced languages as part of the 2023 SIGMORPHON shared task on interlinear glossing (Ginn et al., 2023). In an interlinear glossed text (IGT), each line of the original text is paired with one or more corresponding lines that encode its underlying grammatical structure. While expert-annotated glossed text is especially valuable for the study of low-resource languages in both theoretical linguistics and natural language processing, generating high-quality glossed data is expensive and time-consuming. Approaches that automatically or semi-automatically generate glossed data are therefore valuable for linguistic research. We experiment with various augmentation and tokenization strategies for both the open and closed data tracks. We find that while subword models may perform well with greater amounts of data, character-based approaches remain competitive in lower-resource settings.
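To make the IGT structure described above concrete, a minimal sketch of such a record follows. The example sentence, glosses, and the `IGTEntry` class are hypothetical illustrations, not taken from the paper or the shared-task data; the comparison of character-level units with subword units is likewise only schematic.

```python
# Minimal sketch of an interlinear glossed text (IGT) record.
# The example sentence and its glosses are hypothetical and purely illustrative.
from dataclasses import dataclass


@dataclass
class IGTEntry:
    transcription: str  # original-language line
    gloss: str          # morpheme-by-morpheme grammatical gloss
    translation: str    # free translation


entry = IGTEntry(
    transcription="ni-ka-soma kitabu",
    gloss="1SG-PST-read book",
    translation="I read a book",
)

# A character-level model treats every symbol as a unit (here "#" marks a
# word boundary), whereas a subword model would merge frequent character
# sequences such as "ni" or "soma" into single tokens.
char_tokens = list(entry.transcription.replace(" ", "#"))
print(char_tokens)
```

Each gloss element aligns with one morpheme of the transcription, which is what makes expert-annotated IGT so informative for downstream modelling.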
Citation
Cross, Z., Yun, M., Apparaju, A., MacCabe, J., Nicolai, G., & Silfverberg, M. (2023). Glossy Bytes: Neural Glossing using Subword Encoding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 222–229). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.sigmorphon-1.24