Abstract
Subword segmentation methods are the predominant solution to vocabulary sparsity in neural machine translation (NMT). However, they currently handle capitalization poorly. We re-encode case so that the perplexity-driven SentencePiece (SPM) unigram language model algorithm can learn how to segment capitalization. Naturally occurring data accurately reflects the prevalence of capitalization but underestimates the importance humans ascribe to capitalization robustness, so we propose data augmentation to fill this gap. We demonstrate that our proposed method improves translation quality on ALL CAPS, lowercased, and Title Case text, while maintaining quality on standard test sets. In contrast to prior work, our proposed method has minimal impact on decoding speed. We release our code: github.com/marian-nmt/sentencepiece.
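To make the two ideas in the abstract concrete, the following is a minimal sketch in Python, not the authors' implementation. The abstract does not specify the actual encoding, so the marker symbols `<U>` and `<T>`, the word-level granularity, and the function names `encode_case`, `decode_case`, and `case_variants` are all illustrative assumptions. The first pair of functions shows one reversible way to move case information into separate symbols so that a segmenter only sees lowercased text; the last function produces the three casing styles named in the abstract as augmented copies of a sentence.

```python
# Hypothetical marker symbols (assumed for illustration; not taken from the paper).
UPPER = "<U>"  # the following word was ALL CAPS
TITLE = "<T>"  # the following word was Title-cased


def encode_case(sentence: str) -> str:
    """Lowercase ALL CAPS and Title-cased words, prefixing each with a case marker.

    Words with any other casing pattern (e.g. "iPhone") are left unchanged.
    """
    out = []
    for word in sentence.split():
        if len(word) > 1 and word.isupper():
            out.extend([UPPER, word.lower()])
        elif word[0].isupper() and word[1:] == word[1:].lower():
            out.extend([TITLE, word.lower()])
        else:
            out.append(word)
    return " ".join(out)


def decode_case(encoded: str) -> str:
    """Invert encode_case by applying each marker to the word that follows it."""
    out = []
    pending = None
    for token in encoded.split():
        if token in (UPPER, TITLE):
            pending = token
        elif pending == UPPER:
            out.append(token.upper())
            pending = None
        elif pending == TITLE:
            out.append(token[0].upper() + token[1:])
            pending = None
        else:
            out.append(token)
    return " ".join(out)


def case_variants(sentence: str) -> dict:
    """Return ALL CAPS, lowercased, and Title Case copies of a sentence."""
    words = sentence.split()
    return {
        "upper": sentence.upper(),
        "lower": sentence.lower(),
        "title": " ".join(w[:1].upper() + w[1:].lower() for w in words),
    }


encoded = encode_case("The NMT system")
restored = decode_case(encoded)
variants = case_variants("the NMT system")
```

In this sketch the markers are ordinary whitespace-separated tokens; how the case symbols interact with subword segmentation, and how augmented copies are paired across source and target sides, are design questions the abstract leaves open.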
Cite
Jain, R., Khayrallah, H., Grundkiewicz, R., & Junczys-Dowmunt, M. (2023). Perplexity-Driven Case Encoding Needs Augmentation for CAPITALIZATION Robustness. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Short Papers, IJCNLP-AACL 2023 (Vol. 2, pp. 146–156). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.ijcnlp-short.17