Perplexity-Driven Case Encoding Needs Augmentation for CAPITALIZATION Robustness


Abstract

Subword segmentation methods are the predominant solution to vocabulary sparsity in neural machine translation (NMT). However, they currently do not handle capitalization well. We re-encode case so that the perplexity-driven SentencePiece (SPM) unigram language model algorithm can learn how to segment capitalization. Since naturally occurring data accurately describes the prevalence of capitalization but underestimates the importance humans ascribe to capitalization robustness, we propose data augmentation to fill this gap. We demonstrate that our proposed method improves translation quality on ALL CAPS, lowercased, and Title Case text, while maintaining quality on standard test sets. In contrast to prior work, our proposed method has minimal impact on decoding speed. We release our code: github.com/marian-nmt/sentencepiece.
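The two ideas in the abstract, re-encoding case and augmenting the training data with case variants, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the marker symbol `UPPER_MARK`, the function names `encode_case`, `decode_case`, and `augment_case`, and the augmentation probability `p` are all assumptions made for illustration. The abstract does not specify the actual encoding scheme or augmentation rates.

```python
import random

# Hypothetical marker symbol (an assumption, not the symbol used in the paper).
UPPER_MARK = "\u2191"


def encode_case(text: str) -> str:
    """Rewrite each uppercase letter as a marker followed by its lowercase form.

    Letters whose case mapping does not round-trip one-to-one are left as-is.
    """
    out = []
    for ch in text:
        low = ch.lower()
        if ch.isupper() and len(low) == 1 and low.upper() == ch:
            out.append(UPPER_MARK + low)
        else:
            out.append(ch)
    return "".join(out)


def decode_case(text: str) -> str:
    """Invert encode_case: uppercase the character that follows each marker."""
    out = []
    upper_next = False
    for ch in text:
        if ch == UPPER_MARK:
            upper_next = True
        elif upper_next:
            out.append(ch.upper())
            upper_next = False
        else:
            out.append(ch)
    return "".join(out)


def augment_case(src: str, tgt: str, rng: random.Random, p: float = 0.1):
    """With probability p, apply one case transform to both sides of a pair.

    str.title is only a rough stand-in for Title Case.
    """
    if rng.random() >= p:
        return src, tgt
    transform = rng.choice([str.upper, str.lower, str.title])
    return transform(src), transform(tgt)
```

With an encoding of this kind, the segmenter sees the same lowercase characters whether or not a word is capitalized, and the case information travels in separate marker symbols that it is free to merge or split. The augmentation step then exposes the model to ALL CAPS, lowercased, and Title Case inputs more often than they occur naturally.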

Cite

Jain, R., Khayrallah, H., Grundkiewicz, R., & Junczys-Dowmunt, M. (2023). Perplexity-Driven Case Encoding Needs Augmentation for CAPITALIZATION Robustness. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Long Papers, IJCNLP-AACL 2023 (Vol. 2, pp. 146–156). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.ijcnlp-short.17
