Perplexity-Driven Case Encoding Needs Augmentation for CAPITALIZATION Robustness

Rohit Jain; Huda Khayrallah; Roman Grundkiewicz; Marcin Junczys-Dowmunt

Conference Proceedings (Open Access)

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Long Papers, IJCNLP-AACL 2023 (2023), Vol. 2, pp. 146-156

DOI: 10.18653/v1/2023.ijcnlp-short.17

Abstract

Subword segmentation methods are the predominant solution to vocabulary sparsity in neural machine translation (NMT). However, current methods do not handle capitalization well. We re-encode case so that the perplexity-driven SentencePiece (SPM) unigram language model algorithm can learn how to segment capitalization. Since naturally occurring data accurately describes the prevalence of capitalization but underestimates the importance humans ascribe to capitalization robustness, we propose data augmentation to fill this gap. We demonstrate that our proposed method improves translation quality on ALL CAPS, lowercased, and Title Case text, while maintaining quality on standard test sets. In contrast to prior work, our proposed method has minimal impact on decoding speed. We release our code at github.com/marian-nmt/sentencepiece.
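
The two ideas in the abstract, re-encoding case and augmenting training data with case variants, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the marker symbols `<U>` and `<T>`, the function names, and the augmentation probability `p` are all assumptions made for illustration, and the released code may encode case quite differently. The sketch also assumes words are separated by single spaces.

```python
import random

# Hypothetical case markers (an assumption; the paper's actual encoding may differ).
UPPER, TITLE = "<U>", "<T>"


def encode_case(sentence: str) -> str:
    """Lowercase each word and prefix a marker that records its original case."""
    out = []
    for word in sentence.split():
        lower = word.lower()
        if word == lower:
            out.append(word)                 # already lowercase or caseless
        elif len(word) > 1 and word == lower.upper():
            out.append(UPPER + lower)        # ALL CAPS word
        elif word == lower.capitalize():
            out.append(TITLE + lower)        # Title Case word
        else:
            out.append(word)                 # mixed case: left unchanged
    return " ".join(out)


def decode_case(encoded: str) -> str:
    """Invert encode_case for single-space-separated input."""
    out = []
    for token in encoded.split():
        if token.startswith(UPPER):
            out.append(token[len(UPPER):].upper())
        elif token.startswith(TITLE):
            out.append(token[len(TITLE):].capitalize())
        else:
            out.append(token)
    return " ".join(out)


def title_case(sentence: str) -> str:
    """Capitalize each whitespace-separated word (avoids str.title quirks)."""
    return " ".join(w.capitalize() for w in sentence.split())


def augment_pair(src: str, tgt: str, rng: random.Random, p: float = 0.1):
    """With probability p (an assumed value), return a case-transformed copy
    of a parallel sentence pair; otherwise return the pair unchanged."""
    if rng.random() >= p:
        return src, tgt
    transform = rng.choice([str.upper, str.lower, title_case])
    return transform(src), transform(tgt)
```

Applying the same transform to both sides of a pair keeps source and target casing aligned, which is one plausible way to expose a model to ALL CAPS, lowercased, and Title Case inputs more often than they occur in natural data.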

Cite

Jain, R., Khayrallah, H., Grundkiewicz, R., & Junczys-Dowmunt, M. (2023). Perplexity-Driven Case Encoding Needs Augmentation for CAPITALIZATION Robustness. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Long Papers, IJCNLP-AACL 2023 (Vol. 2, pp. 146–156). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.ijcnlp-short.17
