Robust backed-off estimation of out-of-vocabulary embeddings

18 Citations · 74 Mendeley Readers

Abstract

Out-of-vocabulary (OOV) words cause serious problems when solving natural language tasks with neural networks. Existing approaches to this problem resort to subwords, which are shorter and more ambiguous units than words, and represent an OOV word with a bag of subwords. In this study, inspired by the processes by which new words are coined from known words, we propose a robust method for estimating OOV word embeddings by referring to pre-trained embeddings of known words whose surfaces are similar to the target OOV words. We collect known words by segmenting OOV words and by approximate string matching, and then aggregate their pre-trained embeddings. Experimental results show that the obtained OOV word embeddings improve not only word similarity tasks but also downstream tasks in the Twitter and biomedical domains, where OOV words frequently appear, even when the computed OOV embeddings are integrated into a strong BERT-based baseline.

CITATION STYLE

APA

Fukuda, N., Yoshinaga, N., & Kitsuregawa, M. (2020). Robust backed-off estimation of out-of-vocabulary embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 4827–4838). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.434
