Robust backed-off estimation of out-of-vocabulary embeddings

18 Citations · 74 Mendeley Readers

Abstract

Out-of-vocabulary (OOV) words cause serious problems when solving natural language tasks with neural networks. Existing approaches to this problem resort to subwords, which are shorter and more ambiguous units than words, and represent an OOV word with a bag of subwords. In this study, inspired by the processes by which new words are coined from known words, we propose a robust method for estimating OOV word embeddings by referring to pre-trained embeddings of known words whose surfaces are similar to the target OOV words. We collect known words by segmenting OOV words and by approximate string matching, and then aggregate their pre-trained embeddings. Experimental results show that the obtained OOV word embeddings improve not only word similarity tasks but also downstream tasks in the Twitter and biomedical domains, where OOV words frequently appear, even when the computed OOV embeddings are integrated into a strong BERT-based baseline.

CITATION STYLE

APA

Fukuda, N., Yoshinaga, N., & Kitsuregawa, M. (2020). Robust backed-off estimation of out-of-vocabulary embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 4827–4838). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.434
