Using subword n-grams to train word embeddings makes it possible to subsequently compute vectors for rare and misspelled words. However, we argue that the quality of subword vectors can degrade for words with a large orthographic neighbourhood, a property of words that has been studied extensively in the psycholinguistic literature. Empirical findings about lexical neighbourhood effects constrain models of human word encoding, which must also be consistent with what is known about neurophysiological mechanisms in the visual word recognition system. We suggest that the constraints learned from human readers provide novel insights for subword encoding schemes. This paper shows that vectors trained with subword units informed by psycholinguistic evidence are superior to those trained with ad hoc n-grams. We argue that the physiological mechanisms of reading are key factors in the observed distribution of written word forms, and should therefore inform the choice of word encoding.
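To make the contrast in the abstract concrete, the following is an illustrative sketch (not the authors' code): contiguous character n-grams of the kind used by fastText-style subword embeddings, next to "open" (non-contiguous, order-preserving) bigrams of the kind studied in the psycholinguistic word-recognition literature the paper draws on.

```python
from itertools import combinations

def contiguous_ngrams(word, n=3):
    """Contiguous character n-grams with boundary markers, fastText-style."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def open_bigrams(word):
    """All ordered letter pairs, adjacent or not ("open" bigrams)."""
    return ["".join(pair) for pair in combinations(word, 2)]

print(contiguous_ngrams("word"))  # ['<wo', 'wor', 'ord', 'rd>']
print(open_bigrams("word"))       # ['wo', 'wr', 'wd', 'or', 'od', 'rd']
```

Note how the open-bigram code captures relative letter order without requiring adjacency, which is one way an encoding can remain robust to transpositions and misspellings.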
Veres, C., & Kapustin, P. (2020). Enhancing subword embeddings with open n-grams. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12089 LNCS, pp. 3–15). Springer. https://doi.org/10.1007/978-3-030-51310-8_1