Learning Chinese Word Embeddings with Words and Subcharacter N-Grams

Ruizhi Kang; Hongjun Zhang; Wenning Hao; Kai Cheng; Guanglu Zhang

Journal Article, Open Access

IEEE Access (2019) 7 42987-42992

DOI: 10.1109/ACCESS.2019.2908014

Abstract

Co-occurrence information between words is the basis for training word embeddings. In addition, Chinese characters are composed of subcharacters, and words built from the same characters or subcharacters usually have similar semantics, yet this internal substructure information is usually neglected by popular models. In this paper, we propose a novel method for learning Chinese word embeddings that makes full use of both external co-occurrence context information and internal substructure information. We represent each word as a bag of subcharacter n-grams, and our model learns vector representations for the word and for its subcharacter n-grams. The final word embedding is the sum of these two kinds of vector representation, so that the learned embeddings take into account both internal structure information and external co-occurrence information. Experiments show that our method outperforms state-of-the-art methods on benchmarks.
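
The composition step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the decomposition table `COMPONENTS`, the n-gram range (1 to 2), the `Table` class, and the random initialisation are all assumptions made for illustration, and the training objective is omitted entirely. The sketch only shows how a word might be turned into a bag of subcharacter n-grams and how a word vector and its n-gram vectors might be summed.

```python
import random

# Hypothetical decomposition table: character -> subcharacter components.
# Illustrative only; a real system would load this from a lexicon resource.
COMPONENTS = {
    "好": ["女", "子"],
    "妈": ["女", "马"],
}

def subchar_ngrams(word, n_min=1, n_max=2):
    """Return contiguous n-grams over the word's flattened component sequence."""
    seq = []
    for ch in word:
        # Assumption: characters missing from the table stand in for themselves.
        seq.extend(COMPONENTS.get(ch, [ch]))
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(seq) - n + 1):
            grams.append("".join(seq[i:i + n]))
    return grams

class Table:
    """Lookup table that creates a random vector the first time a key is seen."""
    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.vecs = {}

    def get(self, key):
        if key not in self.vecs:
            self.vecs[key] = [self.rng.uniform(-0.5, 0.5) for _ in range(self.dim)]
        return self.vecs[key]

def compose(word, word_table, gram_table):
    """Sum the word's own vector with the vectors of its subcharacter n-grams."""
    vec = list(word_table.get(word))  # copy so the stored vector is not mutated
    for gram in subchar_ngrams(word):
        gram_vec = gram_table.get(gram)
        for k in range(len(vec)):
            vec[k] += gram_vec[k]
    return vec

if __name__ == "__main__":
    words = Table(dim=4, seed=1)
    grams = Table(dim=4, seed=2)
    print(subchar_ngrams("好"))
    print(compose("好", words, grams))
```

Under this sketch, 好 and 妈 share the n-gram 女, so their composed vectors share one summand; that is the sense in which words with common subcharacters can end up with related embeddings. Whether the n-gram vectors are summed or averaged before being added to the word vector is not stated in the abstract, so the plain sum used here is an assumption.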

Cite (APA)

Kang, R., Zhang, H., Hao, W., Cheng, K., & Zhang, G. (2019). Learning Chinese Word Embeddings with Words and Subcharacter N-Grams. IEEE Access, 7, 42987–42992. https://doi.org/10.1109/ACCESS.2019.2908014
