Word Embedding for Small and Domain-specific Malay Corpus

Abstract

In this paper, we present the process of training a word embedding (WE) model for a small, domain-specific Malay corpus. In this study, the Hansard corpus of the Malaysian Parliament for specific years was trained with the Word2vec model. However, a careful setting of the hyperparameters is required to obtain an accurate WE model, because changing any one hyperparameter affects the model's performance. We therefore trained the corpus into a series of WE models, where each model differed from the others in the value of a single hyperparameter. The models' performance was intrinsically evaluated using three semantic word relations, namely word similarity, dissimilarity and analogy. The evaluation was performed on the model outputs and analysed by experts (corpus linguists). The experts' evaluation on this small, domain-specific corpus showed that suitable hyperparameters were a window size of 5 or 10, a vector size of 50 to 100, and the Skip-gram architecture.
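As an illustration of the procedure the abstract describes, the sketch below trains a small grid of Word2vec models in which one hyperparameter is varied at a time, then runs the kind of intrinsic similarity and analogy queries the authors evaluate. It is a minimal sketch, not the authors' code: the gensim library, the toy Malay sentences, the epoch count and the query words are all assumptions made for illustration.

```python
# Minimal sketch: a hyperparameter sweep over Word2vec on a small Malay corpus,
# varying window size and vector size with the Skip-gram architecture.
from gensim.models import Word2Vec

# Placeholder tokenised sentences; the real input would be the tokenised
# Hansard transcripts of the Malaysian Parliament.
sentences = [
    ["ahli", "parlimen", "membentang", "usul", "baharu"],
    ["menteri", "menjawab", "soalan", "ahli", "parlimen"],
    ["dewan", "rakyat", "meluluskan", "usul", "itu"],
]

models = {}
for window in (5, 10):             # window sizes the study found suitable
    for vector_size in (50, 100):  # vector sizes the study found suitable
        models[(window, vector_size)] = Word2Vec(
            sentences,
            vector_size=vector_size,
            window=window,
            sg=1,         # Skip-gram architecture, as recommended
            min_count=1,  # small corpus: keep rare words (assumption)
            epochs=10,    # illustrative value, not from the paper
        )

# Intrinsic checks mirroring the paper's evaluation (query words are examples):
m = models[(5, 100)]
print(m.wv.most_similar("parlimen", topn=3))            # word similarity
print(m.wv.most_similar(positive=["menteri", "usul"],
                        negative=["dewan"], topn=2))    # analogy-style query
```

In practice, the outputs of such `most_similar` queries would be collected for each model in the grid and passed to the corpus linguists for expert judgement, as described in the abstract.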

CITATION STYLE

APA

Tiun, S., Nor, N. F. M., Jalaludin, A., & Rahman, A. N. C. A. (2020). Word Embedding for Small and Domain-specific Malay Corpus. In Lecture Notes in Electrical Engineering (Vol. 603, pp. 435–443). Springer Verlag. https://doi.org/10.1007/978-981-15-0058-9_42
