In this paper, we present the process of training a word embedding (WE) model for a small, domain-specific Malay corpus. The Hansard corpus of the Malaysian Parliament for specific years was used to train a Word2vec model. Because changing a single hyperparameter affects the model’s performance, a specific hyperparameter setting is required to obtain an accurate WE model. We therefore trained a series of WE models on the corpus, with each model differing from the others in the value of one hyperparameter. Model performance was evaluated intrinsically using three semantic word relations, namely word similarity, dissimilarity and analogy. The evaluation was based on the model output and analysed by experts (corpus linguists). The experts’ evaluation showed that, for a small, domain-specific corpus, the suitable hyperparameters were a window size of 5 or 10, a vector size of 50 to 100 and the Skip-gram architecture.
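The one-hyperparameter-at-a-time design described above can be sketched as follows. This is a minimal illustration, not the authors' code: the baseline setting, the candidate values and the function name `one_at_a_time` are assumptions, chosen only to cover the values reported in the abstract. The key names (`window`, `vector_size`, `sg`) follow the gensim 4 `Word2Vec` keyword arguments, which is also an assumption, since the abstract does not name the toolkit used.

```python
from typing import Dict, List

# Hypothetical baseline and candidate values. The abstract reports only the
# values judged suitable (window 5 or 10, vector size 50 to 100, Skip-gram),
# not the full grid that was explored.
BASELINE: Dict[str, int] = {"window": 5, "vector_size": 100, "sg": 1}
CANDIDATES: Dict[str, List[int]] = {
    "window": [5, 10],
    "vector_size": [50, 100],
    "sg": [0, 1],  # 0 = CBOW, 1 = Skip-gram
}


def one_at_a_time(baseline: Dict[str, int],
                  candidates: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Return the baseline plus every setting that changes exactly one value."""
    configs = [dict(baseline)]
    for name, values in candidates.items():
        for value in values:
            if value != baseline[name]:
                cfg = dict(baseline)  # copy so the baseline stays unchanged
                cfg[name] = value
                configs.append(cfg)
    return configs


configs = one_at_a_time(BASELINE, CANDIDATES)
for cfg in configs:
    print(cfg)
```

Each resulting dictionary could then be passed to a Word2vec trainer, for example as `Word2Vec(sentences, **cfg)` if gensim 4 is used, and the trained models compared on the similarity, dissimilarity and analogy tasks.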
CITATION STYLE
Tiun, S., Nor, N. F. M., Jalaludin, A., & Rahman, A. N. C. A. (2020). Word Embedding for Small and Domain-specific Malay Corpus. In Lecture Notes in Electrical Engineering (Vol. 603, pp. 435–443). Springer Verlag. https://doi.org/10.1007/978-981-15-0058-9_42