Word Embedding for Small and Domain-specific Malay Corpus

Sabrina Tiun; Nor Fariza Mohd Nor; Azhar Jalaludin; Anis Nadiah Che Abdul Rahman

Conference Proceedings

Lecture Notes in Electrical Engineering (2020) 603 435-443

DOI: 10.1007/978-981-15-0058-9_42

Abstract

In this paper, we present the process of training a word embedding (WE) model for a small, domain-specific Malay corpus. In this study, the Hansard corpus of the Malaysian Parliament for specific years was used to train a Word2vec model. Because changing a single hyperparameter affects the model’s performance, a specific hyperparameter setting is required to obtain an accurate WE model. We trained a series of WE models on the corpus, each differing from the others in the value of one hyperparameter. Model performance was evaluated intrinsically using three semantic word relations: word similarity, dissimilarity and analogy. The evaluation was based on the model output, which was analysed by experts (corpus linguists). The experts’ evaluation showed that, for a small, domain-specific corpus, suitable hyperparameters were a window size of 5 or 10, a vector size of 50 to 100 and the Skip-gram architecture.
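The one-hyperparameter-at-a-time setup described in the abstract could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the baseline and candidate values are assumptions chosen to bracket the ranges the abstract reports, and `one_at_a_time_configs` and `train_model` are hypothetical helper names. The training helper assumes gensim 4 or later and imports it lazily, so the configuration logic runs without it.

```python
from typing import Dict, List

# Baseline configuration. These values are illustrative assumptions,
# not the settings reported in the paper.
BASELINE: Dict[str, int] = {"window": 5, "vector_size": 100, "sg": 1}

# Candidate values per hyperparameter (assumed for illustration).
CANDIDATES: Dict[str, List[int]] = {
    "window": [2, 5, 10],
    "vector_size": [50, 100, 300],
    "sg": [0, 1],  # 0 = CBOW, 1 = Skip-gram
}


def one_at_a_time_configs(
    baseline: Dict[str, int], candidates: Dict[str, List[int]]
) -> List[Dict[str, int]]:
    """Return the baseline plus every config that changes exactly one value."""
    configs = [dict(baseline)]
    for name, values in candidates.items():
        for value in values:
            if value == baseline[name]:
                continue  # identical to the baseline, already included
            cfg = dict(baseline)
            cfg[name] = value
            configs.append(cfg)
    return configs


def train_model(sentences: List[List[str]], config: Dict[str, int]):
    """Train one Word2vec model for a config (hypothetical helper).

    Assumes gensim >= 4 is installed; imported here so that building
    the configuration list does not depend on it.
    """
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, min_count=1, seed=0, workers=1, **config)


configs = one_at_a_time_configs(BASELINE, CANDIDATES)
for cfg in configs:
    print(cfg)
```

Given a tokenised corpus as a list of token lists, each entry of `configs` could then be passed to `train_model`, and the resulting models compared on similarity, dissimilarity and analogy queries by human judges, as the abstract describes.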

Tiun, S., Nor, N. F. M., Jalaludin, A., & Rahman, A. N. C. A. (2020). Word Embedding for Small and Domain-specific Malay Corpus. In Lecture Notes in Electrical Engineering (Vol. 603, pp. 435–443). Springer Verlag. https://doi.org/10.1007/978-981-15-0058-9_42
