Word embedding is among the most widely known and used representations of a text document's vocabulary. It captures the context of a word in a document, but many applications need to understand the content of a text that is longer than a single word; representing such text is what we call “Document Embedding”. This paper presents an empirical study of the impact of morphosyntactic data preprocessing on document embedding techniques, evaluated on a textual semantic similarity task. It compares the most widely known text preprocessing techniques: (1) cleaning, comprising stop-word removal, lowercase conversion, and punctuation and number elimination; (2) stemming, using the best-known algorithms in the literature (the Porter, Snowball, and Lancaster stemmers); and (3) lemmatization, using the WordNet Lemmatizer. Experimental analysis on the MSRP (Microsoft Research Paraphrase) dataset reveals that preprocessing techniques improve classifier accuracy, with stemming methods outperforming the other techniques.
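The cleaning step listed in (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the stop-word list is a tiny hypothetical sample, and `toy_stem` is a crude suffix stripper standing in for a real stemmer (it is not the Porter, Snowball, or Lancaster algorithm). The function names `clean`, `toy_stem`, and `preprocess` are invented for this example.

```python
import re
import string

# Assumption: a tiny illustrative stop-word list, not the one used in the paper.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def clean(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation and digits, then drop stop words."""
    text = text.lower()
    # Replace punctuation and digits with spaces so adjacent words do not merge.
    text = re.sub(f"[{re.escape(string.punctuation)}0-9]", " ", text)
    return [tok for tok in text.split() if tok not in stop_words]


def toy_stem(token):
    """Crude suffix stripper (hypothetical stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, normalize=None):
    """Clean the text, then optionally map each token through `normalize`."""
    tokens = clean(text)
    return [normalize(t) for t in tokens] if normalize else tokens


if __name__ == "__main__":
    sentence = "The 3 cats are running, quickly!"
    print(preprocess(sentence))
    print(preprocess(sentence, normalize=toy_stem))
```

Passing the normalizer as a parameter keeps the cleaning step fixed while the stemming or lemmatization step is swapped, which mirrors the kind of comparison the study describes; in practice a library stemmer or lemmatizer would replace `toy_stem`.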
CITATION STYLE
Yahi, N., & Belhadef, H. (2020). Morphosyntactic preprocessing impact on document embedding: An empirical study on semantic similarity. In Advances in Intelligent Systems and Computing (Vol. 1073, pp. 118–126). Springer. https://doi.org/10.1007/978-3-030-33582-3_12