Morphosyntactic preprocessing impact on document embedding: An empirical study on semantic similarity

Abstract

Word embedding is among the most widely known and used representations of the vocabulary of text documents. It captures the context of a word in a document, but many applications need to understand the content of a text that is longer than a single word; this is what we call “document embedding”. This paper presents an empirical study of the impact of morphosyntactic data preprocessing on document embedding techniques, evaluated on a textual semantic similarity task. It compares the most widely known text preprocessing techniques: (1) cleaning, comprising stop-word removal, lowercase conversion, and punctuation and number elimination; (2) stemming, using the best-known algorithms in the literature, namely the Porter, Snowball, and Lancaster stemmers; and (3) lemmatization, using the WordNet lemmatizer. Experimental analysis on the MSRP (Microsoft Research Paraphrase) dataset reveals that preprocessing improves classifier accuracy, with stemming methods outperforming the other techniques.
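
As a rough illustration of the cleaning step listed in the abstract (lowercase conversion, punctuation and number elimination, stop-word removal), here is a minimal Python sketch. It is not the authors' code: the function name `clean`, the whitespace tokenization, and the tiny `STOP_WORDS` set are assumptions made for this example only.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; the paper's list is not given here).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "that"}


def clean(text: str) -> list:
    """Lowercase, drop punctuation and digits, then remove stop words."""
    text = text.lower()
    # Replace ASCII punctuation and digits with spaces so adjacent tokens stay separate.
    text = re.sub(f"[{re.escape(string.punctuation)}0-9]", " ", text)
    # Split on whitespace and filter out stop words.
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tokens = clean("The 2 cats are sitting on the mat, purring.")
print(tokens)  # ['cats', 'sitting', 'mat', 'purring']
```

The stemming and lemmatization steps could then be applied to the resulting tokens, for instance with NLTK's `PorterStemmer`, `SnowballStemmer`, `LancasterStemmer`, and `WordNetLemmatizer` classes; whether the authors used that library is not stated in the abstract.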

Cite

Yahi, N., & Belhadef, H. (2020). Morphosyntactic preprocessing impact on document embedding: An empirical study on semantic similarity. In Advances in Intelligent Systems and Computing (Vol. 1073, pp. 118–126). Springer. https://doi.org/10.1007/978-3-030-33582-3_12