Morphosyntactic preprocessing impact on document embedding: An empirical study on semantic similarity

Nourelhouda Yahi; Hacene Belhadef

Conference Proceedings

Advances in Intelligent Systems and Computing (2020) 1073 118-126

DOI: 10.1007/978-3-030-33582-3_12

5 Citations

8 Readers

Abstract

Word embedding is among the most widely known and used representations of the vocabulary of text documents. It captures the context of a word in a document, but many applications need to understand the content of a text that is longer than a single word; this is what we call “Document Embedding”. This paper presents an empirical study of the impact of morphosyntactic data preprocessing on document embedding techniques in a textual semantic similarity task. It compares the most widely known text preprocessing techniques: (1) cleaning, comprising stop-word removal, lowercase conversion, and punctuation and number elimination; (2) stemming, using the best-known algorithms in the literature, namely the Porter, Snowball, and Lancaster stemmers; and (3) lemmatization, using the WordNet Lemmatizer. Experimental analysis on the MSRP (Microsoft Research Paraphrase) dataset reveals that preprocessing techniques improve classifier accuracy, with stemming methods outperforming the other techniques.
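The cleaning step listed in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `clean` and the tiny `STOP_WORDS` set are hypothetical, and the abstract does not say which stop-word list or tokenizer was used.

```python
import re
import string
from typing import List

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the abstract does not specify which list the study used.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}

# Map every ASCII punctuation character to a space, so that
# "mat,quietly" splits into two tokens instead of fusing into one.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean(text: str) -> List[str]:
    """Lowercase, strip punctuation and numbers, then drop stop-words."""
    text = text.lower()
    text = text.translate(_PUNCT_TO_SPACE)
    text = re.sub(r"\d+", " ", text)  # number elimination
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tokens = clean("The 2 cats are sitting on the mat, quietly.")
print(tokens)
```

In a fuller pipeline, the resulting tokens would then be passed to a stemmer or lemmatizer (for example NLTK's `PorterStemmer`, `SnowballStemmer`, `LancasterStemmer`, or `WordNetLemmatizer`) before the documents are embedded; that step is left out here to keep the sketch free of external dependencies.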

Cite (APA)

Yahi, N., & Belhadef, H. (2020). Morphosyntactic preprocessing impact on document embedding: An empirical study on semantic similarity. In Advances in Intelligent Systems and Computing (Vol. 1073, pp. 118–126). Springer. https://doi.org/10.1007/978-3-030-33582-3_12
