Study of different document representation models for finding phrase-based similarity

Abstract

To find phrase-based similarity among documents, the text stored in each document must first be analyzed before any machine learning algorithm is applied. Because raw text is difficult to analyze directly, it needs to be broken into words or phrases, or converted to a numerical measure. To convert text into a numerical measure, the well-known bag-of-words model with term frequency or the TF-IDF model can be used. The resulting numerical data, words, or phrases are then stored in a structure such as a vector, tree, or graph, known as a document representation model. The focus of this paper is to show how different document representation models can store words, phrases, or converted numerical data to find phrase-based similarity. Phrase-based similarity methods make use of word proximity, so they can be used to find syntactic similarities between documents in a corpus. The similarity is calculated from the frequency of words or the frequency of phrases in sentences. This paper analyzes and compares different representation models on different parameters to find phrase-based similarity.
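
As a rough illustration of the distinction the abstract draws between word frequency and phrase frequency, the following minimal Python sketch stores each document as a sparse count vector and compares two documents by cosine similarity. It is not the method of the paper: the helper names (`tokenize`, `phrase_counts`, `cosine`), the use of contiguous two-word phrases, and the choice of cosine similarity are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_counts(text, n=2):
    """Count contiguous n-word phrases; n=1 reduces to bag-of-words counts."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two toy documents with identical words but different word order.
doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

# Bag-of-words (n=1) ignores word order entirely.
word_sim = cosine(phrase_counts(doc_a, 1), phrase_counts(doc_b, 1))

# Two-word phrases (n=2) capture word proximity, so reordering lowers the score.
phrase_sim = cosine(phrase_counts(doc_a, 2), phrase_counts(doc_b, 2))

print(f"word-level similarity:   {word_sim:.2f}")
print(f"phrase-level similarity: {phrase_sim:.2f}")
```

In this toy case the bag-of-words vectors are identical, while the phrase-level score is lower because only three of the four two-word phrases are shared. A vector is only one of the representation models the abstract mentions; tree and graph models would store the same phrases in a different structure.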

Kathiria, P., & Arolkar, H. (2019). Study of different document representation models for finding phrase-based similarity. In Smart Innovation, Systems and Technologies (Vol. 106, pp. 455–464). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-13-1742-2_45