To find phrase-based similarity among documents, the text stored in each document must first be analyzed before any machine learning algorithm is applied. Because raw text is difficult to analyze directly, it needs to be broken into words or phrases, or converted to a numerical measure. To convert text into a numerical measure, the well-known bag-of-words model with term frequency or the TF-IDF model can be used. The resulting words, phrases, or numerical data are then stored in a structure such as a vector, tree, or graph, known as a document representation model. This paper shows how different document representation models can store words, phrases, or converted numerical data for finding phrase-based similarity. Phrase-based similarity methods make use of word proximity, so they can be used to find syntactic similarities between documents in a corpus. The similarity is calculated from the frequency of words or of phrases in sentences. The paper analyzes and compares different representation models on several parameters for finding phrase-based similarity.
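As a rough illustration of the pipeline the abstract describes, the sketch below builds a vector-style representation from term frequencies, applies a TF-IDF weighting, and compares documents by words and by phrases. It is a minimal sketch and not the authors' method: the function names (`tokenize`, `phrases`, `tf_idf`, `cosine`), the use of two-word phrases, the `log(N / df)` form of IDF, and cosine similarity as the measure are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrases(tokens, n=2):
    """Return contiguous n-word phrases, which keep word proximity (assumed n=2)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs_terms):
    """Weight each term count by log(N / document frequency) (one common IDF form)."""
    n_docs = len(docs_terms)
    df = Counter(term for terms in docs_terms for term in set(terms))
    return [
        {t: c * math.log(n_docs / df[t]) for t, c in Counter(terms).items()}
        for terms in docs_terms
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: the first two documents use the same words
# in a different order.
docs = [
    "the quick brown fox",
    "the quick fox brown",
    "a slow green turtle",
]
tokens = [tokenize(d) for d in docs]

# Bag-of-words with term frequency: word order is discarded.
word_tf = [Counter(t) for t in tokens]
word_sim = cosine(word_tf[0], word_tf[1])

# Phrase frequencies: word order (proximity) now affects the score.
phrase_tf = [Counter(phrases(t)) for t in tokens]
phrase_sim = cosine(phrase_tf[0], phrase_tf[1])

# TF-IDF weighting of the word counts.
word_tfidf = tf_idf(tokens)

print(f"word-level similarity:   {word_sim:.3f}")
print(f"phrase-level similarity: {phrase_sim:.3f}")
```

In this toy corpus the first two documents are identical as bags of words but share only one of their three two-word phrases, so the phrase-level score is lower than the word-level score. That gap is the sensitivity to word proximity that phrase-based methods rely on.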
Kathiria, P., & Arolkar, H. (2019). Study of different document representation models for finding phrase-based similarity. In Smart Innovation, Systems and Technologies (Vol. 106, pp. 455–464). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-13-1742-2_45