Study of different document representation models for finding phrase-based similarity

Preeti Kathiria; Harshal Arolkar

Conference Proceedings

Smart Innovation, Systems and Technologies (2019) 106 455-464

DOI: 10.1007/978-981-13-1742-2_45

Abstract

To find phrase-based similarity among documents, the text data stored in each document must first be analyzed before any machine learning algorithm is applied. Because raw text is difficult to analyze directly, it needs to be broken into words or phrases, or converted to a numerical measure. To convert text data into a numerical measure, the well-known bag-of-words model with term frequency or the TF-IDF model can be used. The resulting numerical data, words, or phrases are then stored in a structure such as a vector, tree, or graph, known as a document representation model. The focus of this paper is to show how different document representation models can store words, phrases, or converted numerical data to find phrase-based similarity. Phrase-based similarity methods make use of word proximity, so they can be used to find syntactic similarities between documents in a corpus. The similarity is calculated based on the frequency of words or the frequency of phrases in sentences. This paper analyzes and compares different representation models on different parameters to find phrase-based similarity.
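
The contrast the abstract draws between word frequency and phrase frequency can be illustrated with a small sketch. It is not the authors' method: it assumes a simple vector (dictionary) representation, treats a "phrase" as a contiguous word n-gram, uses log(N / df) as one common IDF variant, and compares documents with cosine similarity. All function names and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequency(text, n=1):
    """Count contiguous n-word phrases; n=1 is the plain bag-of-words model."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tf_idf(docs, n=1):
    """Return one sparse TF-IDF vector (a dict) per document.

    Assumption: IDF is log(N / df), one of several variants in use.
    """
    counts = [term_frequency(d, n) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    total = len(docs)
    return [{t: tf * math.log(total / df[t]) for t, tf in c.items()}
            for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample sentences: same words, different word order.
doc_a = "the cat chased the dog"
doc_b = "the dog chased the cat"

word_score = cosine(term_frequency(doc_a, 1), term_frequency(doc_b, 1))
phrase_score = cosine(term_frequency(doc_a, 2), term_frequency(doc_b, 2))
print(word_score, phrase_score)
```

With these two sentences the word-level score is approximately 1.0, because both contain exactly the same words, while the two-word phrase score is 0.75, because only three of the four phrases in each sentence are shared. This is the sense in which a phrase-based measure is sensitive to word proximity and a pure bag-of-words measure is not.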

Citation (APA)

Kathiria, P., & Arolkar, H. (2019). Study of different document representation models for finding phrase-based similarity. In Smart Innovation, Systems and Technologies (Vol. 106, pp. 455–464). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-13-1742-2_45
