Determining Cosine Similarity Neighborhoods by Means of the Euclidean Distance

Abstract

The cosine similarity measure is widely applied in information retrieval, text classification, clustering, and ranking, where documents are usually represented as term frequency vectors or variants thereof, such as tf-idf vectors. In these tasks, the most time-consuming operation is the identification of the most similar (or, equivalently, the least dissimilar) vectors. This operation has commonly been believed to be inefficient for large high-dimensional datasets. However, a recently proposed technique that uses the triangle inequality to determine neighborhoods with respect to a distance metric makes this operation feasible for such datasets. Although the cosine similarity measure is not a distance metric and, in particular, violates the triangle inequality, in this chapter we show how to determine cosine similarity neighborhoods of vectors by applying the Euclidean distance to (α-)normalized forms of these vectors and by using the triangle inequality. We address three types of sets of cosine-similar vectors: the set of all vectors whose similarity to a given vector is not less than a threshold ε, and two variants of the k nearest neighbors of a given vector. © Springer-Verlag Berlin Heidelberg 2013.
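
As an illustration of the transformation described in the abstract, the sketch below shows how, for unit-normalized vectors, a cosine similarity threshold ε translates into a Euclidean radius sqrt(2 − 2ε), and how a single reference vector together with the triangle inequality prunes candidates before any exact distance is computed. This is a minimal sketch, not the chapter's exact algorithm: the function name cosine_eps_neighborhood, the default reference-vector choice, and the NumPy-based implementation are illustrative assumptions.

```python
import numpy as np

def cosine_eps_neighborhood(X, q, eps, ref=None):
    """Indices of rows of X whose cosine similarity to q is at least eps.

    Key identity for unit-length u, v:  ||u - v||^2 = 2 - 2*cos(u, v),
    hence cos(u, v) >= eps  <=>  ||u - v|| <= sqrt(2 - 2*eps).
    """
    # Normalize so that Euclidean distance encodes cosine similarity.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    qn = q / np.linalg.norm(q)
    radius = np.sqrt(max(0.0, 2.0 - 2.0 * eps))

    # Reference vector for triangle-inequality pruning (illustrative choice:
    # the first normalized data vector; any fixed vector could be used).
    if ref is None:
        ref = Xn[0]

    # Reverse triangle inequality: ||x - q|| >= | ||x - ref|| - ||q - ref|| |,
    # so any x with | ||x - ref|| - ||q - ref|| | > radius cannot qualify.
    d_x_ref = np.linalg.norm(Xn - ref, axis=1)   # typically precomputed once
    d_q_ref = np.linalg.norm(qn - ref)
    candidates = np.where(np.abs(d_x_ref - d_q_ref) <= radius)[0]

    # Exact distance check only for the surviving candidates.
    d_exact = np.linalg.norm(Xn[candidates] - qn, axis=1)
    return candidates[d_exact <= radius]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((1000, 50))   # stand-in for term-frequency-like vectors
    q = rng.random(50)
    idx = cosine_eps_neighborhood(X, q, eps=0.9)
    print(len(idx), "vectors with cosine similarity >= 0.9")
```

The pruning step never discards a true neighbor, since by the triangle inequality the lower bound |d(x, ref) − d(q, ref)| can only underestimate d(x, q); the k-nearest-neighbor variants mentioned in the abstract can reuse the same bound with an adaptively shrinking radius.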

Cite (APA)

Kryszkiewicz, M. (2013). Determining Cosine Similarity Neighborhoods by Means of the Euclidean Distance. Intelligent Systems Reference Library, 43, 323–345. https://doi.org/10.1007/978-3-642-30341-8_17
