Document nearest neighbors query based on pairwise similarity with MapReduce

0Citations
Citations of this article
3Readers
Mendeley users who have this article in their library.
Get full text

Abstract

With the continuous development of Web technology, many Internet issues evolve into Big Data problems, characterized by volume, variety, velocity and variability. Among them, how to organize plenty of web pages and retrieval information needed is a critical one. An important notion is document classification, in which nearest neighbors query is the key issue to be solved. Most parallel nearest neighbors query methods adopt Cartesian Product between training set and testing set resulting in poor time efficiency. In this paper, two methods are proposed on document nearest neighbor query based on pairwise similarity, i.e. brute-force and pre-filtering. brute-force is constituted by two phases (i.e. copying and filtering) and one map-reduce procedure is conducted. In order to obtain nearest neighbors for each document, each document pair is copied twice and all records generated are shuffled. However, time efficiency of shuffle is sensitive to the number of the intermediate results. For the purpose of intermediate results reduction, pre-filtering is proposed for nearest neighbor query based on pairwise similarity. Since only first top-k neighbors are output for each document, the size of records shuffled is kept in the same magnitude as input size in pre-filtering. Additionally, detailed theoretical analysis is provided. The performance of the algorithms is demonstrated by experiments on real world dataset.

Cite

CITATION STYLE

APA

Lv, P., Yang, P., Dong, Y. Q., & Gu, L. (2018). Document nearest neighbors query based on pairwise similarity with MapReduce. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11334 LNCS, pp. 34–45). Springer Verlag. https://doi.org/10.1007/978-3-030-05051-1_3

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free