Exact and efficient proximity graph computation

Michail Kazimianec; Nikolaus Augsten

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 6295 LNCS, pp. 289-304 (2010)

DOI: 10.1007/978-3-642-15576-5_23

Abstract

Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster, a so-called proximity graph is computed, and the cluster border is detected from it. Unfortunately, computing the proximity graph is expensive, and the state-of-the-art GPC algorithms only approximate it using a sampling technique. In this paper we propose two efficient algorithms for the exact computation of proximity graphs. The first algorithm, PG-DS, is based on a divide-skip technique for merging inverted lists; the second, PG-SM, uses a sort-merge join strategy. We experimentally evaluate our solution on large real-world datasets and show that our algorithms are faster than the sampling-based approximation algorithms, even for very small sample sizes. © 2010 Springer-Verlag.
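To make the idea of merging inverted lists concrete, the following is a minimal Python sketch of the underlying problem: finding all strings that share at least a given number of tokens with a query string. It is not PG-DS or PG-SM. It uses a plain count-based merge rather than divide-skip or a sort-merge join, and the choice of padded q-grams as tokens, together with the names `qgrams`, `build_index`, and `neighbors`, are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Return the overlapping q-grams of s, padded with '#' (assumed tokenization)."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def build_index(strings, q=2):
    """Map each distinct q-gram to the sorted list of string ids that contain it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    return index


def neighbors(query, index, threshold, q=2):
    """Ids of strings sharing at least `threshold` distinct q-grams with `query`.

    Naive merge: every inverted list of the query is scanned in full and the
    occurrences of each id are counted. Divide-skip style merging aims to
    avoid scanning the long lists completely.
    """
    counts = Counter()
    for gram in set(qgrams(query, q)):
        for sid in index.get(gram, ()):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)


strings = ["vienna", "viena", "bolzano"]
index = build_index(strings)
print(neighbors("vienna", index, 4))
```

Raising `threshold` shrinks the set of returned ids, so repeating the query over a range of thresholds is where the cost of a naive merge adds up.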

Cite

Kazimianec, M., & Augsten, N. (2010). Exact and efficient proximity graph computation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6295 LNCS, pp. 289–304). https://doi.org/10.1007/978-3-642-15576-5_23
