Scaling density-based clustering to large collections of sets

Daniel Kocher; Nikolaus Augsten; Willi Mann

Conference Proceedings

Advances in Database Technology - EDBT, Vol. 2021-March (2021), pp. 109-120

DOI: 10.5441/002/edbt.2021.11

Abstract

We study techniques for clustering large collections of sets into DBSCAN clusters. Sets are often used as a representation of complex objects to assess their similarity. The similarity of two objects is then computed based on the overlap of their set representations, for example, using Hamming distance. Clustering large collections of sets is challenging. A baseline that executes the standard DBSCAN algorithm suffers from poor performance due to the unfavorable neighborhood-by-neighborhood order in which the sets are retrieved. The DBSCAN order requires the use of a symmetric index, which is less effective than its asymmetric counterpart. Precomputing and materializing the neighborhoods to gain control over the retrieval order often turns out to be infeasible due to excessive memory requirements. We propose a new, density-based clustering algorithm that processes data points in any user-defined order and does not need to materialize neighborhoods. Instead, so-called backlinks are introduced that are sufficient to achieve a correct clustering. Backlinks require only linear space while there can be a quadratic number of neighbors. To the best of our knowledge, this is the first DBSCAN-compliant algorithm that can leverage asymmetric indexes in linear space. Our empirical evaluation suggests that our algorithm combines the best of two worlds: it achieves the runtime performance of materialization-based approaches while retaining the memory efficiency of non-materializing techniques.
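To make the setting concrete, the following is a minimal sketch of the baseline the abstract refers to: standard DBSCAN over a collection of sets, with the Hamming distance taken as the size of the symmetric difference. It is not the backlink-based algorithm proposed in the paper. The function names, the brute-force neighborhood scan, and the parameter names `eps` and `min_pts` are illustrative assumptions; a real system would replace the scan with a set-similarity index.

```python
from typing import FrozenSet, List, Sequence

NOISE = -1       # label for points that belong to no cluster
UNVISITED = -2   # internal marker for points not processed yet


def hamming(r: FrozenSet[int], s: FrozenSet[int]) -> int:
    """Hamming distance between two sets: size of their symmetric difference."""
    return len(r ^ s)


def neighborhood(sets: Sequence[FrozenSet[int]], i: int, eps: int) -> List[int]:
    """Indices of all sets within distance eps of sets[i] (including i itself).

    Brute-force scan, assumed here only for illustration.
    """
    return [j for j, s in enumerate(sets) if hamming(sets[i], s) <= eps]


def dbscan(sets: Sequence[FrozenSet[int]], eps: int, min_pts: int) -> List[int]:
    """Textbook DBSCAN; returns one cluster label per set (NOISE for outliers)."""
    labels = [UNVISITED] * len(sets)
    cluster = 0
    for i in range(len(sets)):
        if labels[i] != UNVISITED:
            continue
        neighbors = neighborhood(sets, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighborhood(sets, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels


example = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({1, 2, 3, 4}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({50}),
]
labels = dbscan(example, eps=2, min_pts=2)
```

In this sketch every neighborhood is retrieved in the order DBSCAN dictates, one expansion step at a time, which is the neighborhood-by-neighborhood retrieval order the abstract identifies as the performance bottleneck.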

Kocher, D., Augsten, N., & Mann, W. (2021). Scaling density-based clustering to large collections of sets. In Advances in Database Technology - EDBT (Vol. 2021-March, pp. 109–120). OpenProceedings.org. https://doi.org/10.5441/002/edbt.2021.11
