Indexability-Based Dataset Partitioning

Angello Hoyos; Ubaldo Ruiz; Stephane Marchand-Maillet; Edgar Chávez

Conference Proceedings

Indexability-Based Dataset Partitioning

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2019) 11807 LNCS 143-150

DOI: 10.1007/978-3-030-32047-8_13

4Citations

2Readers

Get full text

Abstract

Indexing exploits assumptions on the inner structures of a dataset to make the nearest neighbor queries cheaper to resolve. Datasets are generally indexed at once into a unique index for similarity search. By indexing a given dataset as a whole, one faces the parameters of its global structure, which may be adverse. A typical well-studied example is a high global dimensionality of the dataset, making any indexing strategy inefficient due to the curse of dimensionality. We conjecture that a dataset may be partitioned into subsets of variable indexability. The strategy is, therefore, to define a procedure to extract parts of the dataset with predictable indexability and to adapt the index structure to this parameter. In this paper, we define and discuss indexability related to the curse of dimensionality and propose a related heuristic to partition the dataset into low-dimensional parts. Each data object is ranked according to its degree centrality, under a connected sparse graph, the Half-Space Proximal Graph (HSP). We postulate centrality measures are good predictors of dimensionality and indexability. In view of validation, we conducted an experiment using the degree centrality of the HSP graph as unique dimensionality/indexability measure. We ranked the data objects by their respective centrality degree under the HSP graph, then extracted the lower dimensional subsets, recomputed the HSP and repeated. Subsets were then indexed with an exact method in increasing, decreasing and random order. We measured the complexity of a fixed set of queries for each of the three arrangements. For each set we used a fixed dataset with 250 queries. The above single experiment demonstrated that the heuristic can extract low dimensional subsets, and also that those subsets are easier to index. This initial results demonstrate the validity of our conjecture and motivate the need for exploring further the notion of indexability and related dataset partitioning strategies.

Author supplied keywords

Cite

CITATION STYLE

APA

Hoyos, A., Ruiz, U., Marchand-Maillet, S., & Chávez, E. (2019). Indexability-Based Dataset Partitioning. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11807 LNCS, pp. 143–150). Springer. https://doi.org/10.1007/978-3-030-32047-8_13

Indexability-Based Dataset Partitioning

Abstract

Author supplied keywords

Cite

Register to see more suggestions