Metric Index: An Efficient and Scalable Solution for Similarity Search

by David Novak, Michal Batko, Pavel Zezula
Information Systems 36 (2011) 721–733

Abstract

Metric space is a universal and versatile model of similarity that can be applied in various areas of information retrieval. However, a general, efficient, and scalable solution for metric data management remains an open research challenge. We introduce a novel indexing and searching mechanism called Metric Index (M-Index) that employs practically all known principles of metric space partitioning, pruning, and filtering, thus reaching high search performance while having constant building costs per object. The heart of the M-Index is a general mapping mechanism that makes it possible to store the data in established structures such as the B+-tree or even in distributed storage. We implemented the M-Index with the B+-tree and performed experiments on two datasets: the first is an artificial set of vectors and the other is a real-life dataset composed of a combination of five MPEG-7 visual descriptors extracted from a database of up to several million digital images. The experiments test several M-Index variants and compare them with established techniques for both precise and approximate similarity search. The trials show that the M-Index outperforms the others in terms of efficiency of search-space pruning, I/O costs, and response times for precise similarity queries. Further, the M-Index demonstrates an excellent ability to keep similar data close in the index, which makes its approximation algorithm very efficient: it maintains practically constant response times while preserving very high recall as the dataset grows, and even beats approaches designed purely for approximate search.

Available from ieeexplore.ieee.org
Metric Index: An efficient and scalable solution for precise and approximate similarity search

David Novak, Michal Batko, Pavel Zezula
Masaryk University, Brno, Czech Republic

Information Systems 36 (2011) 721–733
ISSN 0306-4379
doi:10.1016/j.is.2010.10.002
Available online 28 October 2010
© 2010 Elsevier B.V. All rights reserved.
Journal homepage: www.elsevier.com/locate/infosys
E-mail addresses: david.novak@fi.muni.cz (D. Novak), batko@fi.muni.cz (M. Batko), zezula@fi.muni.cz (P. Zezula)

Keywords: Metric space, Similarity search, Data structure, Approximation, Scalability

1. Introduction

There are many indexing techniques that focus on processing textual or vector data. These techniques are not always sufficient for current digital data types, or their efficiency is significantly reduced, for example because of the phenomenon referred to as the curse of dimensionality [1]. Metric space, as a very general data abstraction, can capture a wider variety of these data types. However, after more than a decade of research, the efficiency and scalability of metric access methods is still an issue.

In this paper, we introduce a novel index structure for metric data which builds upon long-term research in this area. Metric Index (M-Index) defines a universal mapping schema from a generic metric space to a numeric domain. This schema has the ability to preserve proximity of data, i.e. it maps similar metric objects to close numbers in the numeric domain. The M-Index indexing and searching mechanisms use a set of reference objects and synergistically exploit practically all known metric-based principles of data partitioning, pruning, and filtering. At the same time, having a fixed set of reference objects, the M-Index has fixed building costs in terms of the number of metric-function evaluations during the insertion of a data object.

The mapping nature of the M-Index separates its principles from the specific storage structure and thus makes it possible to use well-established techniques such as the ...
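The mapping idea described in the introduction can be illustrated with a small sketch. This is not the paper's actual key formula or algorithm: it is a simplified, one-level scheme (similar in spirit to iDistance) that assigns each object to its closest reference object and uses "pivot index times d_max plus distance" as the numeric key. The class name ToyMetricIndex, the parameter d_max (assumed to be strictly greater than any distance in the data), and the sorted list standing in for a B+-tree are all assumptions made for illustration.

```python
import bisect
import math


def euclidean(a, b):
    """Example metric; any function satisfying the metric axioms would do."""
    return math.dist(a, b)


class ToyMetricIndex:
    """Hypothetical one-level sketch, not the M-Index from the paper.

    key(o) = i * d_max + d(p_i, o), where p_i is the pivot closest to o.
    Assumes d_max is strictly greater than every distance in the data.
    """

    def __init__(self, pivots, d_max, distance=euclidean):
        self.pivots = list(pivots)
        self.d_max = d_max
        self.distance = distance
        self.keys = []     # sorted numeric keys (stand-in for a B+-tree)
        self.objects = []  # objects stored in key order

    def key(self, obj):
        # Fixed cost: exactly len(self.pivots) distance evaluations.
        dists = [self.distance(p, obj) for p in self.pivots]
        i = min(range(len(dists)), key=dists.__getitem__)
        return i * self.d_max + dists[i]

    def insert(self, obj):
        k = self.key(obj)
        pos = bisect.bisect_left(self.keys, k)
        self.keys.insert(pos, k)
        self.objects.insert(pos, obj)

    def range_query(self, q, radius):
        """Return all stored objects o with d(q, o) <= radius."""
        result = []
        for i, p in enumerate(self.pivots):
            dq = self.distance(p, q)
            # Triangle inequality: d(q, o) <= r implies
            # d(p_i, q) - r <= d(p_i, o) <= d(p_i, q) + r.
            lo = i * self.d_max + max(0.0, dq - radius)
            hi = i * self.d_max + min(self.d_max, dq + radius)
            start = bisect.bisect_left(self.keys, lo)
            end = bisect.bisect_right(self.keys, hi)
            # Never scan into the key range of the next pivot's cluster.
            end = min(end, bisect.bisect_left(self.keys, (i + 1) * self.d_max))
            for obj in self.objects[start:end]:
                if self.distance(q, obj) <= radius:  # final verification
                    result.append(obj)
        return result


if __name__ == "__main__":
    index = ToyMetricIndex([(0.0, 0.0), (1.0, 1.0)], d_max=2.0)
    for point in [(0.1, 0.1), (0.2, 0.0), (0.9, 0.9), (0.5, 0.6)]:
        index.insert(point)
    print(index.range_query((0.0, 0.0), 0.3))
```

Because objects close to the same pivot at similar distances receive nearby keys, a range query only has to scan one key interval per pivot and verify the candidates found there. Each insertion costs one distance evaluation per pivot regardless of how many objects are already stored, which mirrors the fixed building cost mentioned in the text.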

Readership Statistics

15 Readers on Mendeley

by Academic Status

  • 33% Ph.D. Student
  • 13% Researcher (at an Academic Institution)
  • 13% Associate Professor

by Country

  • 33% Czech Republic
  • 13% United States
  • 7% Italy
