
SPEC Hashing: Similarity Preserving algorithm for Entropy-based Coding

by Ruei-Sung Lin, David A. Ross, Jay Yagnik

Abstract

Searching for approximate nearest neighbors in large-scale, high-dimensional data sets has been a challenging problem. This paper presents a novel and fast algorithm for learning binary hash functions for fast nearest neighbor retrieval. The nearest neighbors are defined according to the semantic similarity between the objects. Our method uses the information of these semantic similarities and learns a hash function with binary code such that only objects with high similarity have small Hamming distance. The hash function is incrementally trained one bit at a time, and as bits are added to the hash code, Hamming distances between dissimilar objects increase. We further link our method to the idea of maximizing conditional entropy among pairs of bits and derive an extremely efficient linear-time hash learning algorithm. Experiments on similar image retrieval and celebrity face recognition show that our method produces an apparent improvement in performance over some state-of-the-art methods.


Available from www.cs.toronto.edu

Ruei-Sung Lin, David A. Ross, Jay Yagnik
Google Inc., Mountain View, CA 94043
{rslin, dross, jyagnik}@google.com

1. Introduction

With the growth of the Internet, we are inundated with an abundance of images, documents, music, videos, and other data. As the size of the data continues to grow, the density of similar objects in the data space also increases. These objects are likely to have similar semantics. As a result, inferences based on nearest neighbors can be more reliable than ever before.

In this paper, we describe a new learning-based hashing algorithm for nearest neighbor search in high-dimensional feature space. Our nearest neighbors are objects with similar semantics. The trained hash function maps objects to binary vectors such that neighboring objects have small Hamming distances between their codes, while irrelevant objects have large distances.
Therefore, we can use these binary vectors for fast semantic nearest-neighbor retrieval. Learning our hash function takes time linear in the data size. This makes our algorithm feasible for tasks with an evolving dataset, in which the hash function must be periodically updated or retrained.

Searching for nearest neighbors in sublinear time has been an ongoing research topic. Traditional methods such as the KD-tree [1] work well on data with limited feature dimensionality, but degrade to linear-time search as dimensionality grows. Recently, Locality Sensitive Hashing (LSH) [2, 3] has been successfully applied to datasets with high-dimensional features. It uses random projections to map objects from feature space to bits, and treats these bits as keys for multiple hash tables. As a result, similar samples collide in at least one hash bucket with high probability. This randomized algorithm has a tight asymptotic bound, and provides the foundation for a number of follow-up works.

Parameter sensitive hashing [10] is one such extension. It chooses a set of weak binary classifiers to generate bits for the hash keys. The classifiers are selected according to the criterion that nearby objects are more likely to have the same class label than more distant objects. Ke et al. [4] adopt a similar idea, and formulate the learning problem within the boosting framework. A major drawback of this type of approach is that it requires evaluation on object pairs, whose number is quadratic in the number of objects. Hence, its scalability to larger datasets is limited.

Salakhutdinov et al. [9] use restricted Boltzmann machines (RBM) to learn the hash function, and show that the learned hash codes preserve semantic similarity in Hamming space. This approach is then applied to the task of similarity search in millions of images [11]. Training an RBM is computationally intensive, which makes it very costly to retrain the hash function when the data evolve.
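
To make the LSH scheme discussed earlier in this section concrete, the following is a minimal sketch of hashing with random projections and multiple hash tables. It is not the method of [2, 3] or of this paper: it assumes the sign-of-random-projection variant (each bit records which side of a random hyperplane a vector falls on), and all function names (`make_hyperplanes`, `hash_bits`, `build_tables`, `query`, `hamming`) are hypothetical names chosen for illustration.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian direction per hash bit (assumed variant)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_bits(x, hyperplanes):
    """Pack the binary code of x into an int.

    Bit i is 1 when the projection of x onto hyperplane i is non-negative.
    """
    code = 0
    for i, w in enumerate(hyperplanes):
        if sum(wi * xi for wi, xi in zip(w, x)) >= 0.0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Hamming distance between two codes packed as ints."""
    return bin(a ^ b).count("1")


def build_tables(points, table_planes):
    """Build one hash table per hyperplane set; the bit code is the bucket key."""
    tables = []
    for planes in table_planes:
        table = {}
        for idx, x in enumerate(points):
            table.setdefault(hash_bits(x, planes), []).append(idx)
        tables.append(table)
    return tables


def query(x, tables, table_planes):
    """Return indices of points that collide with x in at least one table."""
    candidates = set()
    for table, planes in zip(tables, table_planes):
        candidates.update(table.get(hash_bits(x, planes), []))
    return candidates


# Toy usage: three 2-D points, three tables of 4 bits each.
points = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
table_planes = [make_hyperplanes(4, 2, seed=s) for s in range(3)]
tables = build_tables(points, table_planes)
candidates = query([1.0, 0.0], tables, table_planes)
```

Using several tables trades memory for recall: a true neighbor only needs to share a bucket with the query in one table to be returned as a candidate, and candidates can then be re-ranked by `hamming` distance or by the original feature distance.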
Spectral Hashing [14] takes a completely different approach to generating hash codes. It first rotates the feature space to statistically orthogonal axes using PCA. Then, a special basis function is applied to carve each axis independently to generate hash bits. As a result, the bits in the hash code are all independent, which leads to a compact representation with short code length. Experiments in [14] show that it outperforms RBM and the boosting approach. Spectral Hashing is developed on the assumption that objects are spread in a Euclidean
