Bag-of-words image representation: Key ideas and further insight

Abstract

In the context of object and scene recognition, state-of-the-art performance is obtained with visual Bag-of-Words (BoW) models, which build mid-level representations from densely sampled local descriptors (e.g., Scale-Invariant Feature Transform (SIFT)). Several methods for combining low-level features and setting mid-level parameters have recently been evaluated for image classification. In this chapter, we study in detail the different components of the BoW model in the context of image classification. In particular, we focus on the coding and pooling steps and investigate the impact of the main parameters of the BoW pipeline. We show that an adequate combination of several low-level (sampling rate, multiscale) and mid-level (codebook size, normalization) parameters is decisive for reaching good performance. Based on this analysis, we propose a merging scheme that exploits the specificities of edge-based descriptors: low-contrast and high-contrast regions are pooled separately, then combined to provide a powerful representation of images. We study how classification performance depends on the contrast threshold that determines whether a SIFT descriptor corresponds to a low-contrast or a high-contrast region. Successful experiments are reported on the Caltech-101 and Scene-15 datasets.
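
The coding and pooling steps, and the idea of pooling low-contrast and high-contrast regions separately, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the coding scheme, the pooling operator, or how contrast is measured. The sketch assumes hard-assignment coding, sum or max pooling, L2 normalization, and uses the L2 norm of the unnormalized descriptor as a stand-in contrast measure. All function names and the `contrast_threshold` parameter are hypothetical.

```python
import numpy as np


def hard_assign(descriptors, codebook):
    """Coding step: map each descriptor to a one-hot code over the codebook.

    descriptors: (N, D) array of local descriptors.
    codebook:    (K, D) array of visual words.
    Returns an (N, K) array with a single 1 per row (nearest visual word).
    """
    # Squared Euclidean distance between every descriptor and every word.
    d2 = (
        (descriptors ** 2).sum(axis=1, keepdims=True)
        - 2.0 * descriptors @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    nearest = d2.argmin(axis=1)
    codes = np.zeros((descriptors.shape[0], codebook.shape[0]))
    codes[np.arange(descriptors.shape[0]), nearest] = 1.0
    return codes


def pool(codes, mode="sum"):
    """Pooling step: aggregate the codes of a region into one K-dim vector."""
    if codes.shape[0] == 0:
        # An empty region contributes an all-zero histogram.
        return np.zeros(codes.shape[1])
    return codes.max(axis=0) if mode == "max" else codes.sum(axis=0)


def l2_normalize(v, eps=1e-12):
    """Normalization step; eps guards against division by zero."""
    return v / max(np.linalg.norm(v), eps)


def contrast_split_bow(descriptors, codebook, contrast_threshold, mode="sum"):
    """Pool low- and high-contrast descriptors separately, then concatenate.

    ASSUMPTION: contrast is approximated by the L2 norm of the raw
    (unnormalized) descriptor; the source does not define the measure.
    Returns a 2K-dim vector: [low-contrast histogram, high-contrast histogram].
    """
    codes = hard_assign(descriptors, codebook)
    contrast = np.linalg.norm(descriptors, axis=1)
    low = contrast < contrast_threshold
    h_low = l2_normalize(pool(codes[low], mode))
    h_high = l2_normalize(pool(codes[~low], mode))
    return np.concatenate([h_low, h_high])


# Toy usage: a 2-word codebook in 2-D and four descriptors.
codebook = np.array([[1.0, 0.0], [0.0, 1.0]])
descriptors = np.array([[0.1, 0.0], [0.0, 0.2], [5.0, 0.0], [4.0, 0.1]])
signature = contrast_split_bow(descriptors, codebook, contrast_threshold=1.0)
```

In this sketch the final signature has twice the codebook size, so the contrast threshold trades off how many descriptors fall into each half. A real pipeline would learn the codebook (for example with k-means) on training descriptors and feed the signatures to a classifier.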

Cite

Law, M. T., Thome, N., & Cord, M. (2014). Bag-of-words image representation: Key ideas and further insight. In Advances in Computer Vision and Pattern Recognition (Vol. 64, pp. 29–52). Springer-Verlag London Ltd. https://doi.org/10.1007/978-3-319-05696-8_2
