Spatial Latent Dirichlet Allocation
Abstract
In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a "bag of words". It is also critical to properly design "words" and "documents" when using a language model to solve vision problems. In this paper, we propose a topic model, Spatial Latent Dirichlet Allocation (SLDA), which better encodes the spatial structure among visual words that is essential for solving many vision problems. The spatial information is not encoded in the value of visual words but in the design of documents. Instead of the partition of words into documents being known a priori, the word-document assignment becomes a random hidden variable in SLDA. There is a generative procedure, where knowledge of spatial structure can be flexibly added as a prior, grouping visual words which are close in space into the same document. We use SLDA to discover objects from a collection of images, and show that it achieves better performance than LDA.
Xiaogang Wang and Eric Grimson
Computer Science and Artificial Intelligence Lab
Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
xgwang@csail.mit.edu, welg@csail.mit.edu
1 Introduction
Latent Dirichlet Allocation (LDA) [1] is a language model which clusters co-occurring words into
topics. In recent years, LDA has been widely used to solve computer vision problems. For example,
LDA was used to discover objects from a collection of images [2, 3] and to classify images into
different scene categories [4]. [5] employed LDA to classify human actions. In visual surveillance,
LDA was used to model atomic activities and interactions happening in a crowded and busy scene
[6]. In these applications, LDA clustered low-level visual words (which were image patches, spatial
and temporal interest points or moving pixels) into topics with semantic meanings (which corresponded
to objects, parts of objects, human actions or atomic activities) utilizing their co-occurrence
information.
Even with these promising achievements, however, directly borrowing a language model to solve
vision problems has some difficulties. First, LDA assumes that a document is a bag of words,
so that the spatial and temporal structure among visual words, which is meaningless in a language
model but important in many computer vision problems, is ignored. Second, users need to define
the meaning of “documents” in vision problems. The design of documents often implies some
assumptions on vision problems. For example, in order to cluster image patches, which are treated
as words, into classes of objects, researchers treated images as documents [2]. This assumes that
if two types of patches are from the same object class, they often appear in the same images. This
assumption is reasonable, but not strong enough. As the example in Figure 1 shows, even though
the sky is far from the vehicles, if they often appear in the same images in the data set, they would be
clustered into the same topic by LDA. Furthermore, since in this image most of the patches are sky
and building, a patch on a vehicle is likely to be labeled as building or sky as well. These problems
could be solved if the document of a patch, such as the yellow patch in Figure 1, only includes other
patches falling within its neighborhood, marked by the red dashed window in Figure 1, instead of
the whole image. So a better assumption is that if two types of image patches are from the same
object class, they are not only often in the same images but also close in space. We expect to utilize
spatial information in a flexible way when designing documents for solving vision problems.
[Figure 1. Caption only partly recoverable: "... when using LDA to discover classes of objects."]
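The neighborhood-based document described above can be illustrated with a minimal sketch. It assumes that visual words have already been computed on a regular grid of patches, and it uses a fixed square window, which is a simplification: in SLDA the word-document assignment is a hidden random variable, not a hard window. The function name neighborhood_documents and the parameter half_window are hypothetical and are introduced only for illustration.

```python
import numpy as np

def neighborhood_documents(word_grid, half_window):
    """Build one document per patch from the visual words in its neighborhood.

    word_grid   : 2-D integer array, word_grid[r, c] is the visual word of the
                  patch at grid position (r, c).
    half_window : hypothetical parameter; the window spans half_window grid
                  cells on each side of the patch (clipped at image borders).
    Returns a dict mapping (r, c) to a 1-D array of visual words.
    """
    rows, cols = word_grid.shape
    docs = {}
    for r in range(rows):
        for c in range(cols):
            # Clip the window so that it stays inside the grid.
            r0, r1 = max(0, r - half_window), min(rows, r + half_window + 1)
            c0, c1 = max(0, c - half_window), min(cols, c + half_window + 1)
            docs[(r, c)] = word_grid[r0:r1, c0:c1].ravel()
    return docs

# Toy 3 x 4 grid of visual words drawn from a codebook of size 5.
word_grid = np.arange(12).reshape(3, 4) % 5
docs = neighborhood_documents(word_grid, half_window=1)
```

With image-level documents, every patch of this toy grid would share a single document of 12 words. Here a corner patch sees only its 4 nearest grid cells and an interior patch sees 9, so co-occurrence counts reflect spatial proximity as well as presence in the same image.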
In this paper, we propose a Spatial Latent Dirichlet Allocation (SLDA) model which encodes the
spatial structure among visual words. It clusters visual words (e.g. an eye patch and a nose patch),
which often occur in the same images and are close in space, into one topic (e.g. face). This is a more
appropriate assumption for solving many vision problems when images often contain several objects. It
is also easy for SLDA to model activities and human actions by encoding temporal information.
However, the spatial or temporal information is not encoded in the value of visual words, but in the
design of documents. LDA and its extensions, such as the author-topic model [7], the dynamic
topic model [8], and the correlated topic model [9], all assume that the partition of words into
documents is known a priori. A key difference of SLDA is that the word-document assignment
becomes a hidden random variable. There is a generative procedure to assign words to documents.
When visual words are close in space or time, they have a high probability of being grouped into the
same document. Some approaches such as [10] could also capture the spatial structure among visual
words. This approach assumed that the spatial distribution of an object class could be modeled as
a Gaussian and required the number of objects in the image to be known. However, usually we do not
know the number of objects in the images. It is also hard to model the spatial distributions of many
objects, such as sky, trees, grass, crowds of pedestrians, or a flock of cows. In our model, there are
no such assumptions. Objects could be anywhere in the image and in arbitrary shape. The spatial
information is encoded when generating documents but not in the models of object classes.
As an example application, we use the SLDA model to discover objects from a collection of images.
As shown in Figure 2, there are different classes of objects, such as cows, cars, faces, grass, sky,
bicycles, etc., in the image set. An image usually contains several objects of different classes.
The goal is to segment objects from images, and at the same time, to label these segments as different
object classes in an unsupervised way. This integrates object segmentation and recognition. In our
approach images are divided into local patches. A local descriptor is computed for each image patch
and quantized into a visual word. Using topic models, the visual words are clustered into topics
which correspond to object classes. Thus an image patch can be labeled as one of the object classes.
Our work is related to [2] which used LDA to cluster image patches. As shown in Figure 2, SLDA
achieves much better performance than LDA. We will compare more results of LDA and SLDA in
the experimental section.
2 Computation of Visual Words
To obtain the local descriptors, images are convolved with the filter bank proposed in [11], which is
a combination of 3 Gaussians, 4 Laplacians of Gaussians, and 4 first-order derivatives of Gaussians,
and was shown to have good performance for object categorization. Instead of only computing
visual words at interest points as in [2], we divide an image into local patches on a grid and densely
sample a local descriptor for each patch. A codebook of size W is created by clustering all the
local descriptors in the image set using K-means. Each local patch is quantized into a visual word
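
The codebook construction and quantization steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the filter-bank responses are replaced by random descriptors, the descriptor dimension of 11 (one value per filter) is an assumption, and a plain Lloyd's K-means written with NumPy stands in for whatever K-means implementation was actually used. The helper names quantize and build_codebook are hypothetical.

```python
import numpy as np

def quantize(descriptors, codebook):
    """Map each descriptor to the index of its nearest codebook center."""
    # Squared Euclidean distance between every descriptor and every center.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)

def build_codebook(descriptors, W, n_iter=20, seed=0):
    """Cluster descriptors into W centers with plain Lloyd's K-means."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with W distinct descriptors.
    init = rng.choice(len(descriptors), size=W, replace=False)
    codebook = descriptors[init].astype(float)
    for _ in range(n_iter):
        labels = quantize(descriptors, codebook)
        for j in range(W):
            members = descriptors[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                codebook[j] = members.mean(axis=0)
    return codebook

# Stand-in data: 200 patches with 11-dimensional descriptors (assumed size).
rng = np.random.default_rng(1)
descriptors = rng.normal(size=(200, 11))

W = 8  # codebook size, chosen arbitrarily for this toy example
codebook = build_codebook(descriptors, W)
visual_words = quantize(descriptors, codebook)  # one word index per patch
```

Each patch is then represented only by its integer word index in the range 0 to W-1, which is the input that a topic model such as LDA or SLDA operates on.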