Recognizing and Localizing Individual Activities through Graph Matching
2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance (2010)
- ISBN: 9781424483105
- DOI: 10.1109/AVSS.2010.81
Anh-Phuong Ta, Christian Wolf, Guillaume Lavoué, Atilla Baskurt
Université de Lyon, CNRS
INSA-Lyon, LIRIS, UMR5205, F-69621, France
{anh-phuong.ta,christian.wolf,glavoue,atilla.baskurt}@liris.cnrs.fr
Abstract
In this paper we tackle the problem of detecting individual human actions in video sequences. The most successful methods are based on local features, which have proved able to deal with changes in background, scale and illumination. However, most existing methods have two main shortcomings. First, they rely mainly on the individual power of spatio-temporal interest points (STIP) and therefore ignore the spatio-temporal relationships between them. Second, they focus mainly on direct classification of human activities, as opposed to detection and localization. To overcome these limitations, we propose a new approach based on a graph matching algorithm for activity recognition. In contrast to most previous methods, which classify entire video sequences, we design a video matching method that operates on two sets of ST-points. First, points are extracted and hypergraphs are constructed from them, i.e. graphs with edges involving more than 2 nodes (3 in our case). The activity recognition problem is then transformed into a problem of finding instances of model graphs in the scene graph. By matching local features instead of classifying entire sequences, our method is able to detect multiple different activities which occur simultaneously in a video sequence. Experiments on two standard datasets demonstrate that our method is comparable to existing techniques on classification, and that it can additionally detect and localize activities.
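To make the notion of a hypergraph with three-node edges concrete, the following minimal sketch builds one from a set of space-time points. The rule used to choose triplets (each point joined with pairs of its k nearest neighbours in (x, y, t)) and the function name build_triplet_hypergraph are assumptions made for illustration only; the abstract does not state how hyperedges are selected.

```python
from itertools import combinations
from math import dist


def build_triplet_hypergraph(points, k=2):
    """Build hyperedges of exactly 3 nodes from (x, y, t) points.

    Hypothetical proximity rule: every point forms a hyperedge with
    each pair taken from its k nearest neighbours.
    Returns a set of frozensets of node indices.
    """
    edges = set()
    for i, p in enumerate(points):
        # Indices of the k other points closest to p in (x, y, t).
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        # Each pair of neighbours plus the point itself is one hyperedge.
        for a, b in combinations(neighbours, 2):
            edges.add(frozenset((i, a, b)))
    return edges


# Toy input: three nearby points and one distant point.
st_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10)]
hyperedges = build_triplet_hypergraph(st_points, k=2)
```

Storing each hyperedge as a frozenset makes it unordered and removes duplicates that arise when the same triplet is reached from different nodes.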
1. Introduction
Human action recognition has been an active research area in recent years due to its wide range of applications, which include video surveillance as well as annotation and retrieval, human-computer interaction, etc. Building a robust activity recognition system remains a very challenging task because of variations in action classes, different possible viewpoints, illumination changes, moving cameras, complex dynamic backgrounds and occlusions.
Based on the features used for recognition, existing action recognition methods can be broadly divided into two categories: local approaches [4, 14, 16, 18] and holistic approaches [12, 23, 22]. Some methods do not fall neatly into either category, e.g. Sun et al. [20] combine local and holistic features. Most holistic approaches rely on pre-processing of the input data, such as background subtraction or tracking. Local approaches overcome some of these limitations by exploiting robust descriptors extracted from interest points. Most of them are based on bag-of-words (BoW) models, which have been very successful for text analysis, information retrieval and image classification. Inspired by this, a number of works have shown very good results for human action recognition [4, 14, 16]. However, they discard the spatio-temporal layout of the local features, which may be almost as important as the features themselves.
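The loss of layout in a BoW representation can be seen in a small sketch. Here each local feature is assumed to be a pair of a codeword index and an (x, y, t) position; the function name bow_histogram and the toy clips are hypothetical and do not come from any of the cited methods.

```python
from collections import Counter


def bow_histogram(features):
    """Count codeword occurrences; the (x, y, t) positions are dropped."""
    return Counter(word for word, _position in features)


# Two toy clips with the same codewords at entirely different locations.
clip_a = [(0, (10, 20, 1)), (1, (12, 22, 2)), (0, (50, 60, 9))]
clip_b = [(0, (50, 60, 9)), (1, (99, 5, 30)), (0, (1, 1, 1))]

# Both clips map to the same histogram, so a BoW classifier
# cannot tell them apart.
same_histogram = bow_histogram(clip_a) == bow_histogram(clip_b)
```

Any rearrangement of the points in space and time leaves the histogram unchanged, which is the information a matching approach on point sets can retain.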
To overcome the limitations of BoW models, efforts have been made to exploit information from the spatial and temporal distribution of interest points [13, 26]. These extensions, however, still suffer from some of the inherent problems of classification: they do not allow activities to be localized, and they require selecting the optimal number of codewords for codebook formation as well as fine-tuning of parameters.
In response, matching techniques have recently been introduced, e.g. [9, 19, 15]. Shechtman and Irani [19] define a motion consistency measure to match space-time volumes directly. However, the distance between a pair of videos is computed by exhaustively comparing patches extracted from every space-time point. Ke et al. [9] combine a part-based shape and flow matching framework from [19] for event detection in crowded videos. Recently, Ryoo and Aggarwal [15] have presented a histogram-based match kernel for video matching. Among the methods mentioned above, our approach is most closely related to the work of Ryoo and Aggarwal [15], who perform video matching from two sets of ST-points. Our method differs from their work in two main points. First, the authors in [15] define


