A Neural-Based Deep Model for Human Action Recognition

by Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, Atilla Baskurt
International Workshop on Human Behavior Understanding (2011)

Moez Baccouche (1), Franck Mamalet (1), Christian Wolf (2), and Christophe Garcia (2)
(1) Orange Labs, 4 rue du Clos Courtel, 35510 Cesson-Sévigné, France.
{firstname.surname}@orange-ftgroup.com
(2) LIRIS, UMR 5205 CNRS, INSA-Lyon, F-69621, France.
{firstname.surname}@liris.cnrs.fr
Abstract.
Keywords: Human action recognition, deep models, 3D convolutional neural
networks, long short-term memory, KTH human actions dataset.
1 Introduction and Related Work
Automatic understanding of human behaviour and its interaction with the
environment has been an active research area in recent years due to its
applications in a variety of domains. To achieve such a challenging task,
several research fields are involved to model human behaviour under its
multiple facets (emotions, relational attitudes, actions, etc.). In this
context, recognizing "what a person is doing" appears to be a crucial part of
interpreting complex behavioural patterns. Thus, great interest has been
devoted to human action recognition, especially in real-world environments.
Among the most popular state-of-the-art methods for human action recognition,
we can mention those proposed by Laptev et al. [13] and Dollár et al. [3].
Despite some differences, the common point of these methods is that they use
engineered motion and texture descriptors extracted around spatio-temporal
interest points. Despite their high performance, these approaches suffer from
their reliance on hand-crafted features, which makes them highly
problem-dependent.
In contrast with this dominant methodology, there has been a growing interest
in approaches that can learn multiple layers of feature hierarchies and
automatically build increasingly high-level representations of the raw input.
These deep models are thereby more generic since the feature construction
process is fully automated. One of the most used deep models is the
Convolutional Neural Network architecture [14], hereafter ConvNets, which is a
bio-inspired hierarchical multilayered neural network able to learn visual
patterns directly from the image pixels without any pre-processing step.
However, even if ConvNets were shown to yield competitive performance in many
image processing tasks (sometimes outperforming all existing methods), their
extension to the video case is still an open issue, and the few attempts either
make no use of the motion information [17], or operate on hand-crafted inputs
(spatio-temporal outer boundaries volume in [11], hand-wired combination of
multiple input channels in [10]). In addition, since these models take as input
a small number of consecutive frames (typically fewer than 15), they are
trained to assign a vector of features (and a label) to short sub-sequences and
not to the entire sequence. The final label is generally obtained by voting.
Thus, an important part of the temporal information contained in the sequence
is not exploited for classification. Indeed, even if the learned features,
taken individually, contain temporal information, their evolution over time is
completely ignored. Yet we have shown in our previous work [1] that such
information has powerful discriminative abilities between actions, and is
particularly usable by a category of learning machines adapted to sequential
data, namely Long Short-Term Memory recurrent neural networks (LSTM) [6].
In this paper, we propose a two-step neural-based deep model for human action
recognition. The first part of the model, based on the extension of ConvNets to
the 3D case, learns to extract spatio-temporal features rather than relying on
hand-crafted ones. The second step consists in using these learned features to
train a recurrent neural network model to classify the entire sequence. We
evaluate the performance on the KTH dataset [19], taking particular care to
follow the evaluation protocol recommendations in [4]. We show that, without
using the LSTM classifier, we obtain results comparable with other
deep-model-based approaches [9, 21], and especially with the 3D-ConvNets
approach proposed in [10], using a 15 times smaller model. We also show that
the introduction of the LSTM classification leads to a significant performance
improvement, reaching average accuracies among the best of related work.
The rest of the paper is organized as follows. Section 2 outlines some ConvNets
fundamentals and the feature learning process. Then, we present in Section 3
the recurrent neural sequence labelling scheme used. Finally, experimental
results, carried out on the KTH dataset [19], are presented in Section 4.
2 Deep Learning of Spatio-Temporal Features
In this section, we describe the first part of our neural recognition scheme.
Based on the extension of convolutional neural networks (ConvNets) to 3D, the
proposed neural architecture learns to extract spatio-temporal features which
are "optimal" for the targeted recognition task. We first present some
fundamentals of deep models and 2D-ConvNets, and then discuss their extension
to 3D and describe our proposed architecture.
2.1 Convolutional Neural Networks (ConvNets)
Despite their generic nature, deep models were not used in many applications
until the late nineties because of their inability to treat "real world" data.
Indeed, early deep architectures dealt only with 1D data or small 2D patches.
The main problem was that the input was "fully connected" to the model, and
thus the number of free parameters was directly related to the input dimension,
making these approaches inappropriate for handling "pictorial" inputs (natural
images, videos...).
Therefore, the convolutional architecture was introduced by LeCun et al. [14]
to alleviate this problem. In some way, ConvNets are the adaptation of
multilayered deep neural architectures to real-world data. This is done by the
use of local receptive fields which sweep the input, and whose parameters are
forced to be identical for all possible locations, a principle called weight
sharing. Schematically, LeCun's ConvNet architecture [14] is a succession of
layers alternating 2D convolutions (to capture salient information) and
sub-samplings (to reduce dimension), both with trainable weights.
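To make the effect of weight sharing concrete, the following minimal Python sketch compares the number of free parameters of a fully connected layer with that of a convolutional layer producing the same output. The sizes are those of the input and of layer C1 given in Section 2.2. The function names and the choice of one bias per unit (or per feature map) are illustrative assumptions, not details taken from the paper.

```python
from math import prod


def fully_connected_params(input_shape, n_units):
    """One weight per (input value, unit) pair, plus one bias per unit (assumed)."""
    return n_units * (prod(input_shape) + 1)


def shared_conv_params(kernel_shape, n_maps):
    """One kernel shared by all positions of a map, plus one bias per map (assumed)."""
    return n_maps * (prod(kernel_shape) + 1)


input_shape = (34, 54, 9)    # 34 x 54 pixels, 9 frames
map_shape = (28, 48, 5)      # size of one C1 feature map
n_maps = 7                   # number of C1 feature maps
kernel_shape = (7, 7, 5)     # 3D receptive field of a C1 unit

n_units = n_maps * prod(map_shape)
dense = fully_connected_params(input_shape, n_units)
shared = shared_conv_params(kernel_shape, n_maps)

print(f"fully connected: {dense:,} parameters")
print(f"weight sharing:  {shared:,} parameters")
```

Under these assumptions the shared-kernel layer needs 1,722 parameters, against several hundred million for a fully connected layer with the same number of output units.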
In the next sub-section, we examine the adaptation of ConvNets to video
processing, and describe the 3D-ConvNets architecture that we used in our
experiments on the KTH dataset [19].
2.2 Automated Space-time Feature Construction with 3D-ConvNets
As mentioned above, there have been some attempts to adapt ConvNets to video
processing [11, 10]. The extension from 2D to 3D in terms of architecture is
straightforward since 2D convolutions are simply replaced by 3D ones, to handle
video inputs. Our proposed architecture also uses 3D convolutions, but differs
from [11] and [10] in that it uses only raw inputs. We have tested a variety of
3D-ConvNets architectures and retained the one illustrated in Figure 1.
Fig. 1. The 3D-ConvNet architecture used for spatio-temporal feature
construction on the KTH dataset [19]. Architectural details are given in the
text.
This architecture consists of 10 layers including the input one. There are two
alternating convolutional, rectification and sub-sampling layers C1, R1, S1 and
C2, R2, S2, followed by a third convolution layer C3 and two neuron layers N1
and N2. The size of the 3D input layer is 34 × 54 × 9, corresponding to 9
successive frames of 34 × 54 pixels each. Layer C1 is composed of 7 feature
maps of size 28 × 48 × 5 pixels. Each unit in each feature map is connected to
a 3D 7 × 7 × 5 neighborhood in the input retina. Layer R1 is composed of 7
feature maps, each connected to one feature map in C1, and simply applies the
absolute value to its input. Rectification layers are combined with local
contrast normalization of the inputs, which was shown to significantly improve
performance in object recognition tasks [8]. Layer S1 is composed of 7 feature
maps of size 14 × 24 × 5, each connected to one feature map in R1. S1 performs
sub-sampling by a factor of 2 in the spatial domain, aiming to build robustness
to small spatial distortions. The connection scheme between layers S1 and C2
follows the same principle as described in [5]. Consequently, layer C2 contains
35 feature maps performing 5 × 5 × 3 convolutions. Layers R2 and S2 follow the
same principle as described above for R1 and S1. Finally, layer C3 consists of
5 feature maps fully connected to S2 and performing 3 × 3 × 3 convolutions. At
this stage, each C3 feature map contains 3 × 8 × 1 values, and thus the input
information is encoded in a vector of size 120 (5 maps of 24 values each). This
vector can be interpreted as a descriptor of the salient spatio-temporal
information extracted from the 34 × 54 × 9 input, and will be used in Section 3
to classify the action sequences. Finally, layers N1 and N2 contain classical
neural units. The output layer consists of the same number of units as the
number of actions. This architecture corresponds to a total of 17,169 trainable
parameters (which is about 15 times fewer than the architecture used in [10]).
To train this model, we used the algorithm proposed in [14], which is the
standard online backpropagation with momentum algorithm, adapted to weight
sharing.
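The layer sizes quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: convolutions are taken to be unpadded with stride 1, S2 is assumed to halve the two spatial axes like S1, and the count of 35 maps in C2 assumes that the scheme of [5] gives two C2 maps per S1 map plus one per pair of S1 maps. The sizes of C2 and S2 are derived here, not quoted from the paper, and the helper names are hypothetical.

```python
from math import comb


def conv_out(shape, kernel):
    """Output size of an unpadded, stride-1 3D convolution (assumed setting)."""
    return tuple(s - k + 1 for s, k in zip(shape, kernel))


def subsample_out(shape, factor=2):
    """Spatial sub-sampling: both spatial axes are divided, time is kept."""
    a, b, t = shape
    return (a // factor, b // factor, t)


shapes = {"input": (34, 54, 9)}
shapes["C1"] = conv_out(shapes["input"], (7, 7, 5))
shapes["S1"] = subsample_out(shapes["C1"])       # R1 keeps the size of C1
shapes["C2"] = conv_out(shapes["S1"], (5, 5, 3))
shapes["S2"] = subsample_out(shapes["C2"])       # R2 keeps the size of C2
shapes["C3"] = conv_out(shapes["S2"], (3, 3, 3))

# Assumed reading of the S1 -> C2 connection scheme of [5]
n_s1_maps = 7
n_c2_maps = 2 * n_s1_maps + comb(n_s1_maps, 2)

n_c3_maps = 5
a, b, t = shapes["C3"]
descriptor_size = n_c3_maps * a * b * t

for name, shape in shapes.items():
    print(name, shape)
print("C2 feature maps:", n_c2_maps)
print("descriptor size:", descriptor_size)
```

With these assumptions the computed sizes of C1, S1 and C3 match the ones given in the text, and the descriptor has 120 components.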
Fig. 2. A subset of 3 automatically constructed C1 feature maps (of 7 total),
each one corresponding, from left to right, to the actions walking, boxing,
hand-clapping and hand-waving from the KTH dataset [19].
Once the 3D-ConvNet is trained on KTH actions, and since the spatio-temporal
feature construction process is fully automated, it is interesting to examine
whether the learned features are visually interpretable. We report in Figure 2
a subset of learned C1 feature maps, each corresponding to some actions from
the KTH dataset [19]. From left to right, the first feature map seems to
segment the person's silhouette from the background. The second one appears to
capture the motion of the limbs that are engaged when performing the action:
arms for boxing and hand-waving, hands for hand-clapping and legs for walking.
The last feature map is likely to encode edge content. Other (not illustrated)
feature maps bear some resemblance to motion history images. In any case, even
if finding a direct link between automatically learned features and engineered
ones is not straightforward, the learned feature maps are visually relevant.
In the next section, we describe how these features are used to feed a
recurrent neural network classifier, which is trained to recognize the actions
based on the temporal evolution of features. However, it is interesting to note
that the proposed 3D-ConvNets architecture, combined with majority voting on
the short sub-sequences, performs comparably to state-of-the-art deep models
(cf. Section 4).
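As an illustration of the majority-voting baseline mentioned above, the following Python sketch assigns to a whole sequence the label predicted most often on its sub-sequences. The function name, the tie-breaking rule and the example decisions are assumptions made for this sketch. The paper does not specify how ties are handled.

```python
from collections import Counter


def majority_vote(sub_sequence_labels):
    """Return the most frequent label among the per-sub-sequence decisions.

    Ties are broken in favour of the label seen first, an arbitrary
    choice made for this sketch.
    """
    if not sub_sequence_labels:
        raise ValueError("at least one sub-sequence label is required")
    return Counter(sub_sequence_labels).most_common(1)[0][0]


# Hypothetical decisions of the 3D-ConvNet on consecutive sub-sequences
decisions = ["walking", "jogging", "walking", "walking", "running", "walking"]
print(majority_vote(decisions))
```

Note that the order in which the decisions occur plays no role in the result, which is exactly the temporal information that the second step of the model is meant to exploit.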
3 Sequence Labelling Considering the Temporal
Evolution of Learned Features
Fig. 3. An overview of our two-step neural recognition scheme.
Once the features are automatically constructed with the 3D-ConvNet
architecture as described in Section 2, the next step is to use them to
recognize the actions. The idea is to learn to label the entire sequence based
on the accumulation of several individual decisions, each corresponding to a
small temporal neighbourhood which was involved during the 3D-ConvNets learning
process (see Figure 3). This makes it possible to take advantage of the
temporal evolution of the features, in contrast with the majority voting
process on the individual decisions.
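The decomposition of a video into small temporal neighbourhoods can be sketched as follows. Each window would be passed through the 3D-ConvNet to yield one descriptor, and the ordered list of descriptors forms the input sequence of the classifier. The window length of 9 frames comes from Section 2.2. The stride, the function name and the example video length are assumptions, since the overlap between sub-sequences is not stated in the text.

```python
def sub_sequences(n_frames, length=9, stride=1):
    """Return (start, end) frame index pairs of the windows covering a video.

    `length` is the 9-frame input depth of the 3D-ConvNet; `stride` is an
    assumption of this sketch.
    """
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    return [(s, s + length) for s in range(0, n_frames - length + 1, stride)]


# Hypothetical 25-frame video (one second at 25 frames per second)
windows = sub_sequences(n_frames=25, length=9, stride=4)
print(windows)
print(len(windows), "time steps for the sequence classifier")
```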
Among state-of-the-art learning machines, Recurrent Neural Networks (RNN) are
one of the most used for temporal analysis of data, because of their ability to
consider the context. This can be done by the use of recurrent connections in
the hidden layers. Nevertheless, even if they are able to learn tasks which
involve short time lags between inputs and corresponding teacher signals, this
short-term memory becomes insufficient when dealing with "real world" sequence
processing, e.g. video sequences. In order to alleviate this problem,
Schmidhuber et al. [6] proposed a particular recurrent architecture, namely
Long Short-Term Memory (LSTM). Its first key idea is the addition of a special
node, the constant error carousel (CEC), which allows for constant error signal
propagation through time. The second key idea is the use of multiplicative
gates to control the access to the CEC. LSTM networks have been tested in many
applications and generally outperformed existing methods. Furthermore, we have
shown in our previous work [1] that LSTM-based recurrent neural networks are
especially efficient at labelling sequences of descriptors corresponding to
hand-crafted features.
In our experiments, we used a recurrent neural network architecture with one
hidden layer of LSTM cells. The input layer has the same size as the output of
the 3D-ConvNet's layer C3, i.e. a vector of size 120 per timestep. LSTM cells
are fully connected to the rest of the network, and are auto-recurrent. We
tested several network configurations, varying the number of hidden LSTM cells,
and verified that a large number of cells leads to overfitting, whereas too
small a number leads to divergence. Thus, a configuration of 50 LSTM cells was
found to be a good compromise for this classification task. This architecture
corresponds to about 25,000 trainable parameters. The network was trained with
online backpropagation through time with momentum [6].
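To illustrate how gated cells of this kind process a sequence of 120-dimensional descriptors, here is a minimal forward-pass sketch in plain Python. It uses the common input/forget/output gate formulation without peephole connections, random untrained weights and made-up inputs. All of these are assumptions of the sketch, and it is not claimed to match the exact LSTM variant or the parameter count used in the experiments.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(weights, values):
    return sum(w * v for w, v in zip(weights, values))


def init_lstm(n_in, n_cells, seed=0):
    """Small random weights; each row acts on [inputs, previous outputs, 1]."""
    rng = random.Random(seed)
    width = n_in + n_cells + 1
    return {gate: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(n_cells)]
            for gate in ("input", "forget", "output", "candidate")}


def lstm_step(params, x, h_prev, c_prev):
    """One time step: gates control what enters, stays in and leaves each cell."""
    z = list(x) + list(h_prev) + [1.0]            # recurrent connections + bias
    h, c = [], []
    for j, c_old in enumerate(c_prev):
        i_gate = sigmoid(dot(params["input"][j], z))
        f_gate = sigmoid(dot(params["forget"][j], z))
        o_gate = sigmoid(dot(params["output"][j], z))
        g = math.tanh(dot(params["candidate"][j], z))
        c_new = f_gate * c_old + i_gate * g       # cell state (the CEC)
        c.append(c_new)
        h.append(o_gate * math.tanh(c_new))
    return h, c


n_in, n_cells = 120, 50                           # sizes given in the text
params = init_lstm(n_in, n_cells)
h, c = [0.0] * n_cells, [0.0] * n_cells
sequence = [[0.01 * t] * n_in for t in range(5)]  # hypothetical C3 descriptors
for x in sequence:
    h, c = lstm_step(params, x, h, c)
print(len(h), len(c))
```

In a full classifier, the cell outputs `h` would feed an output layer with one unit per action. That layer and the training procedure are omitted here.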
4 Experiments on KTH Dataset
4.1 KTH Human Actions Dataset
The KTH dataset was provided by Schuldt et al. [19] in 2004 and is the most
commonly used public human actions dataset. It contains 6 types of actions
(boxing, hand-clapping, hand-waving, jogging, running and walking) performed by
25 subjects in 4 different scenarios including indoor, outdoor, changes in
clothing and variations in scale. The image resolution is 160 × 120, and the
temporal resolution is 25 frames per second. There are considerable variations
in duration and viewpoint. All sequences were taken over homogeneous
backgrounds with a static camera, but hard shadows are present. As in [4], we
refer to the KTH dataset in two ways. The first one, in which each person
performs the same action 3 or 4 times in the same video, is named KTH1 and
contains 599 long sequences (with a length between 8 and 59 seconds) with
several "empty" frames between action iterations. The second one, in which a
person performs an action only once per sequence, is named KTH2 and contains
2391 sequences (with a length between 1 and 14 seconds).
4.2 Evaluation Protocol
In [4], Gao et al. presented a comprehensive study on the influence of the
evaluation protocol used on the final results. It was shown that the use of
different experimental configurations can lead to performance differences of up
to 9%. Furthermore, the authors demonstrated that the same method, when
evaluated on KTH1 or KTH2, can have performance deviations of over 5.85%. Even
so, action recognition methods are usually directly compared although they use
different testing protocols and/or datasets (KTH1 or KTH2), which distorts the
conclusions. In this paper, we choose to evaluate our method using
cross-validation, in which 16 randomly selected persons are used for training,
and the other 9 for testing. Recognition performance corresponds to the average
across 5 trials. Evaluation is done on both KTH1 and KTH2. We have paid
particular attention to identifying which related works are directly comparable
to ours (see Subsection 4.3), based on the categorization in [4]. Conclusions
will be drawn only from these reliable comparisons, and the other reported
results are simply indicative.
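The protocol described above can be sketched in a few lines of Python: subjects, not individual sequences, are partitioned at random into 16 for training and 9 for testing, and this is repeated for 5 trials. The function names, the fixed seed and the numbering of subjects from 1 to 25 are assumptions made for illustration.

```python
import random


def subject_splits(n_subjects=25, n_train=16, n_trials=5, seed=0):
    """Draw `n_trials` random train/test partitions of the subject identifiers."""
    rng = random.Random(seed)                  # seed assumed, for repeatability
    subjects = list(range(1, n_subjects + 1))
    splits = []
    for _ in range(n_trials):
        train = sorted(rng.sample(subjects, n_train))
        test = [s for s in subjects if s not in train]
        splits.append((train, test))
    return splits


def mean_accuracy(accuracies):
    """Recognition performance reported as the average across trials."""
    return sum(accuracies) / len(accuracies)


splits = subject_splits()
for train, test in splits:
    print(len(train), "training subjects,", len(test), "test subjects")
```

Splitting by subject keeps all sequences of a given person on the same side of the partition, so the test set only contains people never seen during training.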
4.3 Obtained Results
The two-step model was trained as described above. Original videos underwent
the following steps: spatial down-sampling by a factor of 2 horizontally and
vertically to reduce the memory requirement, extraction of the person-centred
bounding box as in [9, 10], and application of 3D Local Contrast Normalization
on a 7 × 7 × 7 neighbourhood, as recommended in [8]. Note that these steps may
be considered as a pre-processing of the input, with the difference that they
were not especially designed to help action recognition and are still much less
complex than those usually used (optical flow, gradients, motion history...).
We also generated vertically flipped and mirrored versions of each training
sample to increase the number of examples. Learned features were then used to
train an LSTM-based recurrent network, as described in Section 3. In our
experiments, we observed that, both for 3D-ConvNets and LSTM, using a subset of
the training dataset for validation in the early stopping process decreases
performance. Indeed, we experimentally verified that performance depends more
on the number of training examples than on the stopping criterion, and that the
shortfall due to overfitting is not important. Thus, the training is stopped at
the first iteration in which performance on the training set no longer rises.
Obtained results, corresponding to 5 randomly selected training/test
configurations, are reported in Table 1.
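The stopping rule stated above can be written as a small helper. This is a sketch under assumptions: the function name is hypothetical, "no longer rises" is read as "is not strictly greater than the previous value", and the example scores are made up.

```python
def first_non_improving_iteration(training_scores):
    """Index of the first iteration whose training score does not exceed
    the previous one, or None if the score rises at every iteration."""
    for i in range(1, len(training_scores)):
        if training_scores[i] <= training_scores[i - 1]:
            return i
    return None


# Hypothetical training-set accuracies measured after each iteration
scores = [71.2, 80.5, 86.9, 90.1, 90.1, 90.4]
print(first_non_improving_iteration(scores))
```

Because the criterion only looks at the training set, no examples need to be held out for validation, which is consistent with the observation that performance depends mostly on the number of training examples.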
Table 1. Summary of experimental results using 5 randomly selected
configurations from KTH1 and KTH2.

Dataset  Method       Config.1  Config.2  Config.3  Config.4  Config.5  Average
KTH1     Maj. voting  90.79     90.24     91.42     91.17     91.62     91.04
KTH1     LSTM         92.69     96.55     94.25     93.55     94.93     94.39
KTH2     Maj. voting  89.14     88.55     89.89     89.45     89.97     89.40
KTH2     LSTM         91.50     94.64     90.47     91.31     92.97     92.17
The two steps of our model (3D-ConvNets and LSTM) are evaluated separately, on
both KTH1 and KTH2. The 3D-ConvNet, combined with majority voting on short
sub-sequences, gives results comparable to other deep-model-based approaches
[9, 10, 21]. We especially note that the results are almost the same as those
obtained in [10], with a 15 times smaller 3D-ConvNet model, and without using
either gradients or optical flow as input. We also notice that the first step
of our model gives relatively stable results on the 5 configurations, compared
with the fluctuations generally observed for the other methods [4].
We also observed that the LSTM contribution is slightly higher for KTH1
(+3.35%) than for KTH2 (+2.77%). This is probably because the LSTM is better
suited to long sequences, and also because KTH1 contains more temporal
information (alternating direction of movement when performing the same action
several times, appearance/disappearance of the person...). Overall, LSTM
sequence labelling achieves an accuracy of 94.39% on KTH1 and 92.17% on KTH2.
These results, and others among the best performing of related work on the KTH
dataset, are reported in Table 2.
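The averages and the LSTM gains quoted above follow directly from the per-configuration figures of Table 1, as the short check below illustrates. The values are copied from Table 1. The tolerance is an assumption that accounts for the table entries being rounded to two decimals, so a recomputed mean may differ from the reported one in the last digit.

```python
table1 = {
    ("KTH1", "Maj. voting"): ([90.79, 90.24, 91.42, 91.17, 91.62], 91.04),
    ("KTH1", "LSTM"):        ([92.69, 96.55, 94.25, 93.55, 94.93], 94.39),
    ("KTH2", "Maj. voting"): ([89.14, 88.55, 89.89, 89.45, 89.97], 89.40),
    ("KTH2", "LSTM"):        ([91.50, 94.64, 90.47, 91.31, 92.97], 92.17),
}

for (dataset, method), (runs, reported) in table1.items():
    mean = sum(runs) / len(runs)
    # Entries are rounded to two decimals, so allow a small discrepancy.
    assert abs(mean - reported) < 0.015, (dataset, method, mean)
    print(f"{dataset} {method:<12} mean={mean:.3f} reported={reported:.2f}")

# Gain of LSTM over majority voting, from the reported averages
gains = {
    d: round(table1[(d, "LSTM")][1] - table1[(d, "Maj. voting")][1], 2)
    for d in ("KTH1", "KTH2")
}
print(gains)
```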
Table 2. Obtained results and comparison with the state of the art on the KTH
dataset: methods marked with * correspond to deep-model approaches, and the
others to those using hand-crafted features.

Dataset  Evaluation protocol           Method                   Average accuracy
KTH1     Cross validation with 5 runs  Our method *             94.39
KTH1     Cross validation with 5 runs  Jhuang et al. [9] *      91.70
KTH1     Cross validation with 5 runs  Schindler and Gool [18]  92.70
KTH1     Cross validation with 5 runs  Gao et al. [4]           95.04
KTH1     Leave-one-out                 Niebles et al. [16]      81.50
KTH1     Leave-one-out                 Chen and Hauptmann [2]   95.83
KTH1     Leave-one-out                 Liu and Shah [15]        94.20
KTH1     Leave-one-out                 Sun et al. [20]          94.0
KTH1     Leave-one-out                 Gao et al. [4]           96.33
KTH2     Cross validation with 5 runs  Our method *             92.17
KTH2     Cross validation with 5 runs  Ji et al. [10] *         90.20
KTH2     Cross validation with 5 runs  Gao et al. [4]           93.57
KTH2     Other protocols               Taylor et al. [21] *     90.00
KTH2     Other protocols               Kim et al. [12]          95.33
KTH2     Other protocols               Laptev et al. [13]       91.80
KTH2     Other protocols               Ikizler et al. [7]       94.00
Table 2 shows that our approach outperforms all related deep-model work
[9, 10, 21], both on KTH1 and KTH2. We especially note that our recognition
scheme even outperforms the HMAX model proposed by Jhuang et al. [9], despite
the hybrid nature of the latter, in which low- and mid-level features are
engineered and learned features are constructed automatically only at the very
last stage.
For each dataset, Table 2 is divided into 2 groups: the first group consists of
the methods which can be directly compared with ours, i.e. those using the same
evaluation protocol (which is cross validation with 5 randomly selected splits
of the dataset into training and test). The second one includes the methods
that use different protocols, and for which the comparison is therefore only
indicative. Among the methods of the first group, to our knowledge, our method
obtained the second best accuracy, both on KTH1 and KTH2, the best score being
obtained by Gao et al. [4]. Note that the results in [4] correspond to the
average over the 5 best runs out of 30, and that these classification rates
decrease to 90.93% for KTH1 and 88.49% for KTH2 when averaging over the 5 worst
ones.
More generally, our method gives results comparable with the best related work
on the KTH dataset, even with methods relying on engineered features, and with
those evaluated using protocols which were shown to markedly increase
performance (e.g. leave-one-out). This is a surprisingly good result
considering the fact that all the steps of our model are based on automatic
learning, without the use of any prior knowledge.
5 Conclusion and Discussion
References
1. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Action
classification in soccer videos with long short-term memory recurrent neural
networks. In: Diamantaras, K., Duch, W., Iliadis, L. (eds.) Artificial Neural
Networks, Lecture Notes in Computer Science, vol. 6353, pp. 154-159. Springer
Berlin / Heidelberg (2010)
2. Chen, M.y., Hauptmann, A.: MoSIFT: Recognizing human actions in surveillance
videos. Tech. Rep. CMU-CS-09-161, Carnegie Mellon University (September 2009)
3. Dollár, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via
sparse spatio-temporal features. In: Visual Surveillance and Performance
Evaluation of Tracking and Surveillance, 2005. 2nd Joint IEEE International
Workshop on. pp. 65-72 (October 2005)
4. Gao, Z., Chen, M.y., Hauptmann, A., Cai, A.: Comparing evaluation protocols
on the KTH dataset. In: Salah, A., Gevers, T., Sebe, N., Vinciarelli, A. (eds.)
Human Behavior Understanding, Lecture Notes in Computer Science, vol. 6219, pp.
88-100. Springer Berlin / Heidelberg (2010)
5. Garcia, C., Delakis, M.: Convolutional face finder: a neural architecture
for fast and robust face detection. Pattern Analysis and Machine Intelligence,
IEEE Transactions on 26(11), 1408-1423 (November 2004)
6. Gers, F.A., Schraudolph, N.N., Schmidhuber, J.: Learning precise timing with
LSTM recurrent networks. J. Mach. Learn. Res. 3, 115-143 (March 2003)
7. Ikizler, N., Cinbis, R., Duygulu, P.: Human action recognition with line and
flow histograms. In: 19th International Conference on Pattern Recognition. pp.
1-4 (December 2008)
8. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best
multi-stage architecture for object recognition? In: Computer Vision, 2009 IEEE
12th International Conference on. pp. 2146-2153 (2009)
9. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system
for action recognition. In: 11th International Conference on Computer Vision.
pp. 1-8 (2007)
10. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for
human action recognition. In: 27th International Conference on Machine
Learning. pp. 495-502 (2010)
11. Kim, H.J., Lee, J., Yang, H.S.: Human action recognition using a modified
convolutional neural network. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C.
(eds.) Advances in Neural Networks, Lecture Notes in Computer Science, vol.
4492, pp. 715-723. Springer Berlin / Heidelberg (2007)
12. Kim, T.K., Wong, S.F., Cipolla, R.: Tensor canonical correlation analysis
for action classification. In: Computer Vision and Pattern Recognition. pp. 1-8
(June 2007)
13. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic
human actions from movies. In: Computer Vision and Pattern Recognition, 2008.
CVPR 2008. IEEE Conference on. pp. 1-8 (June 2008)
14. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning
applied to document recognition. Proceedings of the IEEE 86(11), 2278-2324
(November 1998)
15. Liu, J., Shah, M.: Learning human actions via information maximization. In:
Computer Vision and Pattern Recognition. pp. 1-8 (June 2008)
16. Niebles, J., Wang, H., Fei-Fei, L.: Unsupervised learning of human action
categories using spatial-temporal words. International Journal of Computer
Vision 79, 299-318 (2008)
17. Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., Barbano, P.:
Toward automatic phenotyping of developing embryos from videos. Image
Processing, IEEE Transactions on 14(9), 1360-1371 (2005)
18. Schindler, K., van Gool, L.: Action snippets: How many frames does human
action recognition require? In: Computer Vision and Pattern Recognition, 2008.
CVPR 2008. IEEE Conference on. pp. 1-8 (June 2008)
19. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM
approach. In: Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th
International Conference on. vol. 3, pp. 32-36 (August 2004)
20. Sun, X., Chen, M., Hauptmann, A.: Action recognition via local descriptors
and holistic features. In: Computer Vision and Pattern Recognition Workshops.
pp. 58-65 (June 2009)
21. Taylor, G.W., Fergus, R., LeCun, Y., Bregler, C.: Convolutional learning of
spatio-temporal features. In: Proceedings of the 11th European Conference on
Computer Vision: Part VI. pp. 140-153. ECCV'10, Springer-Verlag, Berlin,
Heidelberg (2010)
