Learning to Recognize Agent Activities and Intentions
Abstract
Psychological research has demonstrated that subjects shown animations consisting of nothing more than simple geometric shapes perceive the shapes as being alive, having goals and intentions, and even engaging in social activities such as chasing and evading one another. While the subjects could not directly perceive affective state, motor commands, or the beliefs and intentions of the actors in the animations, they still used intentional language to describe the moving shapes. The purpose of this dissertation is to design, develop, and evaluate computational representations and learning algorithms that learn to recognize the behaviors of agents as they perform and execute different activities. These activities take place within simulations, both 2D and 3D. Our goal is to add as little hand-crafted knowledge to the representation as possible and to produce algorithms that perform well over a variety of different activity types. Any patterns found in similar activities should be discovered by the learning algorithm and not by us, the designers. In addition, we demonstrate that if an artificial agent learns about activities through participation, where it has access to its own internal affective state, motor commands, etc., it can then infer the unobservable affective state of other agents.
Learning to Recognize Agent Activities and Intentions
by
Wesley Nathan Kerr
Creative Commons 3.0 Attribution-Share Alike License
A Dissertation Submitted to the Faculty of the
DEPARTMENT OF COMPUTER SCIENCE
In Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY
In the Graduate College
THE UNIVERSITY OF ARIZONA
2010
GRADUATE COLLEGE
As members of the Dissertation Committee, we certify that we have read the
dissertation prepared by Wesley Nathan Kerr
entitled Learning to Recognize Agent Activities and Intentions
and recommend that it be accepted as fulfilling the dissertation requirement for the
Degree of Doctor of Philosophy.
Date: 10 August 2010
Paul R. Cohen
Date: 10 August 2010
Niall Adams
Date: 10 August 2010
Ian Fasel
Date: 10 August 2010
Stephen Kobourov
Final approval and acceptance of this dissertation is contingent upon the candidate's
submission of the final copies of the dissertation to the Graduate College.
I hereby certify that I have read this dissertation prepared under my direction and
recommend that it be accepted as fulfilling the dissertation requirement.
Date: 10 August 2010
Dissertation Director: Paul R. Cohen
This dissertation has been submitted in partial fulfillment of requirements for an
advanced degree at the University of Arizona and is deposited in the University
Library to be made available to borrowers under rules of the Library.
Brief quotations from this dissertation are allowable without special permission,
provided that accurate acknowledgment of source is made. This work is licensed
under the Creative Commons Attribution-Share Alike 3.0 United States License. To
view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/
or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco,
California, 94105, USA.
SIGNED: Wesley Nathan Kerr
I thoroughly enjoyed the time I spent in graduate school primarily because I worked
with exceptional people. There are many people that I would like to thank and
acknowledge for their support over the past few years.
First among them is my advisor, Paul Cohen. I cannot begin to describe how
incredible it was to work in Paul's lab these past five years. I will simply say that
Paul is one of the most intelligent men I have ever met and it was a privilege to
have him as a mentor.
Special thanks goes to Niall Adams for the hard work he put in making this
dissertation what it is. He helped edit some of the earliest drafts and shaped some
of the final experiments. Niall was a helpful mentor throughout my career and
genuinely concerned about making my research excellent.
Thanks goes to Stephen Kobourov and Ian Fasel for their contributions to my
dissertation and for being part of my committee. I enjoyed our conversations and
look forward to future collaborations.
Growing up, my grandmother would remind me that "behind every great man
there is a great woman." I do not claim to be a great man, but I can say that I have
a great woman. My wife Nicole had the hardest job of all since she had to live with
me while I was working on my dissertation. Her patience with me is admirable and
her unwavering support is truly commendable.
There are several people I would like to thank from my time at USC. First
among them are my office mates, Shane Hoversten and Daniel Hewlett. I thoroughly
enjoyed our conversations, even though we did not always agree. I would also like
to thank you both for being friends and making me look back fondly at our time
together at USC. Down the hall from my office were two other good friends who
would help bring a lighter side to my life as a graduate student. Many thanks to
Martin Michalowski and Matt Michelson for always being up for a game of FIFA
and a chance to relax.
I would like to thank the other graduate students from Paul's lab: Daniel
Hewlett, Derek Green, Antons Rebguns, Jeremy Wright, Nik Sharp, Anh Tran.
I will cherish the conversations that we shared in the lab and at the bar.
Finally, a special thanks goes out to Lupe Jacobo and Rhonda Leiva for your
hard work ensuring that I would finish my dissertation in a reasonable time frame.
Dedicated to Mom and Dad for their patience and support these last seven years.
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.1.1 Intention
1.2 Problem
1.3 Overview
CHAPTER 2 RELATED WORK
2.1 Activity Recognition in Robots and Softbots
2.1.1 Intention
2.2 Computational Approaches
2.2.1 Pattern Mining
2.2.2 Multivariate Time Series Classification
2.2.3 Univariate Time Series Classification
2.3 Remarks
CHAPTER 3 REPRESENTATION
3.1 Qualitative Sequences
3.1.1 Event Sequences
3.1.2 Relational Sequences
3.2 Real-Valued Variables
3.2.1 Preparation
3.2.2 Symbolic Conversion
3.2.3 Shape Conversion
3.3 Wrapping Up
CHAPTER 4 LEARNING AND INFERENCE
4.1 Sequence Similarity
4.1.1 Needleman-Wunsch Algorithm
4.1.2 Optimizations
4.2 Signatures
4.3 Visualizing Signatures
4.4 Finite State Machines
4.4.1 Signatures for Generalization
4.5 Inferring Hidden State
4.6 Wrapping Up
CHAPTER 5 RESULTS
5.1 Data Collection
5.1.1 Wubble World
5.1.2 Wubble World 2D
5.2 Classification
5.2.1 Wubble World
5.2.2 Wubble World 2D
5.2.3 Discussion
5.2.4 Learning Rate
5.3 Heat Maps
5.4 Recognition
5.5 Inferring Hidden State
5.6 Wrapping Up
CHAPTER 6 APPLICATIONS TO DATA MINING
6.1 Datasets
6.1.1 Handwriting
6.1.2 ECG
6.1.3 Wafer
6.1.4 Japanese Vowel
6.1.5 Sign Language
6.2 Classification
6.2.1 k-NN Classifier
6.2.2 CAVE
6.3 Wrapping Up
CHAPTER 7 CONCLUSIONS AND FUTURE WORK
7.1 Future Work
7.2 Final Remarks
REFERENCES
LIST OF FIGURES
1.1 Heider and Simmel frame
1.2 Sample PMTS for jump over
1.3 Five examples of the activity approach
2.1 Allen Relations
3.1 Diagram of common terms.
3.2 Example sequences for each representation
3.3 The bit array for an approach episode
3.4 Compression process for CBA representation
3.5 The CBA representation for an approach episode
3.6 Speed time series
3.7 The effects of smoothing.
3.8 Speed time series with SAX symbols
3.9 Speed time series as SDL symbols
4.1 An alignment between two sequences.
4.2 Sample heat map with marked heat indexes
4.3 Heat maps for an approach episode
4.4 An example of the activity approach marked as sequence of states.
4.5 An example of the conversion from a CBA into a FSM.
4.6 The complete FSM for each approach episode in Figure 1.3.
4.7 Mapping from multiple sequence alignment to original episode
4.8 A general FSM for the approach activities in Figure 1.3
5.1 Screenshot of the Wubble World 2D simulator.
5.2 K-Folds partitioning
5.3 Learning curve of the CAVE algorithm generated by presenting labeled training instances one at a time to corresponding signatures.
5.4 Heat maps for jump over and jump on aligned with a signature for jump over
5.5 The heat map for an approach episode aligned with the signature for jump over.
5.6 Recognition results
6.1 Classification accuracy for each representation after shuffling
6.2 Difference in performance for ablation study
6.3 Classification accuracy for each activity for different settings of the exclusion percentage.
LIST OF TABLES
3.1 Fluent representation of an episode of approach.
3.2 SAX breakpoint lookup table
4.1 Similarity table for sequence alignment
4.2 The signature constructed from the first four examples in Figure 1.3.
4.3 Signature update example
4.4 Sample signatures for each sequence representation
4.5 Rewriting the signature as a multiple sequence alignment.
5.1 Wubble World dataset statistics
5.2 Wubble World 2D dataset statistics
5.3 Classification results for Wubble World
5.4 Wubble World confusion matrix
5.5 Classification results for Wubble World 2D
5.6 Classification results after shuffling sequences.
5.7 Classification performance when some of the propositions are unobservable.
5.8 Inferred relations for the chase activity
5.9 Overlap between most frequent unobservable relations from the ww dataset.
5.10 Overlap between most frequent unobservable relations in ww2d signatures
6.1 Handwriting dataset statistics
6.2 Electrocardiogram dataset statistics
6.3 Wafer dataset statistics
6.4 Japanese vowel dataset statistics
6.5 Auslan dataset statistics
6.6 Percent correct with a 10-NN classifier from a 10-fold cross validation classification task.
6.7 A two-way analysis of variance for dataset by representation shows one main effect and no significant interaction effect.
6.8 Best classification accuracy for several of our datasets.
6.9 Classification accuracy without SDL variables
6.10 Classification accuracy without SAX variables
6.11 A three-way analysis for the data shown in Figure 6.2.
occurred. Activities are perceived as discrete, with beginnings and ends, so although
people experience the world as a continuous flow of information, humans parse and
extract distinct activities. Furthermore, the activities and their boundaries are very
similar across different individuals.
Previous research has demonstrated that subjects who are shown animations
consisting of nothing more than simple geometric shapes perceive these shapes as
alive, having goals and intentions, and even interacting in social relationships such as
chasing and evading (Heider and Simmel, 1944; Blythe et al., 1999). For example,
subjects in the classic Heider and Simmel study were shown a two minute video
containing nothing more than a circle, two triangles (one large and one small), and
a few rectangles, yet the subjects consistently labeled the larger triangle as a bully
who constantly chased and harassed the smaller triangle and circle (Heider and
Simmel, 1944). A single frame from the original study is replicated in Figure 1.1.
Figure 1.1: Single frame from an animation similar to the original Heider and Simmel
animation.
1.1.1 Intention
Baldwin and Baird (2001) argued that we care little for the behavior exhibited by
animate entities in motion, and we are most interested in the underlying intentions of
the animate entities. Imagine that one is born with the ability to recognize patterns
in movement, but one is unable to discern any purpose behind these patterns. For
this individual, there is no difference between a playful shove from a friend and a shove
from someone wishing to cause harm to the individual. It is the shover's intentions
affective state, motor commands, or the beliefs and intentions of the actors or
controlling entities in the animations, yet they inferred affective states and described
them with intentional language.
We think humans infer affective states given non-affective observables such as
positions and velocities by calling on their own affective experiences. Observables cue,
or cause to be retrieved from memory, representations of the activity that include
learned affective components, which are inferred or "filled in" as interpretations of
patterns of motion or other non-affective observables.
1.2 Problem
The problem addressed by this dissertation is to develop algorithms capable of
recognizing and classifying activities after seeing only a few episodes of the activity.
In addition, once the activity is correctly identified, there should be a mechanism
to infer the internal states of the agent performing the activity. For demonstrative
purposes, we focus on a single activity and highlight the different challenges that
we will face. We will be working with multivariate time series. A multivariate time
series (MTS) is a collection of random variables whose values are sampled over time
at the same intervals. Often, the values sampled in an MTS are real-valued, yet the
representations outlined in Chapter 3 operate on propositional data. We define a
propositional multivariate time series (PMTS) to be an MTS in which all of the
variables are propositions, and in Chapter 3 we outline a process to convert from
MTS into PMTS. A propositional variable can be either true or false at any moment
in time and can change value multiple times within a PMTS. A contiguous period of
time during which the propositional variable is true is called a fluent. Specifically, a
fluent is a tuple containing three elements: the name of the proposition, the time at
which it becomes true, and the time at which it becomes false. A collection of fluents
constitutes an episode, and instances of an activity are represented as episodes.
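The definitions above can be made concrete with a short sketch. The code below is only an illustration under stated assumptions: the names Fluent and extract_fluents are hypothetical, time is measured in integer sample indices, and a proposition that is still true at the end of the series is assumed to end at the series length. It converts a PMTS, given as a mapping from proposition names to Boolean samples, into the collection of fluents that constitutes an episode.

```python
from typing import Dict, List, NamedTuple, Sequence


class Fluent(NamedTuple):
    name: str   # name of the proposition
    start: int  # time step at which the proposition becomes true
    end: int    # time step at which the proposition becomes false


def extract_fluents(pmts: Dict[str, Sequence[bool]]) -> List[Fluent]:
    """Turn every maximal run of True samples into one fluent."""
    fluents = []
    for name, values in pmts.items():
        start = None
        for t, value in enumerate(values):
            if value and start is None:
                start = t                      # proposition just became true
            elif not value and start is not None:
                fluents.append(Fluent(name, start, t))
                start = None                   # proposition just became false
        if start is not None:
            # Assumption: a run still open at the end closes at the series length.
            fluents.append(Fluent(name, start, len(values)))
    return sorted(fluents, key=lambda f: (f.start, f.end, f.name))


# A toy episode with two propositions sampled at six time steps.
episode = extract_fluents({
    "forward(agent)": [True, True, True, False, False, False],
    "distance-decreasing(agent,box)": [False, True, True, True, True, False],
})
for fluent in episode:
    print(fluent)
```

Because a proposition can change value several times, the same proposition name may appear in more than one fluent of an episode.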
To illustrate, we use data from the Wubble World simulator. Wubble World
is a virtual environment with simulated physics, in which softbots, called wubbles,
Figure 1.2: An example PMTS for the activity jump over. The activity is subdivided
into subregions for descriptive purposes. (Figure labels: Internal, Motor Commands;
horizontal axis: Time, divided into Phases I through V.)
schematic. Every example shares a common set of propositions, although not every
proposition becomes true in every episode. For instance, the visual schematic for
the approach in (a) does not contain a wall or a second box; therefore, these
propositions never have the opportunity to become true. For the sake of brevity, we have
omitted additional variables that may also be necessary for classifying episodes of
approach, like internal desires and goals.
The purpose of this dissertation is to design, develop, and evaluate computational
representations and learning algorithms that learn to recognize the activities
agents perform. The activities take place within simulation, either 2D or 3D. They
range from actions the agent is trying to perform, like jumping over blocks, to more
position of two agents, and the relative velocity of the two agents. Each observation
is real-valued, but we shall later see that we can convert these numbers into binary
propositions with a straightforward algorithm discussed in Chapters 3 and 6.
Our goals in this work are to add as little hand-crafted knowledge as possible to
the representation of episodes discussed in future chapters and to produce algorithms
that perform well over a variety of different activity types. Any patterns found in
episodes for similar activities should be discovered by the learning algorithm and
not by us, the designers. By different activity types, we mean activities outside the
domain of agent interactions; for example, another activity might involve recognizing
a subject's handwriting or recognizing an abnormal heartbeat.
Performance is measured on three different tasks: classification, inference, and
recognition. In a classification task, the agent is presented with an episode it has
never encountered before and needs to produce the correct activity label to describe
the corresponding activity. The inference task requires the agent to infer the state of
internal variables over the course of the activity on the basis of previously observed
and labeled data. In inference, we are given the observable variables in the episode
and must correctly infer the internal state of the agents participating in the activity.
Inference can only be achieved if the agent can correctly classify episodes. The last
task is recognition, and it involves watching the state of the world change over time
and determining whether or not an activity has occurred. This task is intrinsically
harder than classification since the boundaries of an activity are not marked.
Some challenges are illustrated by the examples of approach activities introduced
in Figure 1.3. First, episodes have different durations. In each example in Figure 1.3
the agent is moving forward toward the block and, at some point, stops performing
the action forward(agent), while inertia continues moving the agent forward until
it comes to a stop in front of the box. Visually, this corresponds to a common
pattern, highlighted in red in Figure 1.3, that differs in duration for each episode.
A second challenge is to determine which of the many propositions are irrelevant
to the activity taking place. Sometimes examples of an activity contain additional
propositions, such as turn-left(agent) and turn-right(agent) in example (d).
These propositions are considered noise in the context of a specific activity, whereas
the remaining propositions aid classification algorithms. Another example of
irrelevant propositions is found in Figure 1.3(b). It just so happens that as the agent
is approaching the box, it is also approaching a wall behind the box. The variable
distance-decreasing(agent,box2) is not critical to a semantic interpretation of
approach and will only serve as a distraction in the classification task.
Even after determining which propositions to attend to, we still need to
determine which fluents, if any, are important. It may be necessary to filter noisy fluents,
or it may just be the case that a certain fluent does not contribute anything to a
semantic interpretation of the activity. Consider the sample activity in Figure 1.3(e).
The agent must move around a wall that is blocking its path to the block it is
approaching. Unlike the example in Figure 1.3(d), the agent moves parallel to the wall
and the distance to the block is not decreasing. Once the agent navigates around
the wall, then the distance begins decreasing again. In this example, there are two
fluents in which the proposition distance-decreasing(agent,box) becomes true.
None, one, or both of these fluents may help in the classification and prediction
tasks presented earlier.
In the coming chapters we will describe the CAVE algorithm, designed to classify
and visualize episodes. CAVE is trained to classify activities through supervised
learning with labeled episodes, such as instances of aggressive agents bullying smaller
agents. Visualization produces a descriptive image of the sometimes complex
interactions between fluents. An agent built with CAVE learns to classify activities by
first performing them itself. Therefore, it has access to both observable aspects of
activities such as motion, and private aspects such as intentions, emotional states,
and motor commands. It can also use observations of other agents to retrieve
activities from memory and project the hidden or private aspects of these activities onto
other agents.
1.3 Overview
The rest of this dissertation is organized as follows. Chapter 2 provides a literature
review of previous relevant work done in artificial intelligence and data mining.
Chapter 3 details the representations designed to aid in the recognition of activities.
Chapter 4 details several learning and visualization algorithms that work directly
with the representations described in the preceding chapter. Chapter 5 provides
empirical results on several tasks within our agent-based simulations. Chapter 6
provides a deeper analysis of the representations and algorithms within the
agent-based simulations and across multiple datasets. Chapter 7 discusses avenues for
further research and draws some final conclusions.
ple (Iacoboni et al., 2005; Agnew et al., 2007). Although we do not claim that an
agent augmented with the representations and learning algorithms described in this
dissertation understands intentions in the same way that humans do, both lines of
research propose a mechanism by which one agent understands the intentionality
of another by relating the observed actions with one's own previous memories and
internal states. We will see this trend again when we discuss the algorithm for
inferring intention in Chapter 4.
2.2 Computational Approaches
2.2.1 Pattern Mining
Many tasks that researchers have tried to solve take multivariate time series as input.
Some of the earliest work set out to extract common patterns or rules
from propositional multivariate time series (PMTS). Most of this research focuses on
extracting temporal patterns that occur with a frequency greater than some threshold;
this frequency is known as the support of the pattern. Temporal patterns are commonly
described using Allen relations (Allen, 1983). Allen recognized that there is only a
small set of relationships that can hold between two propositional intervals (see
Figure 2.1). The Allen relation between
two intervals is determined by the intervals' start and end times.
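Since the relation depends only on the start and end times, it can be computed with a handful of comparisons. The following is a minimal sketch, not the implementation used in any of the cited work. It assumes intervals are (start, end) pairs with start < end, uses the seven relation names listed in Figure 2.1, and assumes that starts-with means equal start times and finishes-with means equal end times. The function names allen_relation and relate are hypothetical.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # (start, end), assuming start < end


def allen_relation(x: Interval, y: Interval) -> Optional[str]:
    """Return the name r such that (x r y) holds, or None if only (y r x) does."""
    xs, xe = x
    ys, ye = y
    if xs == ys and xe == ye:
        return "equals"
    if xs == ys:
        return "starts-with"      # assumption: same start, different end
    if xe == ye:
        return "finishes-with"    # assumption: same end, different start
    if xe < ys:
        return "before"
    if xe == ys:
        return "meets"
    if ys < xs and xe < ye:
        return "during"
    if xs < ys < xe < ye:
        return "overlaps"
    return None


def relate(a: Interval, b: Interval) -> Tuple[str, str, str]:
    """Describe the pair (a, b) in whichever order one of the relations holds."""
    forward = allen_relation(a, b)
    if forward is not None:
        return ("a", forward, "b")
    return ("b", allen_relation(b, a), "a")


print(relate((0, 5), (5, 8)))
print(relate((0, 10), (2, 4)))
```

Trying both argument orders in relate is a design choice of this sketch: it keeps the vocabulary to the seven named relations instead of introducing separate names for their inverses.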
In temporal pattern mining research, there are several critical choices that
researchers must make in order to find interesting patterns. First, they must decide
on a measure that determines what makes one pattern better than another (often
called support). Second, researchers must select a representation of the temporal
relationships between propositions. Any representation must solve two problems.
First, a pattern consisting of three or more intervals can be represented as compositions
of Allen relations in several ways (e.g., we could say meets(a, during(b, c)) or
during(meets(a, b), c)). A canonical form is desired, and is provided by several
researchers (Winarko and Roddick, 2007; Höppner, 2001b). Second, a representation
should capture all the Allen relations that exist between propositions. A sentence
such as during(meets(a, b), c) does not say whether c occurs during a, b, or both,
though one of these must be true.
Figure 2.1: Allen Relations. The relations shown for two intervals x and y, with their
abbreviations, are (x equals y), (x e y); (x meets y), (x m y); (x finishes-with y),
(x f y); (x starts-with y), (x s y); (x overlaps y), (x o y); (x during y), (x d y);
and (x before y), (x b y).
Kam and Fu (2000) present an interesting canonical form, based on right
concatenations, for compositions of three or more fluents. However, this representation
does not capture all the Allen relations in a composition. In the algorithm presented
by Kam and Fu, the frequency of the pattern was used to decide whether or not a
pattern was interesting. The algorithms presented in (Cohen, 2001; Cohen et al.,
2002; Fleischman et al., 2006) admitted a larger set of possible patterns than Kam
and Fu, but lacked a canonical representation of patterns. This research focuses on
extracting patterns that are determined to be statistically significant through
hypothesis testing. This is the only work we know of relating support for a pattern to
evidence against the null hypothesis that no pattern exists. Other research (Höppner
and Klawonn, 2002; Winarko and Roddick, 2007) uses a matrix representation that
captures all $k(k-1)/2$ pairwise Allen relations between k intervals, and again uses
frequency to determine the support of a pattern. In Höppner and Klawonn (2002),
support is defined as the frequency within a sliding window, ensuring a degree of
locality between the internal relationships of the pattern, whereas in Winarko and
Roddick (2007) the support metric is similar to that of Kam and Fu (2000).
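To illustrate what a matrix of pairwise relations looks like, the sketch below builds the upper triangle of such a matrix for a small set of fluents. It is an assumption-laden illustration and not the representation defined in the cited papers: fluents are (name, start, end) triples, they are put into a canonical order by sorting on start and then end time, and the label "contains" is introduced here for the case where the later-starting interval lies inside the earlier one. The names relation and relation_matrix are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Fluent = Tuple[str, int, int]  # (proposition name, start, end)


def relation(x: Tuple[int, int], y: Tuple[int, int]) -> str:
    """Relation between x and y, assuming x sorts before y by (start, end)."""
    (xs, xe), (ys, ye) = x, y
    if (xs, xe) == (ys, ye):
        return "equals"
    if xs == ys:
        return "starts-with"
    if xe < ys:
        return "before"
    if xe == ys:
        return "meets"
    if xe == ye:
        return "finishes-with"
    if xe > ye:
        return "contains"  # label assumed here: y occurs during x
    return "overlaps"


def relation_matrix(fluents: List[Fluent]) -> Tuple[List[str], Dict[Tuple[int, int], str]]:
    """Return the ordered names and one relation per unordered pair of fluents."""
    ordered = sorted(fluents, key=lambda f: (f[1], f[2], f[0]))
    matrix = {}
    for i, j in combinations(range(len(ordered)), 2):
        matrix[(i, j)] = relation(ordered[i][1:], ordered[j][1:])
    return [f[0] for f in ordered], matrix


names, matrix = relation_matrix([("b", 4, 9), ("a", 0, 4), ("c", 5, 7)])
k = len(names)
print(names, matrix)
print(len(matrix) == k * (k - 1) // 2)
```

Sorting before filling the matrix gives each set of fluents a single canonical description, so two episodes containing the same temporal pattern produce the same matrix.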
similar to those just discussed, but these patterns are based on a subset of the
original Allen relations, specifically before and overlaps. The complete set of large
patterns is pruned by hypothesis testing to generate a subset of patterns that will
serve as a binary feature vector. Each training episode results in a single binary
feature vector such that each value is true when the corresponding pattern is found
within the episode and false otherwise. The feature vectors and class labels are
used to train a traditional classifier (e.g., SVM). One problem with this approach is
that the classifier must be completely retrained whenever new training episodes are
acquired.
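The feature construction described above can be sketched in a few lines. This is only an illustration: it assumes that pattern mining and pruning have already produced a list of selected patterns, and that each episode is summarized by the set of patterns found in it. The pattern encoding, the example data, and the function name to_feature_vector are all hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Pattern = Tuple[str, str, str]  # (proposition, relation, proposition)


def to_feature_vector(found: Set[Pattern], selected: Sequence[Pattern]) -> List[bool]:
    """One Boolean per selected pattern: True if the episode contains it."""
    return [pattern in found for pattern in selected]


# Hypothetical patterns that survived pruning.
selected_patterns = [
    ("forward(agent)", "overlaps", "distance-decreasing(agent,box)"),
    ("forward(agent)", "before", "collision(agent,box)"),
    ("turn-left(agent)", "before", "forward(agent)"),
]

# Hypothetical set of patterns found in one training episode.
episode_patterns = {
    ("forward(agent)", "overlaps", "distance-decreasing(agent,box)"),
    ("turn-left(agent)", "before", "forward(agent)"),
}

vector = to_feature_vector(episode_patterns, selected_patterns)
print(vector)
```

The vectors, paired with class labels, can then be passed to any standard classifier. Note that the length and meaning of the vector depend on the selected patterns, which is why new training episodes that change the pattern set force retraining.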
Kadous and Sammut (2005) present a multivariate classification algorithm that
operates on real-valued time series and learns metafeatures to augment the original
data. Features cited as metafeatures include local maxima and gradient
information. These augment the original data with propositional features. A traditional
classifier is trained on all of the time series information, the original data and the
metafeatures. The rules generated from the classifier provide some insight into the
decisions made by the classifier.
Weng and Shen (2008) propose to use two-dimensional singular value
decomposition (2dSVD) as a tool for classifying multivariate time series. The authors treat
the MTS as a large two-dimensional matrix where each row is a different variable
and each column corresponds to the observations at a specific moment in time. A
feature matrix is obtained for each MTS, and during classification a nearest-neighbor
classifier selects the class label of the most similar feature matrix. Other researchers
have preferred to classify time series by working directly on the original time series
without performing any transformations (Oates et al., 2000; Gromann et al., 2003;
Yang and Shahabi, 2007, 2004; Morse and Patel, 2007).
Another commonly used statistical approach to classification of multivariate time
series is to train hidden Markov models (HMMs) (Nathan et al., 1995; Lee and Xu,
1996). HMMs have proven useful for speech recognition, as well as for classifying
words in American Sign Language (Starner, 1995). A commonly cited problem with
HMMs, though, is that the structure must be specified a priori. There have been
some approaches to help mitigate this problem; for instance, Firoiu and Cohen
(1999) present a state splitting algorithm that attempts to learn the structure of the
HMM by greedily reducing the size of the resulting representation.
Gesture recognition in both 2D and 3D is a classification problem that involves
multivariate time series. Bobick and Wilson (1995) present algorithms that reduce
a multivariate time series into a single sequence of states and then employ a
dynamic programming solution to find the distance between two gestures.
2.2.3 Univariate Time Series Classification
There is far more research dealing with classification/clustering/indexing of univariate
time series than there is with multivariate time series. Within this research
there are several different problems being addressed. Some researchers focus on
developing new distance metrics to compare univariate time series, while others focus
on reducing the amount of data used in comparisons. In addition, some researchers
examine ways to convert real-valued time series into symbolic sequences.
There are several different distance metrics for comparing two real-valued univariate
time series, and each of these has many extensions. One of the most straightforward
is the longest common subsequence (LCS) algorithm. LCS provides a mechanism
to align two time series that may suffer from translational shift (one starts
later than the other). It relies on dynamic programming to find the largest
subsequence that is common between the two time series. One common way to do this
is to begin with real-valued time series, convert them into a symbolic form, then
use LCS as a distance metric to compare them (Devisscher et al., 2008; Balasko
et al., 2006; Wang et al., 2005). Some authors use LCS on the original, unaltered
time series by thresholding the equality operator when comparing two values in the
sequence (Vlachos et al., 2002, 2006; Buzan et al., 2004). Similarly, Chhieng and
Wong (2007) propose an extension to the string edit distance in which the distance
between two points influences the cost matrix set up by LCS and string edit
distance.
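The thresholded variant of LCS described above can be sketched as follows. This is only a minimal illustration, not the method of any of the cited papers: the function names, the tolerance parameter eps, and the normalization used to turn the LCS length into a distance are all assumptions made for the example.

```python
def lcs_length(x, y, eps=0.5):
    """Length of the longest common subsequence of two real-valued series.

    Two values are treated as equal when they differ by at most eps
    (eps is a hypothetical tolerance chosen for illustration).
    """
    m, n = len(x), len(y)
    # table[i][j] holds the LCS length of x[:i] and y[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(x[i - 1] - y[j - 1]) <= eps:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


def lcs_distance(x, y, eps=0.5):
    """One possible distance: 1 minus the LCS length over the shorter length."""
    return 1.0 - lcs_length(x, y, eps) / min(len(x), len(y))


# The second series starts one step later than the first.
a = [0.0, 1.0, 2.0, 3.0]
b = [5.0, 0.1, 1.1, 2.1, 2.9]
print(lcs_length(a, b, eps=0.2), lcs_distance(a, b, eps=0.2))
```

Because the subsequence need not be contiguous, the leading value in b is simply skipped, which is how LCS absorbs a translational shift.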
Another popular distance metric for comparing two univariate time series is
dynamic time warping (DTW). As the name implies, dynamic time warping was
designed to handle translational shifts in time and was originally created for speech
recognition (Sakoe and Chiba, 1978). One benefit of dynamic time warping is that
it can work directly on real-valued time series. Berndt and Clifford (1994) present
early results demonstrating the feasibility of DTW on univariate time series and
provide initial evidence that once DTW extracts patterns from the time series,
we can construct higher order rules to describe transitions between the patterns.
Other researchers have used and extended DTW in order to perform classification
on univariate and multivariate time series (Oates et al., 1999; Keogh and Pazzani,
2001; Fu et al., 2008).
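For readers unfamiliar with DTW, the basic recurrence can be sketched in a few lines. This is the textbook form with an absolute-difference local cost and no warping window; it is an illustrative assumption and not the specific variant used in any of the works cited above.

```python
def dtw_distance(x, y):
    """Dynamic time warping distance between two real-valued series.

    Uses |x_i - y_j| as the local cost and allows unconstrained warping.
    """
    inf = float("inf")
    m, n = len(x), len(y)
    # cost[i][j] is the cheapest warping of x[:i] onto y[:j].
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x[i] repeats the match with y[j]
                cost[i][j - 1],      # y[j] repeats the match with x[i]
                cost[i - 1][j - 1],  # both series advance
            )
    return cost[m][n]


# The second series is a stretched copy of the first, so the distance is zero.
print(dtw_distance([1, 2, 3], [1, 1, 2, 3, 3]))
```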
Most of the work on classification of time series data has focused on real-valued
time series. Lin et al. (2003) propose a new way to convert real-valued
time series into symbolic time series and demonstrate that even with a naive
distance calculation between symbolic time series, they are still able to generate
good results. Balasko et al. (2006) propose a way to convert a univariate time series
into a symbolic sequence and then perform sequence alignment in order to gather more
insight about the processes generating the time series. The
authors do not perform a classification task and instead focus on understanding the
underlying process generating the time series.
2.3 Remarks
In this chapter, we could only scratch the surface of a large and diverse field. For
a thorough survey of temporal pattern mining and classification work see (Mitsa,
2010; Galushka et al., 2006; Liao, 2005).
are represented identically. The CBA corresponding to the dynamics in Figure 3.3
is shown in Figure 3.5.
Input Data                               CBA
p1  1 1 1 1 1 1 1 1 0 0 0 0 0 0          p1  1 1 0
p2  0 0 0 0 1 1 1 1 1 1 1 1 1 1          p2  0 1 1
Figure 3.4: Discarding all but one of identical columns (shown shaded) in a bit array
produces a compressed bit array (CBA).
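The compression illustrated in Figure 3.4 can be sketched as follows. We assume here that "identical columns" means runs of adjacent identical columns, which is what the figure shows; the function name and the list-of-rows input format are hypothetical.

```python
def compress_bit_array(rows):
    """Collapse each run of adjacent identical columns into a single column.

    rows is a list of equal-length bit lists, one per proposition.
    """
    columns = list(zip(*rows))
    kept = [
        column
        for index, column in enumerate(columns)
        if index == 0 or column != columns[index - 1]
    ]
    # Transpose back so that each proposition is a row again.
    return [list(row) for row in zip(*kept)]


# The input data from Figure 3.4.
p1 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p2 = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(compress_bit_array([p1, p2]))  # [[1, 1, 0], [0, 1, 1]]
```

Only the order in which the propositions change state survives the compression, which is why durations are lost while the Allen relations are retained.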
collision(agent, box)             0001
distance-decreasing(agent, box)   0110
forward(agent)                    1100
speed-decreasing(agent)           0010
Figure 3.5: The CBA representation for the approach example in Figure 3.2
It is important to note that the CBA representation conserves all of the Allen
relations present in the original PMTS. There is a direct mapping from the CBA
representation to the table representation outlined earlier, but we prefer the CBA
representation because it is easier to visualize the interactions between the
propositions. The compressed bit array can be used to represent relationships between an
arbitrary number of fluents, known as the order of the CBA. CBAs generated from
pairs of fluents correspond directly with the Allen relations, and for simplicity, we
write these CBAs with the corresponding Allen relations.
In the previous discussion we described how to represent relations between
fluents, but not how to construct sequences from these relations. Now we present an
algorithm for generating an order k relational sequence from a PMTS composed
of the fluents F. Here we define k to be the number of fluents that participate in
each relationship, varying from $2, \ldots, |F|$. The first step is to enumerate all of the
k-combinations of fluents, of which there are $\binom{|F|}{k}$. A k-combination of fluents
from F is a set of k distinct fluents from F. Next the fluents in each k-combination are sorted
in order to generate a canonical representation of the CBA. Ordered fluents, also
end up with a sequence with one less relation. The interaction window applies to
all of the fluents within a CBA, and if any fluent is further than the interaction window
from all of the other fluents, then the CBA is disregarded.
Qualitative sequences reduce multivariate propositional time series to one-dimensional
sequences of relations between propositions. Sequences lose information
about the duration of the individual intervals, but retain information about the
start and end times of each interval relative to another, and preserve the temporal
order of CBAs. Assuming no pruning from an interaction window, the maximum
size of a sequence of order k CBAs generated from an episode containing n relations
is $\binom{n}{k}$. Furthermore, if an episode has j propositions, then the alphabet size for the
symbols in the sequence is at least $7(j^2 - j)$. It can be more if a proposition turns
on and off multiple times.
3.2 Real-Valued Variables
Variables within a time series often take on real values. Two such examples, common
to the types of activities we are interested in, are the distance between two moving
objects and the speed of an agent. In this dissertation, real-valued time series are
converted into collections of propositional time series. We will illustrate how this is
done with the time series called speed in Figure 3.6.
3.2.1 Preparation
We assume that there is error in the observations for our time series. This error could
come from faulty sensor readings on a robot or inaccurate physics approximations
in a simulated environment. If necessary, we first smooth the time series. Let X be
a univariate time series and let $x_i$ be one of the values of X. The smoothing process
works by modifying each value $x_i \in X$, on the assumption that $x_i$ has an error
component that can be reduced by replacing $x_i$ by a weighted average of $x_i$ and its
neighbors. The effect of the moving average smoothing technique is reduced error,
so that sensor readings more closely approximate a smooth function. Although there are
time series speed and three different values of k: 5, 25, and 100.
[Plot: Speed versus Time; legend: Original, k=5, k=25, k=100]
Figure 3.7: The effects of the moving averages smoother on the original time series
for speed with three separate values of k.
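A simple moving average smoother of the kind shown in Figure 3.7 might look like the sketch below. The details are assumptions made for illustration: we take k to be the width of a centered window, weight every value in the window equally, and shrink the window at the ends of the series. The smoother used to produce the figure may differ.

```python
def moving_average(series, k):
    """Smooth a series by averaging each value with its neighbors.

    k is taken to be the width of a centered window (an assumption);
    the window is truncated where it runs past either end of the series.
    """
    half = k // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        window = series[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


noisy = [0, 3, 0, 3, 0]
print(moving_average(noisy, 3))  # [1.5, 1.0, 2.0, 1.0, 1.5]
```

Larger values of k average over more neighbors, which removes more noise but also flattens genuine features of the series.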
The second step in the preparation of each univariate time series is
standardization. To standardize a time series, we first need to find the mean, $\bar{x}$, and the
standard deviation, $s$, of the time series. We generate a new time series Z, such that
for each data point in the original time series we subtract the mean and divide by
the standard deviation:
$$z_i = \frac{x_i - \bar{x}}{s}$$
Each value in the resulting time-series is known as a z-score and it indicates how
many standard deviations above or below the mean the original observation was.
The resulting time-series will have a mean of zero and a standard deviation of one.
Standardizing time series does not change their shape. If two series do not look
like each other then standardizing them will not make them similar, and in this sense
standardizing cannot do any harm, but standardizing does remove differences due
to units of measurement. Standardizing two series makes them have the same mean,
and expresses each in standard deviation units. For instance, if X and Y are two
series and $x_i = m y_i$ for some constant $m > 0$ and all $x_i$ and $y_i$, then
standardizing X and Y makes them identical.
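The standardization step, and the claim that rescaling by a positive constant leaves the z-scores unchanged, can be checked with a short sketch. We use the sample standard deviation from Python's statistics module; whether s denotes the sample or the population standard deviation is an assumption here, and the function name is hypothetical.

```python
from statistics import mean, stdev


def standardize(series):
    """Convert a series to z-scores: subtract the mean, divide by the std. dev."""
    x_bar = mean(series)
    s = stdev(series)  # sample standard deviation (assumed)
    return [(x - x_bar) / s for x in series]


y = [1.0, 2.0, 4.0, 7.0]
x = [3.0 * value for value in y]  # same shape, different units (m = 3)

z_x = standardize(x)
z_y = standardize(y)
# The two standardized series agree up to floating-point error.
print(all(abs(a - b) < 1e-9 for a, b in zip(z_x, z_y)))
```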
3.2.2 Symbolic Conversion
Lin et al. (2007) present an algorithm, called SAX, for converting a univariate
real-valued time series into a symbolic sequence. SAX produces a sequence of symbols,
S, given a series of z-scores, Z, obtained by standardizing a series of reals, R. The
distribution of values in Z is assumed to be Gaussian, or $Z \sim N(0, 1)$. The claim
by the SAX authors that all time series have a Gaussian distribution is clearly
false, and it is unknown what effects this incorrect assumption has on performance.
Nevertheless, SAX provides a convenient method to convert real-valued time series
into symbols and overall performance is good, as seen in Chapters 5 and 6.
The SAX algorithm maps real values within intervals to symbols that identify
the intervals. The number of unique symbols generated by the SAX algorithm is
controlled by the parameter a. The value of a determines the breakpoints
that divide the Gaussian distribution into a equally-sized areas. The assumption
of a Gaussian distribution allows us to determine the values of the breakpoints by
looking them up in a statistics textbook or in Table 3.2. We replace all of the
values in the standardized time series smaller than the smallest breakpoint with the
symbol "a". Next we replace all of the remaining values that are smaller than
the second smallest breakpoint with "b", and so forth until all numbers are replaced
with symbols. The resulting SAX sequence is shown at the top of Figure 3.8 for
a = 3, and the bottom half presents the original time series with the breakpoints in
place.
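The breakpoint lookup and symbol assignment just described can be sketched as follows. Rather than reading the breakpoints from a table, this sketch computes them from the inverse of the standard normal distribution function using Python's statistics.NormalDist. The function names are hypothetical, and, in keeping with our departure from the original algorithm, no dimensionality reduction is performed.

```python
from bisect import bisect_right
from statistics import NormalDist
from string import ascii_lowercase


def sax_breakpoints(a):
    """The a - 1 breakpoints that cut N(0, 1) into a equally probable areas."""
    return [NormalDist().inv_cdf(i / a) for i in range(1, a)]


def sax_symbols(z_scores, a):
    """Map each z-score to the letter naming the interval it falls in."""
    breakpoints = sax_breakpoints(a)
    # bisect_right counts the breakpoints that are <= z, which is the
    # index of the interval (and therefore of the symbol) containing z.
    return "".join(ascii_lowercase[bisect_right(breakpoints, z)] for z in z_scores)


print([round(b, 2) for b in sax_breakpoints(3)])  # [-0.43, 0.43]
print(sax_symbols([-1.2, -0.1, 0.2, 0.9, 0.5], 3))  # abbcc
```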
The authors of the algorithm further show that the distance between two sym-
bolic sequences generated using SAX provides a lower bound on the distance between
the original two time series. This proof is an extension of the one that was generated
for the authors' dimensionality reduction technique called PAA. We depart from the
original SAX algorithm at this point since we do not perform any dimensionality
reduction on our time series. Dimensionality reduction makes sense when all of the
of symbolic values. So, we create three new propositional variables in the PMTS,
one for each variable and transition category label (up, down, stable). The new
variables are true at every time point that they occur in the symbolic sequence.
Figure 3.9 demonstrates the areas within the original speed time series that would
be encoded with the same symbol. The light gray regions represent periods of time
where the first derivative is stable, darker gray regions represent periods of time
where the first derivative is less than zero, and the darkest gray regions represent
periods of time where the first derivative is greater than zero.
[Plot: Speed versus Time, with shaded regions marking the SDL symbols]
Figure 3.9: The time series for speed highlighted to show the symbols generated
through the conversion to our SDL.
The propositions constructed from the SDL conversion process aid performance
on the tasks outlined in Chapter 1. An analysis is included in Chapter 6.
3.3 Wrapping Up
In this chapter we outlined our qualitative sequence representation for propositional
multivariate time series. We provided additional algorithms for converting real-
valued univariate time series into multivariate propositional time series. We can
repeat this process for each real-valued variable within the original MTS. Depending
on the selected SAX alphabet size, a, each real-valued variable will result in a + 3
new propositional variables in the final PMTS. Moving forward we explore how the
sequences will be used for recognizing activities.
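The count of a + 3 propositional variables per real-valued variable can be illustrated with a small sketch. The naming scheme for the new propositions, the function name, and the example symbol sequences are all hypothetical; only the idea that each symbol becomes a proposition that is true wherever that symbol occurs is taken from the text.

```python
def propositionalize(name, symbols, labels):
    """Create one boolean time series per label, true where the label occurs."""
    return {
        f"{name}-{label}": [symbol == label for symbol in symbols]
        for label in labels
    }


# Hypothetical symbolic encodings of a five-step speed series.
sax_sequence = "abbcc"                                  # SAX with a = 3
sdl_sequence = ["up", "up", "stable", "down", "down"]   # transition labels

pmts = {}
pmts.update(propositionalize("speed", sax_sequence, "abc"))
pmts.update(propositionalize("speed", sdl_sequence, ["up", "down", "stable"]))

print(len(pmts))         # 6 propositional variables, i.e. a + 3 with a = 3
print(pmts["speed-b"])   # [False, True, True, False, False]
```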
CHAPTER 4
LEARNING AND INFERENCE
Recall the purpose of this dissertation: to design, develop, and evaluate computational
algorithms that are able to recognize the behaviors of agents as they perform
and execute different activities. The previous chapter described the representations
of activities that we will be working with. In this chapter, we focus on the algorithms
that learn to recognize activities. The algorithms will be evaluated on two tasks:
classification and recognition. In the classification task, we are given an unlabeled
episode and must select the correct activity label, whereas in recognition we must
find episode boundaries and then determine which episode occurs, if one occurs at
all.
To aid in both of these tasks, we describe and build an aggregate structure, called
a signature, from episodes of an activity. Episodes of an activity are represented as
sequences of tuples containing a symbol and the fluents that generated the symbol.
At its core the signature relies on sequence alignment to find similar subsequences
between it and another sequence. First we describe the process used to perform
sequence alignment. Next we discuss a novel algorithm that builds signatures from
episodes of an activity. Finally, we discuss several applications of signatures, such
as visualization and online activity recognition.
4.1 Sequence Similarity
In Chapter 3 we described two families of representations, each composed of
sequences of tuples. Each tuple is an ordered list containing a label (symbol) and the
set of fluents that participate in the label. In this section we ignore the fluents and
focus on the labels. We would like to identify similarities between sequences that
encode episodes from the same activity; for example, we would like to identify the
similarities between the approach episodes described in Chapter 1. Furthermore, it
would be beneficial to use that measure of similarity to capture the distance from
one sequence to another. We would expect that episodes of the same activity have
small distances whereas two sequences from different activities would have larger
distances. A general solution to these tasks is to align the two sequences in such
a way as to maximize the overlap between them (de Carvalho Junior, 2002). We
choose to align the sequences because the tuples within the sequence are temporally
ordered and set intersection or simply counting the overlap would not take into
account this ordering.
We can easily visualize the alignment of two sequences by writing one sequence
on top of the other, as shown in Figure 4.1. The top sequence in Figure 4.1 is the
event sequence for the episode in Figure 1.3(c) and similarly the bottom sequence
is the event sequence for the episode in Figure 1.3(d). Spaces are inserted into the
top or bottom to break the sequences into smaller sequences so that the smaller
sequences have matching symbols. The resulting alignment ensures that the two
sequences are of equal length.
f(a)  dd(a,b)  dd(a,b2)  -      -      -      -         sd(a)  c(a,b)
 |       |        |                                       |
f(a)  dd(a,b)  dd(a,b2)  tr(a)  tl(a)  tr(a)  di(a,b2)  sd(a)  -
Figure 4.1: An alignment between two sequences.
The objective of sequence alignment is to match as many symbols within the
sequences as possible. In the example, four symbols match, each highlighted with
a vertical bar. Spaces inserted into the alignment are represented as dashes. These
spaces produce gaps in the sequences, yet they are necessary to produce a good
alignment between the two sequences. Although not present in Figure 4.1, symbols
are sometimes substituted for each other. Visually, this would correspond to
one symbol being on top of another without a bar linking them. We can envision
instances where substitution would be useful, say when an agent approaches two
different boxes in separate episodes of approach. In this case, the propositions would
almost completely match except for which box is being approached. Some of the
Algorithm 1: SequenceAlignment
Input: sequences X, Y with lengths m, n
Output: (m + 1) × (n + 1) similarity table S
begin
    S[0][0] ← 0
    for i = 1 to m do
        S[i][0] ← S[i-1][0] + ins(X[i])
    for j = 1 to n do
        S[0][j] ← S[0][j-1] + del(Y[j])
    for i = 1 to m do
        for j = 1 to n do
            diagonal ← S[i-1][j-1] + sub(X[i], Y[j])
            left ← S[i][j-1] + del(Y[j])
            up ← S[i-1][j] + ins(X[i])
            S[i][j] ← max(diagonal, left, up)
end
Algorithm 1 provides the details for constructing the table S. The functions
sub(i, j), ins(i), and del(j) are the costs of substituting symbols X[i] and Y[j],
inserting symbol X[i], and deleting Y[j], respectively.
The table S is filled row by row, from left to right. The first row and first column
are filled in according to the base conditions, and the recurrence relation is used
to fill the table one row at a time. The cell S[m, n] contains the optimal similarity
score for the two sequences X and Y. Table 4.1 contains the similarity table for the
                  0      1      2         3          4       5
                  ∅      f(a)   dd(a,b)   dd(a,b2)   sd(a)   c(a,b)
0   ∅             0      -1     -2        -3         -4      -5
1   f(a)         ↑ -1    ↖ 2    1         0          -1      -2
2   dd(a,b)      ↑ -2    ↑ 1    ↖ 4       3          2       1
3   dd(a,b2)     ↑ -3    ↑ 0    ↑ 3       ↖ 6        5       4
4   tr(a)        ↑ -4    ↑ -1   ↑ 2       ↑ 5        ↑ 4     ↑ 3
5   tl(a)        ↑ -5    ↑ -2   ↑ 1       ↑ 4        ↑ 3     ↑ 2
6   tr(a)        ↑ -6    ↑ -3   ↑ 0       ↑ 3        ↑ 2     ↑ 1
7   di(a,b2)     ↑ -7    ↑ -4   ↑ -1      ↑ 2        ↑ 1     ↑ 0
8   sd(a)        ↑ -8    ↑ -5   ↑ -2      ↑ 1        ↖ 4     3
Table 4.1: The similarity table S with arrows for tracing optimal alignments. The
optimal alignment is highlighted by the light gray squares.
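The construction of the similarity table can be sketched in Python as follows. The scoring values are assumptions inferred from the entries of Table 4.1 rather than stated parameters: a match scores +2, an insertion or deletion scores -1, and a mismatched substitution scores -2 so that it is never preferred over two gaps. The function name is hypothetical.

```python
def similarity_table(x, y, match=2, mismatch=-2, gap=-1):
    """Fill the (m + 1) x (n + 1) similarity table for sequences x and y."""
    m, n = len(x), len(y)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    # Base conditions: aligning a prefix against the empty sequence.
    for i in range(1, m + 1):
        s[i][0] = s[i - 1][0] + gap
    for j in range(1, n + 1):
        s[0][j] = s[0][j - 1] + gap
    # Recurrence: best of substituting, or adding a gap to either sequence.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            s[i][j] = max(
                s[i - 1][j - 1] + sub,  # diagonal
                s[i][j - 1] + gap,      # left
                s[i - 1][j] + gap,      # up
            )
    return s


rows = ["f(a)", "dd(a,b)", "dd(a,b2)", "tr(a)", "tl(a)", "tr(a)", "di(a,b2)", "sd(a)"]
cols = ["f(a)", "dd(a,b)", "dd(a,b2)", "sd(a)", "c(a,b)"]
table = similarity_table(rows, cols)
print(table[len(rows)][len(cols)])  # 3, the bottom-right cell of Table 4.1
```

Under these assumed scores the final row of the computed table is -8, -5, -2, 1, 4, 3, which agrees with the last row of Table 4.1.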
Symbols                            Weight
forward(agent)                     4
distance-decreasing(agent,box)     4
distance-decreasing(agent,box2)    3
distance-stable(agent,box2)        1
turn-right(agent)                  1
turn-left(agent)                   1
turn-right(agent)                  1
distance-increasing(agent,box2)    1
speed-decreasing(agent)            4
distance-increasing(agent,box2)    1
collision(agent,box)               3
Table 4.2: The signature constructed from the first four examples in Figure 1.3.
inserted into Sc at the location selected by the alignment algorithm.
Because the process of updating signatures does not remove anything from any
of the sequences in S, signatures become packed with large numbers of symbols
(propositions or Allen relations depending on the representation) that occur very
infrequently, and thus have low weights. Heuristics help reduce the number of low
frequency symbols occurring in each signature. We present one simple heuristic to
clean up the signature: after updating the signature t times, all of the relations
in the signature with weights less than or equal to n are removed. For example,
we set t = 10 and n = 3, meaning that the signature is pruned of all relations
occurring fewer than 4 times after a total of 10 training episodes. The signature is
again pruned after 20 training episodes, and so forth. The effects of pruning are
explored in Chapter 6.
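The pruning heuristic can be sketched as follows, under the assumption that a signature is stored as an ordered list of (symbol, weight) pairs. That representation and the function names are hypothetical; the example weights are those of Table 4.2.

```python
def prune(signature, n):
    """Remove every entry whose weight is less than or equal to n."""
    return [(symbol, weight) for symbol, weight in signature if weight > n]


def prune_if_due(signature, updates, t=10, n=3):
    """Prune only after every t-th update of the signature."""
    if updates > 0 and updates % t == 0:
        return prune(signature, n)
    return signature


signature = [
    ("forward(agent)", 4),
    ("distance-decreasing(agent,box)", 4),
    ("distance-decreasing(agent,box2)", 3),
    ("distance-stable(agent,box2)", 1),
    ("turn-right(agent)", 1),
    ("turn-left(agent)", 1),
    ("turn-right(agent)", 1),
    ("distance-increasing(agent,box2)", 1),
    ("speed-decreasing(agent)", 4),
    ("distance-increasing(agent,box2)", 1),
    ("collision(agent,box)", 3),
]

print(prune_if_due(signature, updates=9))   # not yet due: unchanged
print(prune_if_due(signature, updates=10))  # only the weight-4 entries remain
```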
In Table 4.4 we present signatures built from our examples of approach. In-
frequently occurring symbols, or symbols seen fewer than three times, have been
pruned from each of the signatures. The smallest signature is built from start event
sequences, and the longest signature is built from sequences composed of both start
and end events. In Figure 1.3 each example of approach has three red fluents
corresponding to a shared pattern between the examples. This pattern is preserved in
fluents become important now as we discuss how to determine the heat index of
a fluent. Consider the N tuples in the sequence $S_i$ that reference the fluent f.
When $S_i$ is aligned with a signature $S_c$, each of these tuples will either
match a symbol in $S_c$ or it will not. The heat index for a fluent f is the sum of the
weights of the symbols in $S_c$ that are matched by these tuples. The heat indexes
for each interval in the original activity are normalized by the largest heat index,
ensuring that heat index values range from zero to one. The heat index of an interval
determines its color in the heat map.
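The heat index computation can be sketched as follows. We assume the alignment has already been performed and is summarized as a list of pairs, each holding the fluents referenced by one tuple of the sequence and the weight of the signature symbol it matched (zero when it matched nothing). This input format, the function name, and the example values are hypothetical.

```python
from collections import defaultdict


def heat_indexes(aligned_tuples):
    """Sum matched signature weights per fluent, then normalize to [0, 1]."""
    totals = defaultdict(float)
    for fluents, matched_weight in aligned_tuples:
        for fluent in fluents:
            totals[fluent] += matched_weight
    largest = max(totals.values(), default=0.0)
    if largest == 0:
        return {fluent: 0.0 for fluent in totals}
    return {fluent: total / largest for fluent, total in totals.items()}


# Hypothetical alignment result: (fluents in the tuple, matched weight).
aligned = [
    (("f(a)", "dd(a,b)"), 5),
    (("f(a)", "sd(a)"), 5),
    (("dd(a,b)", "c(a,b)"), 3),
    (("tr(a)", "f(a)"), 0),  # this tuple matched nothing in the signature
]
heat = heat_indexes(aligned)
print(heat["f(a)"], heat["dd(a,b)"], heat["tr(a)"])  # 1.0 0.8 0.0
```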
Signature                   Sequence
(f(a) c dd(a,b2))     3     (f(a) o dd(a,b))
(f(a) o dd(a,b))      5     (f(a) m sd(a))
(f(a) m sd(a))        5     (dd(a,b) f sd(a))
(dd(a,b) f sd(a))     5     (f(a) b c(a,b))
(dd(a,b) m c(a,b))    3     (dd(a,b) m c(a,b))
(sd(a) m c(a,b))      3     (sd(a) m c(a,b))
Figure 4.2: A heat map generated from the approach signature trained on Allen
relations and the Allen sequence for episode (a) in Figure 1.3. The heat index for
each fluent is written on the fluent.
To illustrate why this is a good visualization technique, we present five heat maps,
one for each sequence representation (event and relational) on a single example of
approach. Each qualitative sequence is derived from the approach example shown in
Figure 1.3(e), and the signatures for each representation come from Table 4.4. We
chose this example because it highlights differences in the signatures that may not
have been apparent before, specifically how each signature handles the proposition
dd(a,b). This particular proposition is crucial to the activity of approaching a box.
In Figures 1.3(a)-(d), the proposition dd(a,b) only becomes true once, meaning
that once the agent starts towards the box, it does not falter. In Figure 1.3(e) it
becomes true twice, because the agent must navigate around a wall that is blocking
cannot perceive the motor commands, emotional state, and intentional states of
agent2.
Our approach to inferring unobservable propositions is to have agents learn
signatures of their own behaviors, in which these propositions are not hidden. An agent
therefore has access to both observable aspects of its activities, such as location and
motion, and private aspects, such as intentions, emotional states, and motor commands.
Then, when an agent observes another's behavior, it matches the states of
observable propositions to signatures of its own behavior, and uses these to infer the states
of unobservable propositions in the other's behavior.
To illustrate, assume that the signatures in Table 4.4 were learned by an agent
and the observed behavior of another agent does not include proposition f(a) since
it corresponds to a motor command. Focusing on the signature from Allen relations,
the first agent would infer that the following hidden relations containing f(a) must
also be true: (f(a) contains dd(a,b2)), (f(a) overlaps dd(a,b)), and (f(a)
meets sd(a)).
In general, signatures can contain many tuples constructed from hidden proposi-
tions. The most frequent of these tuples are the most likely to occur when observing
other agents perform the same activity that the signature is trained on. Therefore,
our agent selects the most frequently occurring tuples with symbols constructed
from hidden propositions to be the inferred hidden state. In the previous example,
if we were to keep only the two most frequent tuples, then we would remove the
relation that occurs the least, in this case (f(a) contains dd(a,b2)).
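The selection of inferred hidden tuples can be sketched as follows. We assume that a signature over Allen relations is stored as a list of ((proposition, relation, proposition), weight) pairs and that the number of tuples to keep is passed as a parameter named limit; the representation, the parameter name, and the function name are hypothetical, and the weights are taken to be those shown in Figure 4.2 purely for illustration.

```python
def infer_hidden_relations(signature, hidden, limit):
    """Return the limit most frequent relations that mention a hidden proposition."""
    candidates = [
        (relation, weight)
        for relation, weight in signature
        if relation[0] in hidden or relation[2] in hidden
    ]
    # Highest weight first; ties keep their order in the signature.
    candidates.sort(key=lambda entry: entry[1], reverse=True)
    return [relation for relation, _ in candidates[:limit]]


signature = [
    (("f(a)", "contains", "dd(a,b2)"), 3),
    (("f(a)", "overlaps", "dd(a,b)"), 5),
    (("f(a)", "meets", "sd(a)"), 5),
    (("dd(a,b)", "finishes", "sd(a)"), 5),
    (("dd(a,b)", "meets", "c(a,b)"), 3),
    (("sd(a)", "meets", "c(a,b)"), 3),
]

inferred = infer_hidden_relations(signature, hidden={"f(a)"}, limit=2)
print(inferred)  # the overlaps and meets relations; contains is dropped
```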
4.6 Wrapping Up
In this chapter we described signatures, our aggregate structure that captures the
most frequently occurring patterns in the episodes provided as training data. We
rely on sequence alignment to find matching symbols in sequences generated from
episodes in order to construct this structure. Signatures can be used to generate
heat maps, which are a convenient way to visualize the overlap between an activity
and a specific episode. In online recognition of activities, signatures can be
employed to select important propositions that help generalize FSM recognizers
constructed from training data. Finally, signatures contain the internal state that
can be projected onto other agents when they perform an activity.
             Num Examples   Time     Num Fluents   Allen      CBA
approach     25             343.40   34.84         572.72     7,175.68
jump-on      20             436.90   80.45         1,356.70   27,531.75
jump-over    37             350.14   51.35         1,156.51   17,211.32
left         25             615.88   95.24         3,228.16   55,928.96
push         25             344.84   99.40         3,802.20   93,087.28
right        25             629.68   95.04         3,276.44   61,431.36
Table 5.1: Average values for the episodes in the Wubble World dataset.
5.1.2 Wubble World 2D
Like Wubble World, Wubble World 2D (ww2d) is a virtual environment with simulated
physics. The purpose of ww2d is to address some of the shortcomings of Wubble
World, specifically the wubbles' lack of cognitive and emotional systems. The agents
in ww2d are distinguished from wubbles because of these additional systems. Wubble
World 2D was also designed to allow us to quickly create unique episodes for
individual behaviors inspired by the original movies that were part of the research
conducted by Heider and Simmel (1944). A screenshot of the Wubble World
2D simulator is shown in Figure 5.1.
Figure 5.1: Screenshot of the Wubble World 2D simulator. The agent is the black
and red circle, and it can interact with food (the Red Cross symbol) and the soccer ball.
As in the original Wubble World, all of the interactions for an agent are recorded
for post-hoc analysis. Each agent has its own unique view of the world based on its
perceptual system. In Wubble World, wubbles had a global view of the world, but
in ww2d the agents have an egocentric view of the world, meaning that if the agent
cannot sense an object, nothing is recorded. An egocentric view of the world helps
focus the agent's attention on the things within its sphere of influence and reduces
the number of variables recorded at any one time. At every time step we record
the current position, speed and heading of an agent, as well as the internal state of
the agent consisting of its energy level, arousal, valence, active goal, and the active
state of the executing FSM. For every other agent or object within our sensing area
we record the relative position, relative velocity, distance, whether or not there is
currently a collision between the agent and the object, and whether the object was
seen, smelt, or heard.
We collected a dataset of episodes of agents in ww2d performing several kinds
of activities: chasing another agent, fleeing from an aggressor, fighting with another
agent, kicking a ball, kicking a static object, and eating food to gain energy. Episodes
are generated automatically by limiting the active goals of the agent to elicit the
types of behavior we expected. Each episode was unique, in that the objects started
in random locations and wandered for different amounts of time until they found an
object of interest. We recorded 20 episodes for each activity. Table 5.2 contains the average
statistics by activity for the episodes in the ww2d dataset. Despite being 2D, the
interactions and examples of activities were much more complex than activities in
the original Wubble World. Variables are sampled every 1/80th of a second in
ww2d, much more frequently than in the Wubble World dataset. Each activity in
ww2d takes roughly the same amount of time to complete and more time is spent
performing an activity in ww2d than the activities from the Wubble World dataset.
The number of fluents varies across activities: the simplest activity, in terms of
the average number of fluents, is eating, and the most complex is
two agents fighting. The last two columns contain the average number of tuples
in the Allen sequence and the CBA sequence respectively. The length of the Allen
activities as approach is not that surprising since it is very similar to the other
activities. Additionally the signature for the approach activity is shorter than the
signatures for the other activities. This means that most episodes align well with
the signature for approach and since the distance function attempts to maximize the
amount of weight accounted for by the sequence, approach tends to do well.
                                 Observed
Predicted    approach   jump-on   jump-over   left   push   right
approach     25         0         0           0      0      0
jump-on      17         3         0           0      0      0
jump-over    9          0         28          0      0      0
left         19         0         0           6      0      0
push         21         0         0           0      4      0
right        11         0         0           0      0      14
Table 5.4: The confusion matrix for the CAVE classifier on start event sequences
from the ww data.
The CAVE performance on Allen sequences is much higher than the others, and
we shall see that this remains consistent for most of the datasets. We will have more
to say on this in Chapter 6.
5.2.2 Wubble World 2D
In this section we present the performance of the CAVE algorithm and the k-NN
classifier as the average number of correctly classified episodes in a 10-fold cross
validation across 120 episodes and six classes, shown in Table 5.5.
As before, the k-NN classifier performs as well as or better than the CAVE
classifier on all representations of the Wubble World 2D data, but this time
all representations and classifiers perform very well on the ww2d dataset.
This is a surprising result considering that earlier we argued that the ww2d dataset
is more complex. We explore why performance is so high in the following section.
Figure 5.3: Learning curve of the CAVE algorithm generated by presenting labeled
training instances one at a time to corresponding signatures.
labels. After seeing 24 episodes (on average only four episodes per activity) the
CAVE agent is able to classify almost 70% of the test set correctly. This suggests
that the learned signatures quickly identify relations within the training episodes
that allow the algorithm to correctly classify activities.
5.3 Heat Maps
In Chapter 4, we presented a mechanism to visualize signatures, called heat maps.
The heat map representation highlights parts of an episode that align with a
signature. Additionally, heat maps provide a way to visually illustrate differences
between episodes with the same or different activity labels. Here we present three
heat maps. Heat indexes are determined from a signature trained on relational
sequences of jump over episodes. We generated a heat map for a jump over episode,
a jump on episode, and an approach episode. In each heat map, time runs from left
to right and darker fluents are aligned with higher-frequency tuples in the signature.
Figure 5.4(a) contains the heat map for the jump over episode. Looking at
the darker intervals as we move from left to right, we see that the wubble starts
out on the floor. From here the wubble begins moving forward while box0 is in
front of it. After some period of time, the wubble jumps, as indicated by the
proposition Jump(wubble). This results in the wubble moving upwards, towards
are shared with the signature for a class.
The final heat map comes from an approach activity, shown in Figure 5.5. In
this example, there is far less overlap than in the previous examples since approach
differs more from jump over than jump over differs from jump on.
Most of the overlap between the jump over signature and approach sequence
corresponds to the motion of the wubble, i.e. Forward(wubble) and Motion(wubble).
This overlap occurs because both activities require that the wubble move forward.
There is one surprise in Figure 5.5 though, and it comes from the proposition
Towards(wubble,box4). This proposition is semantically unimportant to either
activity but is highlighted by a large heat index because enough of the examples
of jump over approach box4 while jumping over box0. Additional training data in
which the wubble is not approaching box4 while performing jump over will reduce
the heat index for this fluent.
Figure 5.5: The heat map for an approach episode aligned with the signature for
jump over.
All of the example heat maps presented in this section were taken from the
Wubble World dataset. Although we would like to show examples from ww2d, the
images are too large to fit onto a single page, and shrinking them to fit would render the
text unreadable, making any meaningful interpretation impossible to convey. One
such meaningful interpretation is that the heat maps for ww2d consistently highlight
the internal state of the agent, such as the agent's goals, as the most frequently
aligned fluents.
5.4 Recognition
In Chapter 4, we introduced a way to recognize activities as they occur by constructing
a finite state machine (FSM) with states and transitions that match the training
data. Signatures provide a way to selectively ignore some fluents and propositions
and allow the FSM to generalize to unseen episodes of an activity. We present
an experiment in which the FSMs described in Section 4.4.1 are used to recognize
activities. The experiment demonstrates that the FSMs from Section 4.4.1 have
higher recognition performance than FSMs induced directly from the training data.
The recognition task involves building a recognizer for each activity from training
episodes. The remaining episodes, which are not part of the training set, become test
episodes, and are "played" back to each FSM recognizer. Playing an episode involves
constructing the appropriate state for each moment in time during the episode and
updating the FSM accordingly. The recognizer either accepts or rejects the test
episode. We measure three different values: true positives, false positives, and
false negatives. A true positive (tp) occurs when a recognizer correctly accepts an
episode. A false positive (fp) occurs when a recognizer accepts an episode with a
different activity label, and a false negative (fn) occurs when the recognizer
incorrectly rejects an episode. These values are used to compute precision, recall,
and the F-measure. The precision of a recognizer is the number of episodes correctly
identified out of the total number of episodes accepted by the recognizer, and is
given by the formula:
\[ \text{Precision} = \frac{tp}{tp + fp}. \]
A recognizer that accepts only episodes from its own class will have a precision
of 1. A recognizer that accepts many more episodes than it should will have a
lower precision; precision ranges from 0 to 1. The recall of a recognizer is the
number of episodes correctly identified out of the total number of episodes with
the same class label, given by the formula:
\[ \text{Recall} = \frac{tp}{tp + fn}. \]
Intuitively, a recognizer can achieve high recall by accepting every test episode,
in the hope that each shares the recognizer's class label, but this will reduce
the precision of the recognizer. Like precision, recall ranges from 0 to 1. Last is
the F-measure (van Rijsbergen, 1979), which is the harmonic mean of precision and
recall and also ranges from 0 to 1:
\[ \text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \]
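The three measures can be illustrated with a minimal Python sketch. The function name and the layout of the results (a list of label and accept/reject pairs for one recognizer) are assumptions made for illustration and are not taken from the experimental code. Returning zero for the F-measure when both precision and recall are zero is likewise an assumed convention.

```python
def recognizer_scores(results, recognizer_label):
    """Score one recognizer on played-back test episodes.

    results: list of (episode_label, accepted) pairs (hypothetical layout).
    Returns (precision, recall, f_measure), or None when undefined.
    """
    tp = sum(1 for label, accepted in results
             if accepted and label == recognizer_label)
    fp = sum(1 for label, accepted in results
             if accepted and label != recognizer_label)
    fn = sum(1 for label, accepted in results
             if not accepted and label == recognizer_label)

    # Nothing accepted, or no test episodes of this class: scores are undefined.
    if tp + fp == 0 or tp + fn == 0:
        return None

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0:
        return precision, recall, 0.0  # assumed convention, see lead-in
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# A recognizer that accepts every test episode: 2 of its own class, 6 others.
accept_all = [("approach", True)] * 2 + [("push", True)] * 6
precision, recall, f_measure = recognizer_scores(accept_all, "approach")
print(precision, recall, round(f_measure, 3))
```

In this made-up example the recognizer accepts everything, so recall is 1 while precision is only 0.25, and the harmonic mean pulls the F-measure down to 0.4.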
Similar to classification, we perform a K-fold cross validation on the training
data from ww and ww2d. This time we choose three values of K in order to vary
the amount of training data, and thus affect what is learned by the signature. We
exclude from the signature everything not seen in at least 80% of the training
data, so if a signature is trained on 20 episodes, any tuples seen fewer than
sixteen times are excluded from alignment. This is a very aggressive exclusion
rate, but by pruning most of the lower frequency tuples from the signature, we end
up with a more general FSM recognizer.
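The exclusion rule can be sketched as follows. This is only an illustration of the threshold arithmetic: it assumes each training episode is available as a collection of hashable tuples and counts the episodes in which a tuple appears, whereas the actual signature accumulates its counts through alignment. The function name and the `min_percent` parameter are hypothetical.

```python
from collections import Counter


def frequent_tuples(episodes, min_percent=80):
    """Return the tuples seen in at least min_percent of the episodes.

    episodes: list of iterables of hashable tuples (hypothetical layout).
    """
    n = len(episodes)
    # Integer ceiling of n * min_percent / 100, avoiding float rounding.
    min_count = (n * min_percent + 99) // 100

    counts = Counter()
    for episode in episodes:
        counts.update(set(episode))  # count each tuple once per episode

    return {item for item, count in counts.items() if count >= min_count}


# 20 toy episodes: "A" occurs in all 20, "B" in 16, "C" in 15.
episodes = [
    {"A"} | ({"B"} if i < 16 else set()) | ({"C"} if i < 15 else set())
    for i in range(20)
]
print(sorted(frequent_tuples(episodes)))
```

With 20 episodes the threshold is sixteen, so the tuple seen fifteen times is dropped while the one seen exactly sixteen times is kept.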
We found that, for the most part, none of the recognizers built from just the
training data ever accepted an episode that it had not been trained on. The
exception is the FSM for approach, because it is a simple activity that occurs as
a component of the other activities. When a recognizer does not accept a single
episode from the test set, we cannot calculate its precision, recall, or
F-measure. So, instead we focus on the results generated by the signature based
FSMs, presented in Figure 5.6. On the left hand side are the plots containing the
F-measures for 2-fold, 6-fold, and 10-fold cross validations.
Overall it looks as though signatures constructed from event sequences ordered by
end times of the fluents are the best at pruning the FSM recognizer, but signatures
trained on sequences of Allen relations have high F-measures across most of the
activities. One consistent activity across all the Wubble World data is approach. The
F-measure consistently seems to be between 0.2 and 0.4 regardless of the sequence
representation. Recall that the F-measure is the harmonic mean of precision and
recall. The recall for the approach FSM is consistently high, but the precision is
in the proper order from sequences in which they have been removed. This is a
precision test.
                       ww                ww2d
                   M        SD       M        SD
Allen Sequence   95.40%    5.17    70.83%   13.18
Table 5.7: Classification performance when some of the propositions are
unobservable.
Table 5.7 contains the classification accuracy of the CAVE classifier on a 10-fold
cross-validation of the ww and ww2d datasets. Prior to testing, each episode was
stripped of unobservable propositions. In ww, the unobservable propositions
correspond to motor commands such as Jump(wubble) and Forward(wubble), and in
ww2d the unobservable propositions correspond to the goals of the agent as well as
the internal affective state of the agent, i.e. valence(agent), energy(agent) and
goal(agent). The performance on the ww dataset is relatively unaffected by the
absence of internal propositions, but on the ww2d dataset the drop is much larger.
The signatures learned from ww2d rely less on the environment variables, such
as positions and distances, and contain many more unobservable propositions. Even
though performance is affected, CAVE still performs above 70% accuracy.
The second part of the experiment tests whether the inferred relations from the
signature can correctly capture the affective state, assuming that the episode has
been classified correctly. We again use the cross-validation experiment design. We
build signatures for each activity in each fold when all propositions are observable.
From each signature we select the = 10 most frequent relations that contain at
least one of the unobservable propositions as the inferred relations. We also preserve
the order that relations occur within the signature so that the inferred relations can
be treated just like a qualitative sequence. An example from the activity chase
is shown in Table 5.8. We can see that a large number of the inferred relations
correspond to the current goals and states of the agent. Some subset of the inferred
relations occur in each of the test sequences, so we measure the alignment between
                               Signatures
             approach   push   jump on   jump over   left   right
approach       0.62     0.19    0.18       0.06      0.30   0.34
push           0.51     1       0.17       0.14      0.35   0.42
jump on        0.48     0.30    0.90       0.85      0.36   0.34
jump over      0.44     0.46    0.68       1         0.36   0.32
left           0.34     0.54    0.19       0.20      0.99   0.98
right          0.32     0.58    0.18       0.20      0.98   0.99
Table 5.9: A matrix showing the percent overlap between the most frequent hidden
relations in the signature and the hidden relations that exist in the test set, but are
not observable in the ww dataset.
another agent, it can select the most frequent relations as inferred hidden relations,
and they will be correct with high probability.
                        Signatures
          column   ball   chase   flee   fight   eat
column     0.97    0.81   0.00    0.01   0.23    0.13
ball       0.18    0.97   0.00    0.01   0.16    0.13
chase      0.01    0.07   0.98    0.05   0.04    0.02
flee       0.05    0.17   0       0.97   0.06    0.11
fight      0.13    0.19   0.63    0.06   0.96    0.19
eat        0       0.02   0       0.03   0       0.97
Table 5.10: A matrix showing the percent overlap between the most frequent
hidden relations in the signature and the hidden relations that exist in the test set,
but are not observable for the ww2d dataset.
5.6 Wrapping Up
In this chapter we focused on two datasets generated from virtual worlds. We
showed that the CAVE algorithm can learn to classify and recognize activities with
high accuracy in both of these domains. The question remains, though, whether this
is due to the representations and algorithms or to the way that we encoded
the sensors in the virtual worlds. In the next chapter, we explore this question
more thoroughly, and argue that the performance is due to the representations and
algorithms and is not specific to these virtual worlds.
              HW1       HW2       HW3
Classes        26        26        26
Examples    21.00     23.00     21.96
Time        69.61    124.58     66.59
Fluents     33.02     38.04     32.52
Allen      304.76    389.23    310.35
CBA        930.03    954.30    931.09
Table 6.1: Average values for the episodes in the handwriting datasets.
property that we looked for is that there is a difference between episodes of the
same activity, i.e. not every character written by a person is exactly the same.
Each of the datasets outlined here has all of these properties.
6.1.1 Handwriting
We created this dataset by collecting writing samples from three different subjects.
Each subject, identified as HW1, HW2, and HW3, contributed at least twenty
examples of each character in the alphabet. The data was collected from a Wacom
Intuos3 pen tablet with custom software that sampled the coordinate information
of the stylus at regular intervals. Position information cannot be recorded while the
pen is not in contact with the tablet, but we can record that the pen was picked
up and where it was placed down next. Stroke information is important during the
conversion from real-valued time series into our symbolic SDL language. Strokes
provide markers so that the first difference is calculated correctly on the dataset.
Table 6.1 contains the average number of examples and other specifics about the
dataset for each subject.¹
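One way to read the role of the stroke markers is that differences should be taken only between consecutive samples of the same stroke, so that a pen lift does not produce a spurious jump. The following minimal sketch illustrates that reading; the data layout (a list of strokes, each a list of (x, y) samples) and the function name are assumptions for illustration and do not describe the collection software.

```python
def stroke_first_differences(strokes):
    """Compute first differences separately within each stroke.

    strokes: list of strokes, each a list of (x, y) samples (hypothetical layout).
    Returns one list of (dx, dy) pairs per stroke.
    """
    per_stroke = []
    for stroke in strokes:
        # Pair each sample with its successor inside the same stroke only.
        diffs = [(x1 - x0, y1 - y0)
                 for (x0, y0), (x1, y1) in zip(stroke, stroke[1:])]
        per_stroke.append(diffs)
    return per_stroke


# Two strokes separated by a pen lift from (3, 3) to (10, 10).
strokes = [[(0, 0), (1, 2), (3, 3)], [(10, 10), (11, 10)]]
print(stroke_first_differences(strokes))
```

Because each stroke is differenced on its own, the displacement between the end of the first stroke and the start of the second never appears in the output.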
The training data for each subject is treated individually and we train signatures
for each character, for each subject. This way the system learns to classify and
recognize a specic subject's writing style.
¹ For HW3, we were unable to record one example of the character 'p' as the result
of user error during data gathering.
           Num Examples    Time     Num Fluents    Allen      CBA
abnormal        67         74.78      54.34        866.99   3,199.52
normal         133         87.95      62.95        917.53   3,361.83
Table 6.2: Average values for the episodes in the ecg dataset.

           Num Examples    Time     Num Fluents    Allen      CBA
abnormal       127        137.99      66.57        965.45   3,736.33
normal        1067        129.89      60.30        817.71   2,754.36
Table 6.3: Average values for the episodes in the wafer dataset.
6.1.2 ECG
The electrocardiogram (ecg) dataset contains measurements of cardiac electrical
activity as recorded from two electrodes at standardized locations on the body during
a single heartbeat, called lead0 and lead1 (Olszewski, 2001). The dataset was
analyzed by a domain expert, who assigned each episode a label of normal or
abnormal. Abnormal heartbeats are representative of a cardiac pathology known as
supraventricular premature beat. Full details on how the dataset was gathered can
be found in Olszewski (2001). Table 6.2 contains the averages for each label.
6.1.3 Wafer
The wafer (wafer) dataset is a collection of time series containing measurements
from vacuum-chamber sensors during the etching process applied to a silicon
wafer during the manufacture of semiconductor microelectronics (Olszewski, 2001).
Like the ecg dataset, each sample was analyzed by a domain expert and given a label
of normal or abnormal. The average values for different features of the dataset are
shown in Table 6.3. There are many more normal episodes than there are abnormal
episodes in this dataset, but we shall see that it does not impact performance.
6.1.4 Japanese Vowel
This dataset was generated from nine male speakers uttering two Japanese vowels
/ae/ successively (Kudo et al., 1999). The time series are of variable length,
ranging from 7 to 29 time steps, and each point is a vector of 12 real-valued
coefficients. There are 640 episodes in total across the nine speakers. Table 6.4
contains the average values for the episodes within this dataset. The episodes do
not last very long, on average only 15.56 time steps, yet they generate very long
sequences of Allen relations, on average containing over 8000 tuples. This is the
densest dataset that we have.

                 Japanese Vowel
Num. Classes               9
Num. Examples          71.11
Time                   15.56
Fluents               137.39
Allen               8,123.85
CBA               295,862.28
Table 6.4: Average values for the episodes in the Japanese vowel dataset.
6.1.5 Sign Language
The final dataset consists of samples of Australian Sign Language (Auslan)
signs (Kadous, 2002). There are 27 examples of each of 95 Auslan signs captured
from a native signer using high-quality position trackers and instrumented gloves.
All samples were generated by a single signer and were collected over a period of
nine weeks. A total of 2565 signs were collected with an average length of
approximately 57 frames. The magnetic position trackers recorded the (x, y, z)
location of the hand relative to a point slightly below the chin, as well as the
roll, pitch and yaw of the hand. The gloves recorded the bend position of each of
the fingers. Each time series consists of 22 variables, eleven for each hand. The
statistics for the episodes in this dataset can be seen in Table 6.5.
We were unable to build CAVE signatures from Allen sequences for each of the
classes in this dataset, or to perform classification with a k-NN classifier,
because we encountered memory limitations of the computers we were using to
perform the experiments. We have several ideas for how to overcome these
limitations, and they are discussed in more detail in Chapter 7. Since we were
unable to complete testing on this dataset, it is often left out of our analysis.

                       Auslan
Num. Classes               95
Num. Examples           27.00
Time                    47.29
Fluents                257.43
Allen               24,594.60
CBA              1,013,926.47
Table 6.5: Average values for the episodes in the Auslan dataset.
6.2 Classification
In this section we examine the performance of the representations and algorithms on
classification tasks. Experiments are performed on the two classifiers described in
Chapter 5, namely a k-NN classifier and a CAVE classifier. We split the classifiers
into separate sections because additional experiments are performed depending on
the classifier.
6.2.1 k-NN Classifier
We begin by establishing a baseline for each of our datasets. We perform the same
classification task, described in the previous chapter, on each dataset. The results
for a 10-fold cross validation are shown for 10-NN in Table 6.6. In general,
classification accuracy is above 75%, and for the Wubble World, handwriting, and
wafer datasets we are performing well above 90% classification accuracy. The k-NN
classifier performs the worst on the vowel dataset, which may be attributed to the
density of the data.
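For readers unfamiliar with the baseline, the voting step of a k-NN classifier can be sketched as below. The sequence distance used in the previous chapter is not restated in this section, so the `distance` argument is a caller-supplied placeholder, and the set-based toy distance in the example is purely hypothetical; only the nearest-neighbor majority vote is illustrated.

```python
from collections import Counter


def knn_classify(query, training, distance, k=10):
    """Label a query by majority vote among its k nearest training examples.

    training: list of (example, label) pairs.
    distance: caller-supplied function (placeholder for the real measure).
    """
    # sorted() is stable, so equally distant examples keep their input order.
    neighbors = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def toy_distance(a, b):
    """Hypothetical stand-in distance: size of the symmetric difference."""
    return len(a ^ b)


training = [
    ({"a", "b"}, "x"), ({"a", "c"}, "x"), ({"a", "b", "c"}, "x"),
    ({"d", "e"}, "y"), ({"d", "f"}, "y"),
]
print(knn_classify({"a", "b"}, training, toy_distance, k=3))
print(knn_classify({"d"}, training, toy_distance, k=3))
```

Setting `k=10` matches the 10-NN configuration reported in this section; the example uses `k=3` only because the toy training set is tiny.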
Based on the results in Table 6.6, it does not appear to matter which
representation we construct a sequence from. For a given dataset, 10-NN performs
with roughly the same accuracy. This is confirmed with a two-way factorial analysis
of variance with repeated measures (Table 6.7). We chose to use a repeated
measures design since each dataset was evaluated via a K-fold cross-validation
(Cohen, 1995).²

            starts          ends           both           Allen           CBA
           M      SD      M      SD      M      SD      M      SD      M      SD
ww       97.62   3.11   96.90   4.72   98.12   4.05   97.03   5.10     -      -
ww2d     98.33   3.51   96.67   4.30   99.17   2.64   99.17   2.64     -      -
HW1      98.27   1.42   97.63   1.76   98.53   1.21   97.88   1.91   96.99   2.12
HW2      98.85   1.38   97.88   1.32   98.65   1.33   98.21   1.47   95.83   1.29
HW3      96.67   2.80   96.92   2.15   96.79   1.71   94.61   2.38   92.23   2.62
Auslan   84.07   1.76   80.25   2.23   84.88   2.22     -      -       -      -
ecg      82.65   7.75   81.07   9.38   81.07   8.89   83.65   8.46   83.12   7.60
wafer    97.15   2.03   96.99   1.54   97.40   1.88   97.74   1.27   97.57   0.84
vowel    71.91   3.89   72.97   5.14   76.72   3.83   76.07   5.96     -      -
Table 6.6: Percent correct with a 10-NN classifier from a 10-fold cross validation
classification task.
Source                      df    F          p value
Dataset                      7    98.4444    < 0.0001
Representation               3    0.1964     0.8989
Dataset × Representation    21    0.3954     0.9967
Table 6.7: A two-way analysis of variance for dataset by representation shows one
main effect and no significant interaction effect.
The next step is to place the results in context. Table 6.8 contains the best
published performance for datasets not constructed for this dissertation. The
k-NN classifier is outperformed on the Auslan and vowel datasets, but performs as
well or better on the ecg and wafer datasets.
Dataset    Performance    Citation
Auslan     97.90%         Kadous and Sammut (2005)
vowel      96.50%         Rodríguez et al. (2005)
ecg        70.97%         Weng and Shen (2008)
wafer      98.64%         Weng and Shen (2008)
Table 6.8: Best classification accuracy for several of our datasets.
In the previous chapter, we were able to show that classification on the Wubble
World datasets was not a challenging problem since we could shuffle the order of the
sequences and still perform quite well on the classification task. In Figure 6.1, we
² Auslan was not part of the analysis, nor was the CBA sequence representation.
Source                                           df    F           p value
Conversion Method                                 2    392.7431    < 0.0001
Dataset                                           5    360.8100    0.0000
Conversion Method × Activity                     10    71.9144     < 0.0001
Representation                                    3    0.1767      0.9123
Conversion Method × Representation                6    0.5112      0.8003
Dataset × Representation                         15    1.1427      0.3110
Conversion Method × Dataset × Representation     30    0.4204      0.9977
Table 6.11: A three-way analysis for the data shown in Figure 6.2.
on a two-way analysis of variance with repeated measures, the results of which are
shown in Table 6.13.
starts ends both Allen CBA
M SD M SD M SD M SD M SD
ww 50.41% 14.01 82.89% 6.13 79.69% 9.37 98.89% 2.34
ww2d 90.00% 6.57 87.50% 9.00 92.50% 6.15 95.00% 7.03
HW1 93.21% 3.53 91.99% 3.94 91.73% 3.01 95.90% 2.25 92.37% 3.44
HW2 79.17% 5.28 81.22% 7.66 80.90% 7.24 92.63% 4.28 82.24% 2.85
HW3 76.82% 5.12 78.63% 4.29 80.93% 2.62 89.47% 3.33 88.96% 2.34
Auslan 68.91% 2.46 61.16% 2.38 71.63% 2.38
ecg 34.48% 4.57 37.48% 5.41 34.53% 2.88 61.99% 11.93 70.53% 9.83
wafer 94.56% 2.58 94.22% 2.08 94.64% 3.06 91.11% 2.60
vowel 39.97% 9.62 39.24% 3.61 41.36% 4.35 14.85% 6.49
Table 6.12: The classification results for the CAVE classifier on six different
representations. Results are reported from a 10-fold cross validation classification
task.
In general, the performance of the CAVE algorithm when trained on sequences
of Allen relations is worse than k-NN on the classification task. We will comment
more thoroughly on this in Section 6.3.
Source                      df    F           p value
Dataset                      5    171.9580    < 0.0001
Representation               3    33.6624     < 0.0001
Dataset × Representation    15    10.2374     < 0.001
Table 6.13: A two-way analysis of variance for dataset by representation shows two
main effects and a significant interaction effect.


