
A Semantic Framework for Data Analysis in Networked Systems ∗

by Arun Viswanathan, Alefiya Hussain, Jelena Mirkovic, Stephen Schwab, John Wroclawski
NSDI (2011)



Arun Viswanathan† Alefiya Hussain†‡ Jelena Mirkovic† Stephen Schwab‡ John Wroclawski†
† USC/Information Sciences Institute ‡ Sparta Inc.
{aviswana, hussain, mirkovic, jtw}@isi.edu Stephen.Schwab@cobham.com
Abstract
Effective analysis of raw data from networked systems
requires bridging the semantic gap between the data and
the user’s high-level understanding of the system. The
raw data represents facts about the system state and
analysis involves identifying a set of semantically rel-
evant behaviors, which represent “interesting” relation-
ships between these facts. Current analysis tools, such as
wireshark and splunk, restrict analysis to the low-level
of individual facts and provide limited constructs to aid
users in bridging the semantic gap. Our objective is to
enable semantic analysis at a level closer to the user’s
understanding of the system or process. The key to our
approach is the introduction of a logic-based formulation
of high-level behavior abstractions as a sequence or a
group of related facts. This allows treating behavior rep-
resentations as fundamental analysis primitives, elevat-
ing analysis to a higher semantic-level of abstraction. In
this paper, we propose a behavior-based semantic anal-
ysis framework which provides: (a) a formal language
for modeling high-level assertions over networked sys-
tems data as behavior models, (b) an analysis engine for
extracting instances of user-specified behavior models
from raw data. Our approach emphasizes reuse, composability
and extensibility of abstractions. We demonstrate
the effectiveness of our approach by applying it
to five analysis tasks: modeling a hypothesis on traffic
traces, modeling experiment behavior, modeling a security
threat, modeling dynamic change, and composing
higher-level models. Finally, we discuss the performance
of our framework in terms of behavior complexity and
number of input records.
∗This work is funded by the Department of Homeland Security and
Space and Naval Warfare Systems Center, San Diego, under Contract
No. N66001-10-C-2018. All findings and conclusions expressed in
this material are those of the authors and do not reflect the views of the
funding agencies.
Part of Alefiya Hussain’s contribution to this paper was made while she was
at Sparta Inc.
1 Introduction
The ability to convert raw data into higher-level in-
sights and understanding has become a key enabler in
many fields. We approach one particular aspect of this
problem, namely the analysis of data within the domain
of networked and distributed systems. Such systems rou-
tinely generate a plethora of logs, trace and audit data
during their operation. Users, such as researchers and
system administrators, use this raw data to understand
system behavior, diagnose problems, discover new be-
haviors, or verify hypotheses. Effective analysis of such
raw data requires bridging the semantic gap between raw
data and the user’s high-level understanding of the anal-
ysis domain. Our experience with analysis tools reveals
that this problem is ill-addressed.
A typical approach to data analysis involves the user
sifting through the data using simple search and correla-
tion constructs like boolean queries to identify relation-
ships and infer meaning from data. For example, wire-
shark [19] can help identify complete or incomplete TCP
flows from packet traces and splunk [16] can help iden-
tify spurious logins from a server log. Our study of four
popular tools, discussed in Section 2.1, reveals that cur-
rent approaches require cumbersome multi-step analyses
to infer semantic relationships from data. For example,
a user analyzing a network packet trace may first have to
extract individual flows by specifying specific attribute
values related to each flow, and then somehow manually
infer relationships like concurrency between the flows.
This problem is further complicated if the user has to
reason and analyze over multiple types of data. This sep-
aration between the raw data and the meaning it carries
constitutes the semantic gap.
In this paper, we focus on the problem of expressing
analysis tasks that are meaningful and useful to the
user. Specifically, given a finite, timestamped list of facts
about the system under observation, our objective is to
assist the user in expressing and modeling semantically
relevant behaviors, which are “interesting” relationships
between these facts or sequences of facts. These relationships
encompass notions of ordering, causality, dependence,
or concurrency.
Our insight is that higher-level understanding in net-
worked and distributed systems can be expressed in the
form of relationships between system states, simple be-
haviors, and complex behaviors. For example, in most
situations, a typical web-server operation is better un-
derstood as a concurrent relationship between multiple
HTTP sessions to a server rather than the details of the
protocols and specific values in the packet headers. Thus,
our data analysis approach introduces a behavior as a
primitive analysis construct. Behaviors can be extended
or constrained to create a behavior model, which forms
an assertion about the overall behavior of the system. A
behavior model can then be rapidly applied over data to
validate the assertion. We discuss complete details about
specifying behavior models in Section 3, and Section 4
presents the analysis engine for extracting instances of
user-specified behavior models from raw data.
The behavior models are abstract entities to capture
the semantic essence of a particular relationship without
focusing on unnecessary details or particular parameters
that may vary between individual facts or behaviors. In-
corporation of abstract behavior models as explicitly rep-
resented and manipulated constructs within our frame-
work provides two key benefits. First, this abstraction
allows users of our framework to analyze and understand
the raw data at a semantically relevant level. In Sec-
tion 3.4, we introduce an example of a behavior model
to identify pairs of communication events where the destination
IP of the second event is the same as the source IP of
the first. Such models can be used to analyze many different
datasets without any modification. Additionally,
since behavior models are primitive analysis constructs,
the framework supports extensibility by composing new
models from behavior models present in the knowledge
base as demonstrated in Section 5.5. Thus, represent-
ing analysis expertise explicitly as behavior models for-
malizes the semantics for data analysis in networked sys-
tems.
The second key benefit of our work is the ability to
foster sharing and reuse of knowledge embedded in ex-
plicitly represented behavior models. Our first-hand ex-
perience with existing tools suggests that in most cases
knowledge inferred from analysis resides either in a
domain-specific tool or a single expert’s brain. This is
due to a lack of an explicit representation for captur-
ing, storing, sharing, and reusing such knowledge in a
context-independent way. Many current tools are either
static in nature, handling only a fixed set of analyses
and record types, or may offer limited extensibility, but
through some mechanism that involves significant effort.
For example, wireshark [19] is easily extensible using
plugins, but writing a plugin requires understanding the
wireshark API and C programming skills. In contrast,
a well defined shareable format for representing knowl-
edge about networked systems data offers the prospect
that many different tools can be driven by, and contribute
to, a single shared knowledge base.
Beyond the basic challenge, the task of semantic-level
analysis is difficult for two disparate reasons. First, the
definition of “interesting” may vary widely in different
situations, requiring a rich toolbox of techniques for ef-
fective analysis. We address this problem by restricting
the definition of “interesting relationships” to expressing
a particular set of characteristics of networked systems
as discussed in Section 3.1. Second, in large scale sys-
tems, efficient and intelligent data analysis is extremely
resource intensive due to the sheer volume of system
events and traces. While in Section 6 we report perfor-
mance results, this paper primarily discusses the funda-
mental aspects of defining and employing explicit behav-
ior models as a data analysis tool. Real-time analysis of
data for applications such as intrusion detection is a fu-
ture goal as discussed in Section 7.
The fundamental contribution of this paper is the in-
troduction of a behavior-based semantic analysis frame-
work for confirmatory and exploratory analysis of multi-
variate, multi-type, timestamped data captured from net-
worked systems. The main elements of the semantic
framework include (a) a specialized formal language for
specifying behavior models and (b) an analysis engine
for extracting instances of user-specified behavior models
from data. In confirmatory analysis, the user specifies
a validation criterion, expected system behavior, or hypothesis,
by writing a specific model or by composing a
high-level model from existing models contained within
the knowledge base of the framework. In exploratory
analysis, a user applies existing models from the knowl-
edge base to explore data for new or unanticipated be-
haviors. In Section 5 we present five detailed examples
of how the framework can be applied for these data anal-
ysis tasks.
2 Related Work
In this section, we set the context for our work by first
studying four popular analysis tools followed by a dis-
cussion on specification-based approaches for analysis of
networked systems data.
2.1 Tool Comparison
In this section, we study four popular analysis tools:
wireshark v1.2.7 [19], splunk v4.1 [16], Simple
Event Correlator (SEC) v2.5.3 [18], and Bro v1.5.2 [14], and
compare them with our behavior-based semantic analysis
framework (SAF). Both wireshark and splunk are
mainly interactive analysis tools, while Bro and SEC are
real-time monitoring tools. SAF falls in the category of
interactive analysis tools. The tools are compared along seven
dimensions in Table 1: (a) high-level goals, (b) input data
types, (c) analysis specification language, (d) primitive
analysis constructs, (e) semantic analysis constructs, (f)
ability to compose specifications, and (g) abstraction, that
is, specifications in terms of relationships between data
attributes.

| | wireshark | splunk | SEC | Bro | SAF |
|---|---|---|---|---|---|
| System goals | Interactive analysis | Interactive analysis | Real-time event correlation | High-speed, real-time monitoring | Interactive analysis |
| Input data | Network packets | ASCII data from any source | ASCII data from files, stdin, pipes | Network packets | Any type of data (with plugin) |
| Specification language | Boolean logic | Boolean logic | Simple language for specifying rules | Bro scripting language | Formal language based on temporal logic, interval temporal logic and boolean logic |
| Primitive constructs | Boolean predicates | Boolean predicates, unix-like pipelines and commands | Boolean predicates, functions written in Perl | Events (low-level or higher-level) | Behavior (low-level or higher-level) |
| Semantic constructs | None | External commands can encode semantics | Perl functions can encode semantics | Network notions such as connections, IP addrs., ports, and network protocols | Temporal logic and interval temporal logic operators for defining behaviors (Section 3) |
| Composability of specs | None | Queries can be recorded and then composed into other queries | Matching events can trigger creation of new high-level events | Policies can compose lower-level events to generate higher-level events | Behaviors can be composed into higher level behaviors |
| Abstraction | None | None | Limited | Yes | Yes |

Table 1: Comparison of the behavior-based Semantic Analysis Framework (SAF) with four popular data analysis tools.
Each paragraph below introduces an analysis framework,
and the reader is directed to Table 1 for details. The
corresponding features of our framework (SAF) are
introduced in Table 1 and explored in later sections. We
have not considered SQL-based approaches on streaming
data for comparison [6], since SAF representations
are at a higher level of abstraction than database query
languages. However, in Section 7 we discuss how our
framework could benefit from using these SQL extensions to
optimize event storage and retrieval.
wireshark [19] is an open-source tool for interactive
analysis of a large variety of network data from a packet
capture file. Wireshark’s design can be separated into
the analysis framework and plugins. The analysis frame-
work provides the ability to sift through large volumes
of packets visually and provides a boolean query gram-
mar for finding “interesting” relationships and statistical
summaries over typical networking concepts, for exam-
ple, rate, flows, bytes, and connections. The plugin archi-
tecture, on the other hand, is responsible for normalizing
and presenting different types of packet data and protocol
behavior to the analysis framework in a uniform way.
splunk [16] is a popular commercial framework for
unified data analysis of a large variety of data. Splunk’s
strength comes from its ability to index various types of
data, allowing the user to sift through logs by combin-
ing search queries using boolean operations, pipes and
powerful statistical and aggregation functions. Splunk
supports time-based, event-based, value-based correla-
tions and also allows combining queries into higher-level
queries. Splunk is extensible using apps, which allow en-
coding knowledge as queries for sharing and wider dis-
semination. However, it does not provide support for ex-
plicitly capturing domain expertise with semantic con-
structs. It does provide the ability to invoke external
commands, thus providing an indirect way to incorpo-
rate explicit domain expertise into the analyses.
Simple Event Correlator (SEC) [18] is an open-source
framework for rule-based event correlation. SEC
reads the analysis specifications from a configuration file
containing a set of event matching rules and correspond-
ing actions. SEC processes data from log files, pipes and
standard streams to trigger the configured actions on a
match. It supports both time-based and event-based cor-
relations and also allows specifying abstract rules that
bind their values at runtime. SEC is more sophisticated
than the previous two tools: it supports composing
higher-level events by correlating low-level events,
providing a framework for semantic understanding. Its rule
types pair and pairwithwindow capture some of the
semantics of ordering and duration. However, it lacks
support for inferring interval-based temporal relationships
like concurrency and overlap, and the analysis specifications
in its configuration files do not offer an intuitive means to
capture and share domain expertise in a generic way.
Bro [14] is a high-speed intrusion detection system for
checking security policy violations by passively moni-
toring network traffic in real-time. Bro’s security poli-
cies are written in the specialized Bro scripting language
which is geared towards security analysis. The lan-
guage supports semantic constructs such as connections,
IP addresses, ports, and various network protocols along
with various operators and functions to express different
forms of network analyses. Bro has the ability to do time-
based and event-based correlation. However, Bro mainly
processes network packet data and uses a programming
language-based analysis approach.
2.2 Specification-based Approaches
Specification-based approaches are particularly appeal-
ing in various areas of networked and distributed systems
due to their ability to be abstract, concise, precise, and
verifiable. In formal verification of distributed and con-
current systems, a system is specified in logic and then
formal reasoning is applied on the specification to ver-
ify desired properties [3, 9]. In declarative networking, a
specification language, Network Datalog (NDLog) [10],
allows defining high-level networking specifications for
rapidly specifying, modeling, implementing, and experi-
menting with evolving designs for network architectures.
In testbed-based experimentation, a simple set of user-
supplied expectations are used to validate expected be-
havior of an experiment [12].
The formal specification approaches have been well
developed within the intrusion detection community and
have been successfully applied to network and audit data
for analysis. In this section we first present a brief
overview of four such approaches and then compare
them to SAF.
Roger et al. [15] leverage the idea that attack signatures
are best expressed in simple temporal logic using
temporal connectives to express ordering of events. They
pose the detection problem as a model-checking problem
against event logs. Naldurg et al. [13] propose another
temporal-logic based approach for real-time monitoring
and detection. Their language EAGLE supports
parameterized recursive equations and allows specifying
signatures with complex temporal event patterns along
with properties involving real-time, statistics and data
values. Kinder et al. [8] extend the logic CTL (Computation
Tree Logic) and introduce CTPL (Computation Tree
Predicate Logic) to describe malicious code as a high-level
specification. Their approach allows writing specifications
that capture malware variants. Ellis et al. [4]
introduce a behavioral detection approach to malware by
focusing on detecting patterns at a higher level of abstraction.
They introduce three high-level behavioral signatures
which have the ability to detect classes of worms
without needing any a priori information about the worm
behavior.
The SAF abstract models are comparable to the ap-
proaches of [13, 8, 4] in their use of formal logic and
temporal constructs for specifications. But, in addition
to providing an extended set of sophisticated intuitive op-
erators and constructs, the behavior models presented in
this paper can be generically applied to model various
scenarios over a variety of data and are easily composed
into semantically relevant higher-level models. This al-
lows creating a knowledge base to explicitly capture do-
main expertise required for analyzing a large variety of
operations encountered in networked and distributed sys-
tems as shown in Section 5. The higher-level behav-
ioral signatures [4] based on the network-theoretic ab-
stract communication network (ACN) are tightly bound
to networking constructs like hosts, routers, sensors and
links making them very restrictive in their ability to ex-
press general networked systems behaviors.
The SAF is based on a logic-based specification ap-
proach rather than a programming language-based spec-
ification approach like the one followed in Bro. Our
goal is that the behavior models should be abstract but
also concise and precise to support well-known knowl-
edge representation and reasoning approaches. Logic
is declarative and type-free, imparting formal seman-
tics, abstract specifications, and efficient processing by
analysis engines. The logic-based approach also enables
building a knowledge base of behavior models to explic-
itly capture domain expertise that can be used to auto-
matically reason and infer behavior models. However,
logic-based approaches are less expressive than program-
ming languages. The expressiveness of our approach
is based on requirements derived from characteristics of
networked systems as discussed in Section 3.1.
3 Behavior Models
A particular execution of a networked system or process
can be captured as a sequence of states, where a state
is a collection of attributes and their values. A behav-
ior (b) is a sequence of one or more related states. A
system execution is thus defined as a combination of dif-
ferent behaviors, and each new execution may generate
a unique set of behaviors. A behavior model (φ) is a for-
mula that makes an assertion about the overall behavior
of the system.
For example, consider a simplified IP flow in net-
working, where a flow is a communication between two
hosts identified by their IP addresses. For simplicity
we assume an IP flow to be broken into two states:
ip_s2d denotes a packet from some source to destination
host and ip_d2s denotes a packet from a destination
to source. Then, a valid IP flow behavior, IPFLOW, is
one where ip_s2d and ip_d2s are related by their source
and destination attributes with the additional criterion that
ip_d2s always occurs after ip_s2d. The behavior model
($\phi_{\mathit{ipflow}}$) is an assertion that IPFLOW is valid. We
discuss details of this example and extend it further in
Section 3.4.
In this section, we first discuss the requirements and
design choices for a language to specify behaviors fol-
lowed by the formal syntax and semantics of the lan-
guage.
3.1 Requirements
As discussed in Section 1, the key objective of our frame-
work is to enable semantic-level analysis over data. A
semantically expressive language for analysis over net-
worked and distributed systems data must meet the fol-
lowing requirements: (a) enable analysis over multi-
type, multi-variate, timestamped data, (b) express a wide
variety of “interesting” relationships, (c) enable analysis
over higher-level abstractions, and (d) enable composing
abstractions into higher-level abstractions.
The language should express at least the following
“interesting” relationships to capture the core character-
istics of networked and distributed systems: (a) causal
relationships between behaviors, for example, a file be-
ing opened only if a user is authorized; (b) partial or to-
tal ordering, for example, in-order or out-of-order arrival
of packets; (c) dynamic changes over time, for example,
traffic between client and server drops after an attack on
the server; (d) concurrency of operations, for example,
simultaneous web client sessions; (e) multiple possible
behaviors, for example, a polymorphic worm behavior
may vary on each execution; (f) synchronous or asyn-
chronous operations, for example, some operations need
to complete within a specific time whereas others need
not; (g) value dependencies between operations, for ex-
ample, a TCP flow is valid only if the attribute–values
contained in the individual packets are related to each
other; (h) invariant operations, for example, some operations
may always hold true; and (i) eventual operations,
for example, some operations happen in the course of
time. In addition, we need traditional mechanisms, such
as boolean operators and loops, for combining these re-
lationships into complex behaviors and mechanisms for
basic counting of events and reasoning over the counts.
We do not claim completeness of the above require-
ments but we believe that being able to express the above
classes of primitive relationships and combining them
to form complex relationships would suffice for a wide
range of situations, a few of which we demonstrate as
case studies in Section 5.
3.2 Design
The following four design decisions realize the requirements
listed above. First, our framework provides logic-based
support to formulate behavior abstractions as a sequence
or group of related events, where events are a uniform
representation of system facts as discussed later.
This formulation allows treating the behavior representation
as a fundamental analysis primitive, elevating analyses
to a higher semantic level of abstraction.
Second, the language combines operators from Allen’s
interval-temporal logic [1], Lamport’s Temporal Logic
of Actions [9] and boolean logic. Temporal logic allows
expressing the ordering of events in time without explic-
itly introducing time. Interval-temporal logic allows ex-
pressing relationships like concurrency, overlap and or-
dering between behaviors as relationships between their
time-intervals. Additionally, complex behaviors are eas-
ily composed from simpler ones using boolean operators.
Third, the framework enables specifying dependency
relationships between event attributes while leaving the
values to be dynamically populated at runtime. Late
binding enables abstract specifications that enrich the
knowledge base as they can be directly applied to a wide
variety of data-sets. This also enables parametrization of
models during complex model composition as discussed
in Section 5.5.
Lastly, the framework introduces the notion of a
domain-independent event as a uniform representation
of multi-type, multi-variate, timestamped data. Specifically,
an event (e) is a representation of system state
and is given by a 4-tuple $\langle o, c, t, av \rangle$, where o is the
event origin (for example, the host IP), c is the event
type (for example, PKT_TCP or APP_HTTPD), t is the
event timestamp, and
$av = \{ \langle a_i, v_i \rangle \mid a_i \in A,\ v_i \in \mathit{Strings},\ 1 \le i \le D_c \}$
are the attribute-value pairs contained in the event. A is the
set of attribute labels, for example, sip, dip, etype. $D_c$ is
the number of attributes in an event of type c. This
normalization of data to events ensures that the analysis
algorithms are independent of the input domain.
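As a minimal sketch of this normalization, a raw packet record might be mapped to the 4-tuple as shown below. The names `Event` and `normalize_packet`, and the layout of the input record (keys `ts`, `src`, `dst`), are hypothetical and chosen only for illustration; they are not part of the framework described here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Event:
    """Hypothetical container for the 4-tuple <o, c, t, av>."""
    origin: str            # o: event origin, e.g. the host IP
    etype: str             # c: event type, e.g. PKT_TCP
    timestamp: float       # t: event timestamp (seconds assumed)
    attrs: Dict[str, str]  # av: attribute-value pairs, values kept as strings


def normalize_packet(origin: str, record: dict) -> Event:
    """Map an assumed raw packet record (a plain dict) to an Event."""
    attrs = {
        "sip": str(record["src"]),
        "dip": str(record["dst"]),
        "etype": "PKT_TCP",
    }
    return Event(origin=origin, etype="PKT_TCP",
                 timestamp=float(record["ts"]), attrs=attrs)


event = normalize_packet("10.1.1.2",
                         {"ts": 12.5, "src": "10.1.1.2", "dst": "10.1.1.3"})
print(event.etype, event.attrs["sip"], event.attrs["dip"])
```

Keeping all attribute values as strings mirrors the definition of av above, and is what lets later matching code stay independent of the input domain.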
We believe these design decisions ensure developing
abstract behavior models as first-order primitives for cap-
turing, storing, and reusing domain expertise for the anal-
ysis of networked systems. Next we discuss the syntax
of such a language.
3.3 Syntax
The language grammar for defining a behavior model
φ as a formula consists of five key elements, as shown
in Figure 1: state propositions S as atomic formulae;
grouping operators ‘(’ and ‘)’ to define sub-formulae;
logical operators and temporal operators for relating
sub-formulae or atomic-formulae; the optional behavior
constraints bcon and operator constraints opcon written
within ‘[’ and ‘]’; and the relational operators relop.
    φ     ::= ‘(’ S | φ ‘)’ { bcon }
            | not φ                  (negation)
            | φ and φ                (logical and)
            | φ or φ                 (logical or)
            | φ xor φ                (logical xor)
            | φ ∼>(opcon) φ          (leadsto)
            | [ ](opcon) φ           (always)
            | φ olap(opcon) φ        (overlaps)
            | φ dur(opcon) φ         (during)
            | φ sw(opcon) φ          (startswith)
            | φ ew(opcon) φ          (endswith)
            | φ eq(opcon) φ          (equals)
    bcon  ::= ‘[’ {tc | cc} ‘]’
    tc    ::= {at | duration | end} relop t{: t}
    cc    ::= {icount | bcount | rate} relop c{: c}
    opcon ::= ‘[’ relop t{: t} ‘]’
    relop ::= {> | < | = | ≥ | ≤ | ≠}
    t     ::= [0-9]+ {s | ms}
    c     ::= [0-9]+

Figure 1: The grammar for specifying a behavior model φ.

A state proposition, S, is an atomic formula for capturing
events that satisfy specified relations between attributes
and their values. In essence, S captures states of
a system or process and is the basic element of a behavior
model. The most trivial behavior model is one with a
single state proposition. Formally, S is represented as a
finite collection of related attribute-value tuples:

\[ S = \{ (a_i, r_i, v_i) \mid i \in \mathbb{N},\ a_i \in A,\ v_i \in V,\ r_i \in \{=, >, <, \ge, \le, \ne\} \} \]
A is a set of string labels, such as sip, dip, and
etype, and V is a set of string constants, such as
10.1.1.2 and /bin/sh, along with two kinds of special strings: (a)
strings prefixed with ‘$’, as in $$ and $s2.dst, and (b) strings
with the wild-card character ‘*’, as in /etc/pas*. Considering
our previous example of IPFLOW, the state
propositions ip_s2d and ip_d2s are written as:
    ip_s2d = {etype=PKT_IP, sip=$$, dip=$$}
    ip_d2s = {etype=PKT_IP, sip=$ip_s2d.dip, dip=$ip_s2d.sip}
State proposition ip_s2d contains three attributes:
etype, sip and dip. etype has a constant value
PKT_IP, while the sip and dip attributes use the ‘$’-prefixed
special variables, which are dynamically bound at
runtime. State proposition ip_d2s defines the values of
its sip and dip attributes as being dependent on values
of state ip_s2d. Dependent attributes along with dynamic
binding of values allow leaving out details like
the actual IP addresses from the specification.
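The late binding described above can be illustrated with a small sketch. It is not the framework's analysis engine: `match_state` and `find_ipflows` are hypothetical helpers, events are plain dictionaries, only the ‘=’ relation is handled, wild-cards are ignored, and the log is assumed to be ordered by timestamp.

```python
def match_state(prop, event, bindings):
    """Return the attribute values bound by `event` if it satisfies `prop`, else None.

    `prop` maps attribute labels to a constant, to "$$" (bind at runtime), or to
    "$state.attr" (value taken from an earlier matched state in `bindings`).
    """
    for attr, wanted in prop.items():
        if attr not in event:
            return None
        if wanted == "$$":
            continue  # any value is accepted; it is recorded in the result below
        if wanted.startswith("$"):
            state, ref_attr = wanted[1:].split(".")
            if state not in bindings or bindings[state][ref_attr] != event[attr]:
                return None
        elif wanted != event[attr]:
            return None
    return {attr: event[attr] for attr in prop}


ip_s2d = {"etype": "PKT_IP", "sip": "$$", "dip": "$$"}
ip_d2s = {"etype": "PKT_IP", "sip": "$ip_s2d.dip", "dip": "$ip_s2d.sip"}


def find_ipflows(log):
    """Pair each ip_s2d event with the first later event that satisfies ip_d2s."""
    flows = []
    for i, first in enumerate(log):
        bound = match_state(ip_s2d, first, {})
        if bound is None:
            continue
        for second in log[i + 1:]:
            if match_state(ip_d2s, second, {"ip_s2d": bound}) is not None:
                flows.append((first, second))
                break
    return flows


log = [
    {"t": 1.0, "etype": "PKT_IP", "sip": "10.1.1.2", "dip": "10.1.1.3"},
    {"t": 2.0, "etype": "PKT_IP", "sip": "10.1.1.9", "dip": "10.1.1.2"},
    {"t": 3.0, "etype": "PKT_IP", "sip": "10.1.1.3", "dip": "10.1.1.2"},
]
flows = find_ipflows(log)
print([(a["t"], b["t"]) for a, b in flows])
```

Note that no concrete IP address appears in either state proposition; the same two definitions would apply unchanged to any other packet log with sip and dip attributes.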
The temporal operators allow expressing temporal relationships
like ordering and concurrency between one
or more behaviors. The linear-time temporal operator
(leadsto), written as ∼>, is used to express causal relationships
between behaviors. The interval temporal logic
operators express concurrent relationships between behaviors
as relationships (a) between their start times
using sw (startswith), (b) between their end times
using ew (endswith), or (c) between their durations using
olap (overlap), eq (equals) and dur (during). The □
(always) operator, written as [ ], allows expressing invariant
behaviors. The logical operators not, and, or, xor
are supported for logical operations over behaviors and
for creating complex behaviors.
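To make the interval relations concrete, the sketch below expresses them as predicates over the time span of two behavior instances. The definitions follow the usual reading of Allen's interval relations and are an assumption made for illustration; the framework's exact semantics are those given in Table 2, and the `Interval` type and the treatment of equal timestamps in `leadsto` are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Time span covered by one behavior instance (hypothetical helper)."""
    start: float
    end: float


def leadsto(a: Interval, b: Interval) -> bool:
    """Assumed reading of a ~> b: b starts no earlier than a ends."""
    return a.end <= b.start


def sw(a: Interval, b: Interval) -> bool:
    """startswith: both instances start at the same time."""
    return a.start == b.start


def ew(a: Interval, b: Interval) -> bool:
    """endswith: both instances end at the same time."""
    return a.end == b.end


def eq(a: Interval, b: Interval) -> bool:
    """equals: both instances cover exactly the same span."""
    return a.start == b.start and a.end == b.end


def olap(a: Interval, b: Interval) -> bool:
    """overlaps: a starts first and b starts before a ends, then outlasts it."""
    return a.start < b.start < a.end < b.end


def dur(a: Interval, b: Interval) -> bool:
    """during: a lies strictly inside b."""
    return b.start < a.start and a.end < b.end


session_a = Interval(0.0, 5.0)
session_b = Interval(3.0, 8.0)
print(olap(session_a, session_b), leadsto(session_a, session_b))
```

Two concurrent web client sessions, for instance, would satisfy olap or dur, while a request followed by its response would satisfy leadsto.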
Behavior constraints allow placing additional con-
straints on the matching behavior instances and are spec-
ified immediately following the behavior within square
brackets. Constraints and their values are related using
the standard relational operators. The six behavior constraints
are divided into time constraints tc and count constraints
cc. Time constraints allow constraining the behavior
start time using at, the behavior end time using end, and
the behavior duration using duration. The time value, t,
for the constraint can be specified as a single positive
value or as a range. Additionally, the values can be suffixed
with either ‘s’ or ‘ms’ to indicate seconds or milliseconds
respectively. The count constraints allow constraining
the number of matching behavior instances using
icount, the size of each behavior instance using bcount,
and the rate of events within a behavior instance using rate.
Operator constraints allow specifying time bounds over
the temporal operators thus allowing their semantics to
be slightly modified. The operator constraint values are
specified as a single value or a range along with a rela-
tional operator. Table 2 presents detailed semantics of
operators along with behavior and operator constraints.
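As an illustration of how constraint values could be interpreted, the following Python sketch parses a time value with an optional 's' or 'ms' suffix, accepts either a single value or a range, and applies a relational operator. The function names, the operator spellings, the 't1:t2' range spelling (borrowed from the notation of Table 2), the treatment of an unsuffixed value as seconds, and the restriction of ranges to '=' are all assumptions made for this sketch; it is not the framework's implementation.

```python
import operator

# Relational operators allowed in a constraint (spelling assumed for this sketch).
RELOPS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def parse_time(text):
    """Return a time value in seconds; accepts '5s', '200ms' or a bare number."""
    text = text.strip()
    if text.endswith("ms"):
        seconds = float(text[:-2]) / 1000.0
    elif text.endswith("s"):
        seconds = float(text[:-1])
    else:
        seconds = float(text)  # assumption: no suffix means seconds
    if seconds < 0:
        raise ValueError("time values must not be negative")
    return seconds


def constraint_holds(actual, relop, value_text):
    """Check `actual relop value`, where value is a single time or a range."""
    bounds = [parse_time(part) for part in value_text.split(":")]
    if len(bounds) == 1:
        return RELOPS[relop](actual, bounds[0])
    if len(bounds) == 2 and relop == "=":
        low, high = bounds
        return low <= actual <= high  # range semantics as in dur[= t1 : t2]
    raise ValueError("ranges are only handled with '=' in this sketch")
```

For instance, constraint_holds(0.3, "<=", "500ms") checks a measured duration of 0.3 seconds against a hypothetical constraint [duration <= 500ms].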
Expressing a behavior in the language consists of writing sub-formulae. Behaviors are always enclosed within parentheses, '(' and ')'. Simple behaviors are constructed by relating one or more state propositions using operators, while complex behaviors are constructed by relating one or more behaviors. The grammar also allows expressing complex behaviors using recursion; we present an example in Section 5.3. Recursive definitions allow expressing looping behavior, for which the loop bounds can optionally be specified using the bcount behavior constraint. The current grammar does not support existential and universal quantification, since the need for them is not yet clear. We will explore these language extensions as part of our future work.
Writing behavior models in the framework involves additional syntax such as namespaces, headers and variables, which are discussed along with the case studies in Section 5.1 and Section 5.2. The next section presents the formal semantics of the language.
3.4 Semantics
We first define two concepts important for understanding the semantics. A sequential log (L) is a finite sequence of timestamped events L = e1, e2, e3, . . . , en such that ei.t ≤ ej.t, ∀ i < j. A behavior instance Bφ for a behavior model φ is a sequence or group of events satisfying
the behavior model φ.
Bφ = 〈starttime, endtime, bcount, (b1, b2, . . . , bk)〉
where (b1, b2, . . . , bk) ⊆ L and each bi can be an individual event e or another behavior instance Bφi. starttime = b1.starttime is the starting time of the behavior as defined by its first element, and endtime = bk.endtime is the ending time of the behavior as defined by its last element. bcount = k is the total number of elements in the behavior instance. All bi are in increasing time order of their starttime. Additionally, let Bφ.duration = (Bφ.endtime − Bφ.starttime) be the duration of the behavior instance and |Bφ| = Bφ.bcount represent the size of the behavior instance. If φ is a simple behavior, such as a state proposition S, then
BS = 〈e_i1.t, e_ik.t, k, (e_i1, . . . , e_ik)〉
where (e_i1, . . . , e_ik) ⊆ L.

Table 2: Semantics of operators, behavior constraints and operator constraints in our logic. We describe semantics for constraints considering only a single relational operator and refer the reader to the framework webpage [17] for details. Each entry gives the behavior model ψ, the meaning of ψ, and the condition under which L satisfies ψ (L |= ψ).

(φ)
  Meaning: φ is a behavior.
  Satisfied iff: ∃Bφ ⊆ L and |Bφ| > 0
S
  Meaning: S is a state proposition defined as S = {(a1, r1, v1), . . . , (ad, rd, vd)}.
  Satisfied iff: (a) |BS| > 0, (b) ∀ e ∈ BS, ∀ i ∈ {1, . . . , d}, e.ai is defined and values e.vi and S.vi satisfy relation ri.
(not φ)
  Meaning: Negation of behavior is true.
  Satisfied iff: L ⊭ φ, that is, |Bφ| = 0
(φ1 and φ2)
  Meaning: Both φ1 and φ2 are true.
  Satisfied iff: L |= φ1 and L |= φ2
(φ1 or φ2)
  Meaning: φ1 and φ2 are not both false simultaneously.
  Satisfied iff: L |= φ1 or L |= φ2, or L satisfies both φ1 and φ2
(φ1 xor φ2)
  Meaning: Either φ1 or φ2 is true, but not both.
  Satisfied iff: L |= φ1 or L |= φ2, but not both
(φ1 ~> φ2)
  Meaning: φ1 leadsto φ2, that is, whenever φ1 is satisfied φ2 will eventually be satisfied.
  Satisfied iff: (a) L |= φ1 and L |= φ2, (b) Bφ1[1] ≠ Bφ2[1], (c) Bφ2.starttime ≥ Bφ1.endtime
(φ1 ~>[≤ t] φ2)
  Meaning: Whenever φ1 is satisfied φ2 will be satisfied within t time units.
  Satisfied iff: (a) L |= (φ1 ~> φ2), (b) Bφ2.starttime ≤ (Bφ1.endtime + t)
([ ] φ)
  Meaning: φ is always satisfied, that is, satisfied by each event.
  Satisfied iff: ∀ e ∈ L, e |= φ
([ ][= t] φ)
  Meaning: φ is always satisfied within every consecutive interval (epoch) of t time units.
  Satisfied iff: t > 0 and for all consecutive intervals t, lt ⊆ L and lt |= φ
(φ1 sw φ2)
  Meaning: φ1 starts with φ2.
  Satisfied iff: (a) L |= φ1 and L |= φ2, (b) Bφ1[1] ≠ Bφ2[1], (c) Bφ1.starttime = Bφ2.starttime
(φ1 sw[≥ t] φ2)
  Meaning: φ1 starts t time units after φ2.
  Satisfied iff: (a) L |= (φ1 sw φ2), (b) Bφ1.starttime ≥ (Bφ2.starttime + t)
(φ1 ew φ2)
  Meaning: φ1 ends with φ2.
  Satisfied iff: (a) L |= φ1 and L |= φ2, (b) Bφ1[1] ≠ Bφ2[1], (c) Bφ1.endtime = Bφ2.endtime
(φ1 ew[= t] φ2)
  Meaning: φ1 ends t time units after φ2.
  Satisfied iff: (a) L |= (φ1 ew φ2), (b) Bφ1.endtime = (Bφ2.endtime + t)
(φ1 olap φ2)
  Meaning: φ1 overlaps φ2, that is, φ1 starts after φ2 starts but before φ2 ends, and ends after φ2 ends.
  Satisfied iff: (a) L |= φ1 and L |= φ2, (b) Bφ1[1] ≠ Bφ2[1], (c) (Bφ2.starttime < Bφ1.starttime < Bφ2.endtime) and (Bφ1.endtime > Bφ2.endtime)
(φ1 olap[> t] φ2)
  Meaning: φ1 overlaps φ2 and the overlapping region is greater than t time units.
  Satisfied iff: (a) L |= (φ1 olap φ2), (b) the overlap (Bφ2.endtime − Bφ1.starttime) > t
(φ1 eq φ2)
  Meaning: φ1 equals φ2 in duration.
  Satisfied iff: (a) L |= φ1 and L |= φ2, (b) Bφ1[1] ≠ Bφ2[1], (c) Bφ1.duration = Bφ2.duration
(φ1 eq[= t] φ2)
  Meaning: φ1 and φ2 are both of duration t.
  Satisfied iff: (a) L |= (φ1 eq φ2), (b) Bφ1.duration = Bφ2.duration = t
(φ1 dur φ2)
  Meaning: φ1 occurs during φ2, that is, φ1 starts after φ2 starts and ends before φ2 ends.
  Satisfied iff: (a) L |= φ1 and L |= φ2, (b) Bφ1[1] ≠ Bφ2[1], (c) (Bφ1.starttime > Bφ2.starttime) and (Bφ1.endtime < Bφ2.endtime)
(φ1 dur[= t1 : t2] φ2)
  Meaning: φ1 occurs during φ2 with duration between t1 and t2.
  Satisfied iff: (a) L |= (φ1 dur φ2), (b) (t1 ≤ Bφ1.duration ≤ t2)
(φ)[icount = c]
  Meaning: The number of behavior instances satisfying φ is c.
  Satisfied iff: (a) L |= φ, (b) there exist distinct B1φ . . . Bcφ ⊆ L
(φ)[bcount = c]
  Meaning: Behavior instances satisfying φ are of size c.
  Satisfied iff: (a) L |= φ, (b) Bφ.bcount = c
(φ)[rate > c]
  Meaning: Behavior instances satisfying φ have a rate, defined as (behavior size / behavior duration), greater than c.
  Satisfied iff: (a) L |= φ, (b) (Bφ.bcount / Bφ.duration) > c and Bφ.duration > 0
(φ)[at < t]
  Meaning: The starting time of behavior instances satisfying φ must be less than absolute time t.
  Satisfied iff: (a) L |= φ, (b) Bφ.starttime < t
(φ)[end ≥ t]
  Meaning: Behavior instances satisfying φ have an endtime greater than or equal to absolute time t.
  Satisfied iff: (a) L |= φ, (b) Bφ.endtime ≥ t
(φ)[duration ≠ t]
  Meaning: Behavior instances satisfying φ are of duration ≠ t.
  Satisfied iff: (a) L |= φ, (b) Bφ.duration ≠ t

Figure 2: Sequence diagram of IP-interaction between four nodes. → or ← represent an IP packet between a source (s) and destination (d). An IP flow is a packet pair between s and d.
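The behavior-instance tuple 〈starttime, endtime, bcount, (b1, . . . , bk)〉 can be mirrored by a small recursive data structure. The Python sketch below is one possible encoding; the class names Event and BehaviorInstance and the choice to derive starttime, endtime and bcount from the element tuple are assumptions of this sketch, not the framework's actual representation.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Event:
    t: float  # timestamp e.t

    # A single event starts and ends at its own timestamp.
    @property
    def starttime(self):
        return self.t

    @property
    def endtime(self):
        return self.t


@dataclass(frozen=True)
class BehaviorInstance:
    # Elements b1..bk: individual events or nested behavior instances.
    elements: Tuple[Union[Event, "BehaviorInstance"], ...]

    def __post_init__(self):
        if not self.elements:
            raise ValueError("a behavior instance needs at least one element")
        starts = [b.starttime for b in self.elements]
        if starts != sorted(starts):
            raise ValueError("elements must be in increasing starttime order")

    @property
    def starttime(self):
        return self.elements[0].starttime   # defined by the first element

    @property
    def endtime(self):
        return self.elements[-1].endtime    # defined by the last element

    @property
    def bcount(self):
        return len(self.elements)           # number of elements, k

    @property
    def duration(self):
        return self.endtime - self.starttime

    def __len__(self):
        return self.bcount                  # |B| = B.bcount
```

Because elements may themselves be behavior instances, bcount for a composite instance counts its direct elements rather than the underlying events.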
Given a finite sequential log L and a user-defined behavior model φ, the goal of the analysis is to find all behavior instances (B1φ, B2φ, . . .) in L that satisfy the behavior model, where satisfiability is defined as follows:
L |= φ iff ∃Bφ ⊆ L and |Bφ| > 0
That is, the log L satisfies (|=) the behavior model φ iff there exists a behavior instance Bφ in L of non-zero length |Bφ|. Since φ is a composite formula built from many sub-formulae, the satisfiability of φ is determined as a function of the satisfiability of its sub-formulae. Table 2 defines the satisfiability criteria for sub-formulae formed using the operators and constraints. We next explain the key language ideas by defining simple models and applying them to a fictitious data set.
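The timing conditions that Table 2 attaches to the temporal operators can be read as simple predicates over start and end times. The following Python sketch restates them for two behavior instances; the Interval type and the function names are assumptions of this sketch, and condition (b) of Table 2 (the two instances must not begin with the same element) is deliberately left out because it depends on the underlying events.

```python
from collections import namedtuple

# Minimal stand-in for a behavior instance: only its time bounds.
Interval = namedtuple("Interval", "starttime endtime")


def duration(b):
    return b.endtime - b.starttime


def leadsto(b1, b2, within=None):
    """b1 ~> b2: b2 starts no earlier than b1 ends (optionally within t)."""
    ok = b2.starttime >= b1.endtime
    if within is not None:
        ok = ok and b2.starttime <= b1.endtime + within
    return ok


def sw(b1, b2):
    """b1 starts with b2."""
    return b1.starttime == b2.starttime


def ew(b1, b2):
    """b1 ends with b2."""
    return b1.endtime == b2.endtime


def olap(b1, b2):
    """b1 starts inside b2 and ends after b2 ends."""
    return b2.starttime < b1.starttime < b2.endtime and b1.endtime > b2.endtime


def eq(b1, b2):
    """b1 and b2 have equal duration."""
    return duration(b1) == duration(b2)


def dur(b1, b2):
    """b1 lies strictly inside b2."""
    return b1.starttime > b2.starttime and b1.endtime < b2.endtime
```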
Assume a packet trace of seven IP packets representing an interaction between four nodes A, B, C and D, as shown in Figure 2. Let the sequential log of corresponding events be e1, e2, . . . , e7.
Using the states ip_s2d and ip_d2s defined earlier in Section 3.3, IP flow behavior is written as a causal relationship between the state propositions ip_s2d and ip_d2s as IPFLOW=(ip_s2d ~> ip_d2s). There are three IP flow instances in Figure 2 that satisfy IPFLOW, that is, icount = 3 with bcount = 2 for each instance:
B1ipflow = (e1, e7)
B2ipflow = (e2, e5)
B3ipflow = (e3, e4)
Extending the example, a complex behavior for pairs of overlapping IP flows can now be written as IPFLOW_PAIRS=(IPFLOW olap IPFLOW). There are three instances of overlapping IPFLOW pairs in Figure 2:
B1ipflow_pairs = ((e1, e7), (e2, e5))
B2ipflow_pairs = ((e1, e7), (e3, e4))
B3ipflow_pairs = ((e2, e5), (e3, e4))
Again, icount = 3 and bcount = 2 for each instance, since bcount counts the number of IPFLOW occurrences and not individual events.
We can additionally define a bad IP flow behavior, BAD_IPFLOW, as one for which there was no matching response from the destination. That is, BAD_IPFLOW=(ip_s2d ~> (not ip_d2s)). Event e6 matches the BAD_IPFLOW model since it has no matching response. That is, B1bad_ipflow = (e6), with bcount = 1.
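To make the IP flow examples concrete, the Python sketch below extracts IP flows and bad IP flows from a seven-packet trace. The addresses in TRACE are hypothetical, chosen only so that the result agrees with the instances listed above; they are not read from Figure 2. The sketch also assumes that the response state binds its source and destination to the destination and source of the request, and that a packet already matched as a response is not reconsidered as a request. It is a naive illustration, not the framework's analysis algorithm.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    name: str
    t: int
    sip: str
    dip: str


# Hypothetical trace between nodes A, B, C and D (addresses are assumed).
TRACE = [
    Packet("e1", 1, "A", "B"),
    Packet("e2", 2, "A", "C"),
    Packet("e3", 3, "C", "D"),
    Packet("e4", 4, "D", "C"),
    Packet("e5", 5, "C", "A"),
    Packet("e6", 6, "A", "D"),
    Packet("e7", 7, "B", "A"),
]


def extract_ipflows(trace):
    """Return (flows, bad_flows) for a time-ordered list of packets."""
    flows, bad_flows, used_as_response = [], [], set()
    for i, request in enumerate(trace):
        if i in used_as_response:
            continue  # assumption: responses are not treated as new requests
        for j in range(i + 1, len(trace)):
            response = trace[j]
            reversed_pair = (response.sip == request.dip
                             and response.dip == request.sip)
            if j not in used_as_response and reversed_pair:
                flows.append((request.name, response.name))
                used_as_response.add(j)
                break
        else:
            # No later packet answers this request.
            bad_flows.append(request.name)
    return flows, bad_flows
```

On this assumed trace, extract_ipflows(TRACE) pairs e1 with e7, e2 with e5 and e3 with e4, and reports e6 as the only request without a response.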
The next section describes the architecture of the anal-
ysis framework.
4 Semantic Analysis Framework
Given our objective of semantic-level data analysis, we require the analysis framework to support (a) analysis of multi-type, multi-variate, timestamped data, (b) definition of new models by composing existing models, and (c) storage, retrieval and extensibility of domain-specific behavior models. The framework has five components, as shown in Figure 3: the knowledge base, a data normalizer, an event storage system, an analysis engine and a presentation engine. Decoupling the behavior model specification, the input processing and the analysis algorithms allows the framework to be applied directly across several different domains. Subsequent sections discuss the details of each component.
4.1 Knowledge Base
The knowledge base provides a namespace-based mechanism for storing behavior models and is central to making the framework extensible. For example, our networking domain currently defines models for ipflow, tcpflow, icmpflow and udpflow. These behavior models capture common domain information and allow a user to rapidly compose higher-level models by reusing existing behavior models. Reusing a behavior model from the knowledge base consists of importing it using its namespace and name. For example, referring to the behavior model in Figure 4(a), line 5 imports the IPFLOW model from the NET.BASE_PROTO domain. The namespace allows categorization of models into domain-specific areas while allowing composition of models across domains. We implement namespaces in a manner similar to Java packages, that is, each component in the namespace corresponds to a directory name on the filesystem. This simple design ensures that the knowledge base is easily customizable and extensible.
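The mapping from a namespace to directories can be sketched in a few lines of Python. The function name model_path, the root directory "kb" and the ".model" file suffix are hypothetical; the text only states that each namespace component corresponds to a directory name, so the on-disk file naming shown here is an assumption.

```python
from pathlib import PurePosixPath


def model_path(root, qualified_name, suffix=".model"):
    """Map e.g. 'NET.BASE_PROTO.IPFLOW' to '<root>/NET/BASE_PROTO/IPFLOW<suffix>'.

    The last dotted component is taken as the model name; the components
    before it are the namespace, one directory per component.
    """
    *namespace, name = qualified_name.split(".")
    if not namespace or not name or not all(namespace):
        raise ValueError("expected NAMESPACE.NAME, got %r" % qualified_name)
    return PurePosixPath(root, *namespace) / (name + suffix)
```

Under these assumptions, model_path("kb", "NET.BASE_PROTO.IPFLOW") resolves to kb/NET/BASE_PROTO/IPFLOW.model, so adding a new domain amounts to creating a new directory.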
4.2 Data Normalizer
The data normalizer maps a data record to the event format defined in Section 3.2. Raw data accepted by the normalizer can be in the form of trace files, packet dumps, audit logs, security logs, syslogs, kernel logs or script output, with the only requirement that each data record have a timestamp and a message field. Specialized plugins in the normalizer convert each type of raw data into corresponding events. Figure 3(b) shows a possible event
[Pages 9 to 12 are missing from this copy.]
of the next scan to be the same as the previously infected host. The forward-dependent attribute src is initialized automatically the first time single_spread is parsed by considering it to be a dynamic ($$) variable. The next iteration over spread_chain then uses the values as determined dynamically by single_spread.
5.4 Modeling Dynamic Change
Dynamic changes are a fundamental characteristic of networked and distributed environments. One example of a dynamic change is the change in the rate of a stream of packets due to an anomalous condition such as a DoS attack. Our objective in this case study is to model an expected reduction in the rate of legitimate HTTP traffic due to a DoS attack on a server. Our raw data consists of IDS DoS attack alerts and HTTP packets.
The DYNAMIC_CHANGE model, containing only the relevant aspects, is described in Figure 7(b). Line 2 defines a state capturing an HTTP packet between a source and destination. Line 3 defines a state capturing a DoS attack alert, additionally requiring the destination to be the same as the destination in the HTTP packet. Lines 4 and 5 describe the HTTP packet stream rates before and after the attack respectively. The change boundary is defined by the attack_event that is triggered once the attack starts. Since attack_event represents a single event, it has the same starttime and endtime. Line 6 uses the ew (endswith) operator to define the attack_start condition, which specifies that the http_stream_at100 behavior ends within five seconds of the attack_event. The DYNAMIC_CHANGE model is then an assertion that the HTTP stream rate reduces following the attack.
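The assertion behind this case study can be approximated outside the behavior language. The Python sketch below splits a list of HTTP packet timestamps at the attack time, computes each stream's rate as size divided by duration (the definition of rate in Table 2), and checks that the pre-attack stream ends within a window of the attack and that the rate drops afterwards. The function names, the five-second default window and the input format are assumptions of this sketch; it does not reproduce the model of Figure 7(b).

```python
def stream_rate(timestamps):
    """Events per time unit (bcount / duration); None if duration is zero."""
    if len(timestamps) < 2:
        return None
    span = timestamps[-1] - timestamps[0]
    return len(timestamps) / span if span > 0 else None


def rate_drops_after_attack(http_times, attack_time, window=5.0):
    """True if the HTTP rate after the attack is lower than the rate before.

    http_times must be sorted; the attack is a single event, so its
    starttime and endtime are both attack_time.
    """
    before = [t for t in http_times if t <= attack_time]
    after = [t for t in http_times if t > attack_time]
    rate_before, rate_after = stream_rate(before), stream_rate(after)
    if rate_before is None or rate_after is None:
        return False
    # The pre-attack stream must end within `window` of the attack event.
    ends_near_attack = (attack_time - before[-1]) <= window
    return ends_near_attack and rate_after < rate_before
```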
5.5 Composing Models
Our final case study demonstrates the ease of composing and extending existing models to define semantically relevant higher-level behavior.
We combine our previously defined models DNSKAMINSKY and WORMSPREAD to create a COMBINED_ATTACK scenario, as shown in Figure 7(c). Line 2 captures the behavior where a worm infects a host machine and then scans and infects another host. Line 3 describes the behavior where the worm launches a DNS Kaminsky attack on some DNS server from the last infected host. We do not specify any server for the DNS Kaminsky attack because the DNSKAMINSKY model is abstract and infers the destination dynamically. Line 4 is the final behavior model combining both attacks. In line 3, we constrain only the sip and leave the other attributes unspecified. This demonstrates the ability to extend imported models with only the desired attribute values while leaving the others as defined in the imported model.
[Figure 8 plot. x-axis: Total Events Processed (0 to 80,000); y-axis: Runtime (minutes) (0 to 60). Plotted behaviors: b1 = cState; b2 = iState; b3 = iState ~> iState; b4 = iState ~> dState; b5 = iState ~> dState ~> dState ~> dState.]
Figure 8: Plot of runtime against number of events for five types of behavior complexity. Behaviors containing dependent value states (dStates) result in quadratic complexity.
6 Performance Analysis
A common approach to semantic-level analysis involves the use of custom scripts or tools encoding context-specific semantics. Since custom scripts and tools can be written using a variety of programming and optimization techniques, any evaluation of our generic framework against them would be very subjective and thus flawed. Instead, we choose to report the raw runtime performance of our prototype implementation on five basic analysis tasks over event datasets of increasing size.
The runtime performance of the framework depends on the language constructs, input data, analysis algorithm and implementation mechanisms used. Since our primary focus in this paper is on enabling semantic functionality, we prototyped the framework in Python using a SQLite database as the backend for storing events. The input events used were PKT_DNS events collected for the case study in Section 5.2. The performance analysis was conducted on a laptop with an Intel Pentium-M processor running at 1.86 GHz and with 2 GB of memory.
We measure runtime as a function of two variables: (a) the number of events input to the algorithm, and (b) the behavior complexity, defined as the processing complexity of state propositions in a behavior formula. As discussed in Section 3.3, there are three types of state propositions based on attribute assignments: constant value attributes, denoted as cState; dependent value attributes, denoted as dState; and dynamic attribute values, denoted as iState. These states can be combined to form five basic behaviors, each representing a basic semantic analysis task: b1 = (cState) represents extracting events with known attributes and values; b2 = (iState) represents extracting events with particular attributes but unknown values; b3 = (iState ~> iState) represents extracting causally correlated yet value-independent events; b4 = (iState ~> dState) represents extracting causally correlated and value-dependent events; and b5 = (iState ~> dState ~> dState ~> dState) represents extracting a long chain of causal events. Although we limit our analysis to the ~> operator, all operators incur uniform processing overhead in the algorithm, and thus yield similar performance results. The chosen event set and behaviors are representative of a worst-case input to the framework. We measure the performance of the above behaviors over event sets in increments of 10,000 events, and stop at the event set for which runtime exceeds 60 minutes.
The results are averaged over three runs and are shown in Figure 8. The plots for the behaviors consisting only of cStates and iStates (b1, b2 and b3) tend to be linear, as discussed in Section 4.4. One would expect behavior b5, containing three dStates, to show significantly higher runtime than behavior b4, which contains only one dState. However, both show quadratic performance because, in a chain of dependent states, the states further along the chain process fewer events than the states at the front of the chain. We thus see that runtime quickly becomes quadratic given a worst-case set of events and behaviors containing dependent state propositions. The current Python and SQLite-based implementation also adds a penalty to the framework runtime. We will investigate these issues as part of our future work.
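The gap between linear and quadratic growth can be illustrated with a naive stand-in for the matching step. The Python sketch below is not the framework's algorithm (that is described in Section 4.4, which is not reproduced here); it only counts comparisons for a constant-value filter and for a dependent-value match in which each event searches all later events for a partner. The function names and the (src, dst) event encoding are assumptions of this sketch.

```python
def scan_constant(events, wanted):
    """cState-style filter: exactly one comparison per event."""
    comparisons, hits = 0, []
    for event in events:
        comparisons += 1
        if event == wanted:
            hits.append(event)
    return hits, comparisons


def scan_dependent(events):
    """(iState ~> dState)-style match: for each event, search the later
    events for one whose (src, dst) is the reverse of its own."""
    comparisons, pairs = 0, []
    for i, (src, dst) in enumerate(events):
        for later in events[i + 1:]:
            comparisons += 1
            if later == (dst, src):
                pairs.append(((src, dst), later))
                break
    return pairs, comparisons


def worst_case(n):
    """n events with no reverse partner, so every inner search runs to the end."""
    return [(k, k + 1) for k in range(n)]
```

With this worst-case input the dependent match performs n(n − 1)/2 comparisons, so doubling the number of events roughly quadruples the work, whereas the constant-value filter grows in step with n.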
7 Conclusion and Future Work
In this paper, we presented a behavior-based semantic analysis framework that allows the user to analyze data at a higher level of abstraction. Typically, system experts rely on their intuition and experience to manually analyze and categorize scenarios and then hand-craft rules and patterns for analysis. Because of the manual and ad hoc nature of this analysis process, there is limited extensibility and composability of analysis strategies. In this paper we show that our approach is more systematic, can retain expert knowledge, and supports composing behaviors from existing models. We evaluated the utility of our framework against five analysis scenarios, which demonstrated the ease with which a user's higher-level understanding of system operation can be expressed as behavior models over data.
Our future work includes investigating the scale and efficiency issues that arise when processing large volumes of data in both offline and real-time settings such as intrusion detection. We will investigate stream-based SQL query extensions [6] to improve performance. We will also investigate extending our logic with existential and universal quantifiers. Currently, our framework requires a user to either manually specify behavior models or use existing models from the knowledge base to explore data. To further exploratory analysis, we would need to alert users to interesting unanticipated behaviors. We are exploring data mining algorithms to automatically discover and compose behavior models from data.
The fundamental goal of the behavior-based semantic analysis framework is to introduce a semantic approach to data analysis in networked and distributed systems research and operations. We hope that this paper serves as a catalyst for further research on semantic data analysis.
References
[1] ALLEN, J. Maintaining Knowledge about Temporal Intervals. Communications of the ACM 26, 11 (Nov. 1983), 832–843.
[2] BENZEL, T., BRADEN, R., KIM, D., NEUMAN, C., JOSEPH, A., SKLOWER, K., OSTRENGA, R., AND SCHWAB, S. Experience with DETER: A Testbed for Security Research. In 2nd Intl. Conf. on Testbeds and Research Infrastructures for the Devel. of Networks and Communities - TRIDENTCOM (2006), p. 10.
[3] BÉRARD, B. Systems and Software Verification: Model-checking Techniques and Tools. Springer, 2001.
[4] ELLIS, D. R., AIKEN, J. G., ATTWOOD, K. S., AND TENAGLIA, S. D. A Behavioral Approach to Worm Detection. In Proc. of the ACM workshop on Rapid malcode (2004), pp. 43–53.
[5] HUSSAIN, A., HEIDEMANN, J., AND PAPADOPOULOS, C. A Framework For Classifying Denial of Service Attacks. Proc. of the Conf. on Applications, Technologies, Architectures, and Protocols for Comp. Comm. - SIGCOMM (2003), 99.
[6] JAIN, N., MISHRA, S., SRINIVASAN, A., GEHRKE, J., WIDOM, J., BALAKRISHNAN, H., ÇETINTEMEL, U., CHERNIACK, M., TIBBETTS, R., AND ZDONIK, S. Towards a Streaming SQL Standard. Proc. VLDB Endow. 1 (August 2008), 1379–1390.
[7] KAMINSKY, D. Multiple DNS Implementations Vulnerable to Cache Poisoning. http://www.kb.cert.org/vuls/id/800113, 2008.
[8] KINDER, J., KATZENBEISSER, S., SCHALLHART, C., AND VEITH, H. Detecting Malicious Code by Model Checking. In Intrusion and Malware Detection and Vuln. Assessment, K. Julisch and C. Kruegel, Eds., vol. 3548 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2005, pp. 174–187.
[9] LAMPORT, L. The Temporal Logic of Actions. ACM Trans. Program. Lang. Syst. 16, 3 (1994), 872–923.
[10] LOO, B. T., CONDIE, T., GAROFALAKIS, M., GAY, D. E., HELLERSTEIN, J. M., MANIATIS, P., RAMAKRISHNAN, R., ROSCOE, T., AND STOICA, I. Declarative Networking: Language, Execution and Optimization. In Proc. of ACM SIGMOD (2006), pp. 97–108.
[11] Metasploit Framework Website. http://www.metasploit.com/.
[12] MIRKOVIC, J., SOLLINS, K., AND WROCLAWSKI, J. Managing the Health of Security Experiments. In Proc. of the conf. on Cyber Security Experimentation and Test (2008), USENIX, pp. 7:1–7:6.
[13] NALDURG, P., SEN, K., AND THATI, P. A Temporal Logic Based Framework for Intrusion Detection. In Proc. of the 24th IFIP Intl. Conf. on Formal Tech. for Net. & Dist. Sys. (2004).
[14] PAXSON, V. Bro: A System for Detecting Network Intruders in Real-time. Comput. Networks 31, 23-24 (1999), 2435–2463.
[15] ROGER, M., AND GOUBAULT-LARRECQ, J. Log Auditing through Model-Checking. In Proc. of the 14th IEEE Computer Security Foundations Workshop (2001), pp. 220–236.
[16] Splunk Website. http://www.splunk.com/.
[17] Semantic Analysis Framework Website. http://thirdeye.isi.deterlab.net/.
[18] VAARANDI, R. SEC - A Lightweight Event Correlation Tool. IEEE Workshop on IP Operations and Management (2002), 111–115.
[19] Wireshark Website. http://www.wireshark.org/.