Demo Abstract: ThirdEye: Diagnostics and Analytics Framework for Testbed Experiments
Arun Viswanathan
USC/ISI
aviswana@isi.edu
Jelena Mirkovic
USC/ISI
mirkovic@isi.edu
Karen Sollins
MIT
sollins@csail.mit.edu
Abstract—Testbed experiments are distributed by nature, which
makes problem detection, diagnosis and remediation challenging
and time-consuming. Large-scale experiments are difficult to
monitor manually and complex experiments fail due to the
smallest misconfiguration or unforeseen problems which
ultimately impact their correctness, repeatability and analysis.
Traditional practices of monitoring, debugging and analyzing
experiments are experiment-specific, consume valuable
experimenter time and increase the barrier for complex
experimentation. We propose ThirdEye, an extensible,
experiment-independent, diagnostics and analytics framework
providing a unified interface to monitor, diagnose, analyze and
visualize testbed experiments.
I. INTRODUCTION
Testbeds for network and security experimentation (like
Deter[1], PlanetLab[2]) provide an efficient and cost-effective
way to run experiments. Such testbeds, and the experiments
running on them, tend to be complex and consist of a large
number of interconnected network, operating system and
application components distributed over a large number of
physical systems. Failures in these components during an
experiment impact its correctness, repeatability and analysis.
As an example, consider an experiment whose results are correct
only if the background HTTP traffic rate is held constant. If
the web server crashes during the experiment due to an
unforeseen failure, an experimenter who did not anticipate this
would draw the wrong conclusions. In many such situations, it is
not only hard to diagnose the cause of failure but also to
ascertain if a failure has occurred. Failures also tend to have
cascading effects and thus it is difficult to correlate multiple
failures manually to localize the original failure. Dependencies
of applications, OS components and physical systems further
make it difficult to share experiments, since this often involves
porting them to a different environment. Preliminary results
from an ongoing survey of testbed users indicate that users
rate analysis and debugging as their most time-consuming
activities on testbeds.
The above discussion identifies two broad problems in
testbed experiments which form the motivation for this work:
Diagnosis and Analysis. Diagnosis deals with identifying the
nature or cause of some phenomenon (failure, problem or
anomaly) while Analysis is the process of extracting
information from the experimental output (in the form of logs
and dumps) that helps the user understand the experiment's
behavior. Diagnosis is hard because: a) Complex and unknown
dependencies exist between the testbed, software and
experiment components and b) Unforeseen and unexpected
failures happen in large complex systems. Analysis is hard
because: a) Experiment data in the form of logs and dumps is
distributed within experiment nodes and across nodes, making
it non-trivial to make correlations and b) Expertise is required
to process the vast amounts of distributed data to mine insights,
patterns and anomalies.
Traditional practices of monitoring, diagnosing and
analyzing experiments are experiment-specific and domain-
specific, consume valuable experimenter time and increase the
barrier for complex experimentation. Diagnosis and analysis
are well-researched areas, and there is a plethora of
existing work using data mining[5], machine learning[6] and
fault localization[7]. However, most of this work is
domain-dependent and relies on specific data formats,
techniques and domain-specific insights to diagnose and
analyze. Such assumptions cannot be made in the case of
generic testbed experiments and thus these techniques cannot
be directly applied to the problem we aim to address.
This work proposes a generic, extensible and experiment-
independent diagnostics and analytics framework called
ThirdEye. Specifically, ThirdEye tries to address the broad
challenges of diagnosis and analysis using a combination of
data mining, machine learning and interactive visualization
techniques on the experiment data (like logs and dumps) to
reveal patterns, insights or problems to the experimenter. From
the experimenter's perspective, ThirdEye provides a unified
monitoring, analytics and diagnostics interface.
II. THIRDEYE FRAMEWORK
Experiments produce a plethora of logs distributed within a
node and across nodes during their lifetime. Each log record
captures a unit of experiment activity, which we call an Event.
Example events are PACKET
events from tcpdump and APPLICATION events from server
logs. Our events are timestamped and carry event details as
attribute-value pairs. Events can be correlated because of the
attributes they contain and the spatial/temporal relations
between events induced by higher-level experiment logic.
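
The event model described above can be illustrated with a minimal
sketch. It is not ThirdEye's implementation: the Event fields, the
correlate helper, the attribute names (src, dst, url) and the sample
values are all hypothetical, chosen only to show how timestamped
attribute-value events from different sources might be paired on a
shared attribute within a time window.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Event:
    """A timestamped unit of experiment activity (hypothetical schema)."""
    timestamp: float                 # seconds since experiment start
    node: str                        # node that logged the event
    kind: str                        # e.g. "PACKET" or "APPLICATION"
    attrs: Dict[str, str] = field(default_factory=dict)  # attribute-value pairs


def correlate(events: List[Event], key: str, window: float) -> List[Tuple[Event, Event]]:
    """Pair events of different kinds that share attrs[key] and occur
    at most `window` seconds apart (an assumed, very simple correlation rule)."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            # Events are time-ordered, so nothing later can fall in the window.
            if second.timestamp - first.timestamp > window:
                break
            same_value = key in first.attrs and first.attrs[key] == second.attrs.get(key)
            if first.kind != second.kind and same_value:
                pairs.append((first, second))
    return pairs


# Hypothetical sample data: two packet events and one server-log event.
events = [
    Event(0.10, "client1", "PACKET", {"src": "10.0.0.2", "dst": "10.0.0.1"}),
    Event(0.12, "server", "APPLICATION", {"src": "10.0.0.2", "url": "/index.html"}),
    Event(5.00, "client2", "PACKET", {"src": "10.0.0.3", "dst": "10.0.0.1"}),
]

pairs = correlate(events, key="src", window=1.0)
for packet, app in pairs:
    print(packet.node, "->", app.node, "matched on src =", packet.attrs["src"])
```

In this sketch the packet from client1 is paired with the server log
entry, while the later packet from client2 has no matching
APPLICATION event, which is the kind of gap a diagnosis step might
flag.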
Our key intuition is that event correlations enable: a)
Reconstruction of an experiment run by finding patterns (like
loops, repetitions, control-flow) in the event data, b) Failure
detection, discovery and localization, c) Verification of
experiment correctness and d) Extraction of patterns from
experiments which can be converted into experiment
Expectations[4] to ensure future repeatability. The ThirdEye


