Detecting Network Disruptions with Network-wide Analysis

by Y Huang, N Feamster, A Lakhina, J Xu
ACM SIGMETRICS (2007)

Available from www.gtnoise.net

Diagnosing Network Disruptions with Network-Wide Analysis
Yiyi Huang∗, Nick Feamster∗, Anukool Lakhina†, Jun (Jim) Xu∗
∗ College of Computing, Georgia Tech, † Guavus, Inc.
ABSTRACT
To maintain high availability in the face of changing network condi-
tions, network operators must quickly detect, identify, and react to
events that cause network disruptions. One way to accomplish this
goal is to monitor routing dynamics, by analyzing routing update
streams collected from routers. Existing monitoring approaches
typically treat streams of routing updates from different routers as
independent signals, and report only the loud events (i.e., events
that involve a large volume of routing messages). In this paper, we
examine BGP routing data from all routers in the Abilene backbone
for six months and correlate them with a catalog of all known dis-
ruptions to its nodes and links. We find that many important events
are not loud enough to be detected from a single stream. Instead,
they become detectable only when multiple BGP update streams
are simultaneously examined. This is because routing updates ex-
hibit network-wide dependencies.
This paper proposes using network-wide analysis of routing in-
formation to diagnose (i.e., detect and identify) network disrup-
tions. To detect network disruptions, we apply a multivariate anal-
ysis technique to dynamic routing information (i.e., update traffic
from all the Abilene routers) and find that this technique can detect
every reported disruption to nodes and links within the network
with a low rate of false alarms. To identify the type of disruption,
we jointly analyze both the network-wide static configuration and
details in the dynamic routing updates; we find that our method can
correctly explain the scenario that caused the disruption. Although
much work remains to make network-wide analysis of routing data
operationally practical, our results illustrate the importance and po-
tential of such an approach.
Categories and Subject Descriptors
C.2.6 [Computer Communication Networks]: Internetworking;
C.2.3 [Computer Communication Networks]: Network Opera-
tions (network management)
General Terms
Algorithms, Management, Reliability, Security
Keywords
anomaly detection, network management, statistical inference
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMETRICS’07, June 12-16, 2007, San Diego, California, USA.
Copyright 2007 ACM 978-1-59593-639-4/07/0006 ...$5.00.
[Figure 1 panels: (a) Internal disruption; (b) External disruption. Each panel is labeled with routers A and B, a destination, and the "before" and "after" states.]
Figure 1: Both internal and external network disruptions cause corre-
lated routing changes at groups of routers within a single network.
1. Introduction
To achieve acceptable end-to-end performance in the face of dy-
namic network conditions (e.g., traffic shifts, link failures, and security
incidents), network operators must keep constant watch over
the status of their networks. Network disruptions (changes in net-
work conditions that are caused by underlying failures of routing
protocols or network equipment) have a significant impact on net-
work performance and availability. Operators today have myriad
datasets (e.g., NetFlow, SNMP, syslogs) at their disposal to mon-
itor for network disruptions, all of which have proven difficult to
use for extracting actionable events from background noise. Op-
erators have had particular trouble using routing data to detect and
pinpoint network disruptions, even though analyzing routing data
holds promise for exposing many important network reachability
failures. This missed opportunity results from the fact that routing
data is voluminous, complex and noisy, which makes the mining of
network disruptions challenging.
Existing approaches for inspecting routing dynamics in a single
network (e.g., [25, 30]) primarily analyze each routing stream with-
out considering the dependencies across multiple routing streams
that arise from the network configuration and topology. This ap-
proach leaves much room for improvement, because any informa-
tion about network disruptions that exists in a single routing update
stream is obscured by a massive amount of noise. Furthermore, no
network model can explain the temporal relationships among up-
dates in a single routing stream, since the updates have little (and
often no) temporal dependency. As such, these techniques are un-
able to capture typical network conditions to recognize disruptions,
and therefore rely on fixed thresholds to detect only those events
that cause a large number of updates. But, as we will see in this pa-
per, many important operational events do not necessarily generate
a large number of updates at a single router. To detect such opera-
tional events, it is necessary to first continuously monitor and learn
the typical routing dynamics of the network; deviations from this
typical behavior indicate a routing incident worth investigating.
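
To make this contrast concrete, here is a minimal sketch (not taken from the paper) that places a fixed-threshold detector next to one that first learns a baseline from a quiet period. The update counts, the threshold of 500, and the factor k = 3 are illustrative assumptions, and the mean/standard-deviation baseline merely stands in for whatever model of typical behavior an operator would actually use.

```python
from statistics import mean, pstdev


def fixed_threshold_alarms(stream, threshold):
    """Return indices of time bins whose update count exceeds a fixed threshold."""
    return [t for t, count in enumerate(stream) if count > threshold]


def baseline_alarms(stream, history, k=3.0):
    """Flag bins deviating more than k standard deviations from a learned baseline."""
    mu = mean(history)
    sigma = pstdev(history) or 1.0  # fall back to 1.0 if the history is flat
    return [t for t, count in enumerate(stream) if abs(count - mu) / sigma > k]


# Hypothetical per-bin update counts at one router.
history = [4, 6, 5, 7, 5, 4, 6, 5]   # quiet period used to learn typical behavior
stream = [5, 6, 40, 5, 900, 4]       # bin 2: quiet event, bin 4: loud event

loud_only = fixed_threshold_alarms(stream, threshold=500)
learned = baseline_alarms(stream, history)
print(loud_only, learned)
```

In this toy data the fixed threshold reports only the loud bin, while the learned baseline also flags the quieter one. A single stream is still noisy in practice, which is why the paper goes on to combine streams from many routers.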
This paper proposes a new approach to learning typical routing
dynamics by explicitly harnessing the network-wide dependencies
that are inherent to the routing updates seen by routers in a single
network. Groups of updates from different routers, when analyzed
together, reflect dependencies arising from the network topology
and static routing configuration: routers’ locations in the network
topology relative to each other, how they are connected to one an-
other, the neighboring networks they share in common, etc. For ex-
ample, Teixeira et al. observed that the failure of a single link inside
a network may result in multiple routers simultaneously switching
egress routers (i.e., the router used to exit the network) [28] (Fig-
ure 1(a)); similarly, the failure of a single BGP peering session
results in similar correlated disruptions across the network (Fig-
ure 1(b)). Because of these dependencies, network disruptions can
appear significant when the effect of the event is viewed across all
of the routers in the network, even if the number of updates seen by
any single router is small.
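
One way to picture this network-wide view is to bin timestamped updates into a matrix with one row per time bin and one column per router. The sketch below is an illustration only: the router names, the 600-second bin width, and the (timestamp, router) input format are assumptions, not details given in this section.

```python
def build_count_matrix(updates, routers, bin_seconds, num_bins):
    """Return matrix[t][r] = number of updates router r saw in time bin t."""
    index = {name: i for i, name in enumerate(routers)}
    matrix = [[0] * len(routers) for _ in range(num_bins)]
    for timestamp, router in updates:
        t = int(timestamp // bin_seconds)
        if 0 <= t < num_bins:
            matrix[t][index[router]] += 1
    return matrix


def routers_active_per_bin(matrix, min_updates=1):
    """Count how many routers saw at least min_updates updates in each bin."""
    return [sum(1 for count in row if count >= min_updates) for row in matrix]


# Hypothetical updates: (seconds since start, router name).
routers = ["r1", "r2", "r3"]
updates = [(10, "r1"), (620, "r1"), (630, "r2"), (650, "r3"), (1900, "r2")]

matrix = build_count_matrix(updates, routers, bin_seconds=600, num_bins=4)
active = routers_active_per_bin(matrix)
print(matrix)
print(active)
```

In the second bin every router sees only one update, so no single stream looks unusual, yet all three routers are active at once. That near-simultaneous activity is the kind of signal a per-router view would miss.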
This paper presents the first known study of network-wide corre-
lation of routing updates in a single network, demonstrates that de-
tection schemes should incorporate network-wide analysis of rout-
ing dynamics, and explores the extent to which multivariate anal-
ysis could expose these events. Table 1 summarizes the major
findings of this paper, which presents the following contributions:
First, we study how actual, documented network disruptions
are reflected in routing data. Several previous studies examine
how BGP routing updates correlate with poor path performance [5,
13, 29], but these studies do not correlate BGP instability with
ground-truth, known disruptions (e.g., node and link failures) in
an operational network. Our work examines how known, docu-
mented network disruptions are reflected in the BGP routing data
within that network. We perform a joint analysis of documented
network component failures in the Abilene network and Abilene
BGP routing data for six months in 2006 and find that most net-
work disruptions are reflected in BGP data in some way, though
often not via high-volume network events.
Second, we explore how network-wide analysis can expose
classes of network disruptions that are not detectable with ex-
isting techniques. After studying how known disruptions appear
in BGP routing data, we explore how applying multivariate analy-
sis techniques, which are specifically designed to analyze multiple
statistical variables in parallel, could better detect these disruptions.
We explore how applying a specific multivariate analysis technique,
Principal Component Analysis (PCA), to routing message streams
across the routers in a single network can extract network events
that existing techniques would fail to detect.
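
The idea behind a PCA-based subspace approach can be sketched as follows: learn the dominant direction of variation across routers from typical data, then measure how much of a new observation falls outside it. This is a simplified illustration rather than the paper's procedure: it keeps only one principal component (estimated by power iteration), uses made-up update counts for three routers, and leaves out the choice of a detection threshold.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean; return the centered rows and the means."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix], means


def top_component(centered, iterations=200):
    """Estimate the first principal component by power iteration on X^T X."""
    dims = len(centered[0])
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        scores = [sum(x * vi for x, vi in zip(row, v)) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def residual_energy(row, means, component):
    """Squared norm of the part of a row not explained by the component."""
    x = [a - m for a, m in zip(row, means)]
    score = sum(a * c for a, c in zip(x, component))
    return sum((a - score * c) ** 2 for a, c in zip(x, component))


# Hypothetical training data: update counts per time bin at three routers
# whose counts usually rise and fall together.
training = [[2, 3, 2], [4, 4, 5], [6, 5, 6], [8, 9, 8], [10, 10, 9]]
centered, means = center_columns(training)
component = top_component(centered)

normal_energy = residual_energy([7, 7, 7], means, component)
anomalous_energy = residual_energy([10, 2, 10], means, component)
print(normal_energy, anomalous_energy)
```

Every count in the second test row lies within the range seen in training, so a per-router threshold would not fire, but the row breaks the usual correlation among routers and therefore leaves a much larger residual than the first row.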
Third, we present new techniques for combining analysis of
routing dynamics with static configuration analysis to localize
network disruptions. In addition to detecting failures, we develop
algorithms to help network operators identify likely failure scenar-
ios. Our framework helps network operators explain the source of
routing faults by examining the semantics of the routing messages
involved in a group of routing updates in conjunction with a model
of the network, derived from static configuration analysis. This
hybrid analysis approach is the first known framework for using a
combination of routing dynamics and static routing configuration to
help operators detect and isolate the source of network disruptions.
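
As a rough illustration of combining a static model with dynamic observations, the sketch below ranks candidate failure scenarios by how well the routers each scenario is expected to affect match the routers that actually emitted updates. This heuristic is a hypothetical stand-in and is not the identification algorithm the paper develops; the model contents, component names, and the Jaccard score are all assumptions made for the example.

```python
def rank_candidates(model, observed_routers):
    """Rank components by Jaccard overlap between expected and observed routers."""
    observed = set(observed_routers)
    scored = []
    for component, expected in model.items():
        expected = set(expected)
        union = expected | observed
        score = len(expected & observed) / len(union) if union else 0.0
        scored.append((score, component))
    # Highest score first; ties broken alphabetically by component name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(component, score) for score, component in scored]


# Hypothetical model derived from static configuration: for each component,
# the routers expected to send updates if that component fails.
model = {
    "node:r2": {"r1", "r2", "r3"},
    "link:r1-r2": {"r1", "r2"},
    "peer:AS64500@r3": {"r3"},
}

ranking = rank_candidates(model, ["r1", "r2"])
print(ranking)
```

A fuller approach would also use the semantics of the routing messages themselves, as described above, rather than only the set of routers involved.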
Previous work has taken on the audacious goal of Internet-wide
root cause analysis [3, 8, 31], but all of these techniques have
faced two fundamental limitations: lack of information in any
single routing stream and poor knowledge of global router-level
topology. In this work, we recommend revisiting the use of BGP
routing data within a single network using multiple data streams,
where correlations across streams can provide additional informa-
tion about the nature of a failure, and access to network configura-
tions can provide valuable information about the network topology
(e.g., the routers that have connections to a particular neighbor-
ing network). Our goal is not primarily to evaluate or optimize
a specific multivariate analysis technique (e.g., PCA), but rather
(1) to explore the nature of how disruptions in a single network
are reflected network-wide and temporally in BGP routing data,
(2) to argue in general for the utility of using network-wide analy-
sis techniques for improving detection of network disruptions, and
(3) to demonstrate how, once a disruption is detected, network models
based on static routing configurations can help operators identify and
isolate the cause of these disruptions.

Finding                                               Location
----------------------------------------------------  -------------
Many network disruptions cause only low volumes of    §3.2, Fig. 5
routing messages at any single router.

About 90% of local network disruptions are visible    §4.1, Fig. 8
in BGP routing streams.

The number of updates resulting from a disruption     §4.2, Fig. 6
may vary by several orders of magnitude.

About 75% of network disruptions result in near-      §4.3, Fig. 8
simultaneous BGP routing messages at two or more
routers.

The PCA-based subspace method detects 100% of node    §5.3, Tab. 3
and link disruptions and about 60% of disruptions
to peering links, with a low rate of false alarms.

The identification algorithm based on hybrid static   §6.3, Fig. 11
and dynamic analysis correctly identifies 100% of
node disruptions, 74% of link disruptions, and 93%
of peer disruptions.

Table 1: Summary of major results.
Many hurdles must be surmounted to make our methods practi-
cal, such as (1) building a system to collect and process distributed
routing streams in real time; and (2) determining the features in
each signal that are most indicative of high-impact disruptions (we
use number of updates, as most existing methods do, but we be-
lieve that more useful features may exist). Rather than providing
the last word on analysis of routing dynamics, this paper opens
a new general direction for analyzing routing data based on the
following observation: The structure and configuration of the net-
work gives rise to dependencies across routers, and any analysis
of these streams should be cognizant of these dependencies, rather
than treating each routing stream as an independent signal. In ad-
dition, we believe that our combined use of static and dynamic anal-
ysis for helping network operators identify the cause and severity
of network disruptions represents an important first step in bridg-
ing the gap between static configuration analysis and monitoring of
routing dynamics.
2. Background
We now present necessary background material. We first de-
scribe the general problems involved in using routing dynamics
to detect and identify network disruptions. Then, we explain how
changes to conditions within a single network can give rise to rout-
ing dynamics that exhibit network-wide correlations across multi-
ple routing streams.
2.1 Problem Overview and Approach
Diagnosis entails two complementary approaches: proactive tech-
niques, which analyze the network configuration (either stati-
