Measuring the effects of internet path faults on reactive routing
- ISSN: 01635999
- ISBN: 1581136641
- DOI: 10.1145/885651.781043
Abstract
Empirical evidence suggests that reactive routing systems improve resilience to Internet path failures. They detect and route around faulty paths based on measurements of path performance. This paper seeks to understand why and under what circumstances these techniques are effective. To do so, this paper correlates end-to-end active probing experiments, loss-triggered traceroutes of Internet paths, and BGP routing messages. These correlations shed light on three questions about Internet path failures: (1) Where do failures appear? (2) How long do they last? (3) How do they correlate with BGP routing instability? Data collected over 13 months from an Internet testbed of 31 topologically diverse hosts suggests that most path failures last less than fifteen minutes. Failures that appear in the network core correlate better with BGP instability than failures that appear close to end hosts. On average, most failures precede BGP messages by about four minutes, but there is often increased BGP traffic both before and after failures. Our findings suggest that reactive routing is most effective between hosts that have multiple connections to the Internet. The data set also suggests that passive observations of BGP routing messages could be used to predict about 20% of impending failures, allowing re-routing systems to react more quickly to failures.
Nick Feamster, David G. Andersen, Hari Balakrishnan, and M. Frans Kaashoek
MIT Laboratory for Computer Science
200 Technology Square, Cambridge, MA 02139
{feamster,dga,hari,kaashoek}@lcs.mit.edu
Categories and Subject Descriptors
C.2.6 [Computer-Communication Networks]: Internetwork-
ing; C.4 [Performance of Systems]: Measurement Techniques
General Terms
Measurement, Performance, Reliability, Experimentation
1. Introduction
The prevalence of faults in the IP substrate results in frequent
performance degradations on Internet paths. These faults occur
for a variety of reasons, including physical link disconnection [7],
software errors [6], and router misconfiguration [13].
Faults cause path failures (outages), increase the volume of
routing traffic, and trigger route oscillations and path fluttering.
Their effects are visible to end hosts as broken connections,
excessive packet loss, and rapidly varying path quality.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMETRICS’03, June 10–14, 2003, San Diego, California, USA.
Copyright 2003 ACM 1-58113-664-1/03/0006 ...$5.00.
A number of recent proposals to improve the availability of
wide-area Internet connectivity use reactive routing. In this approach,
reactive routing systems use measurements of network-layer
path characteristics, such as reachability, loss, and latency, to
choose better paths. Most reactive routing systems use a combination
of active probes and passive traffic monitoring to decide
which paths are better; the differences are in how they
take advantage of alternate paths. Resilient Overlay Networks
(RON) [2] and Akamai’s SureRoute [15] re-route data packets
over an overlay network. These overlays (Figure 1) treat the un-
derlying Internet path between two nodes as a single link, form-
ing a higher-layer path through these nodes. Other systems use
routing changes to select between paths at the IP layer [17, 20,
21]. Empirical evidence suggests that such schemes often work
well at masking path failures, outages, and periods of extreme
congestion [2].
This paper addresses why and under what circumstances re-
active routing is able to overcome path failures by asking three
specific questions:
Where do failures appear? To mask path failures, destina-
tions on fault-prone paths must be reachable by alternate paths
that fail independently. To determine how well reactive routing
can mask failures, we must understand where failures occur in
the Internet.
How long do failures last? A routing system that takes
longer to react to a failure than the duration of the failure will
not meet its goals. To understand how reactive a system must
be, we must understand how long path failures last.
How do failures correlate with routing protocol messages?
In situations where BGP instability correlates with poor path
performance or path failures, BGP messages can serve as an in-
dicator of poor path performance. In these cases, reactive sys-
tems could detect failures with fewer active probes than would
otherwise be necessary. If BGP instability precedes path fail-
ures, reactive routing might even proactively route around some
failures before they occur. Using BGP information might there-
fore reduce the reaction time of reactive routing systems.
Figure 1: Routing around Internet path failures with overlays.
The success of overlays and other reactive routing
schemes depends on where failures are occurring and how
quickly an alternate path is discovered relative to the duration
of the failures. Predicting path failures could allow such
systems to pre-emptively route around them.
To answer these questions, we analyze one year of data collected
on a geographically and topologically diverse testbed of
31 hosts. The paths between these hosts traverse more than 50%
of the well-connected autonomous systems (AS’s) on the Internet.
The data includes correlated probes, where active probes
between hosts discover one-way path failures lasting longer than
2 minutes and trigger traceroutes along these paths when a failure
is discovered. We then correlate these observed failures with
BGP routing information collected at eight monitoring hosts at
the same time.
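As a rough illustration of the "failures lasting longer than 2 minutes" criterion above, the following Python sketch scans a time-ordered series of probe outcomes for one directed path and reports loss runs longer than a threshold. The input format, the function name find_failures, the 30-second probe spacing in the example, and the choice to measure a run from its first to its last lost probe are all assumptions made for illustration; they are not taken from the measurement system described here.

```python
FAILURE_THRESHOLD_SECONDS = 120  # "longer than 2 minutes" from the text


def find_failures(probes, threshold=FAILURE_THRESHOLD_SECONDS):
    """Find long loss runs on one directed path.

    probes: list of (timestamp_seconds, delivered) tuples, sorted by time.
    Returns a list of (start, end) pairs, where start and end are the
    timestamps of the first and last lost probe in a run whose span
    exceeds the threshold (an assumed definition of failure duration).
    """
    failures = []
    run_start = run_end = None
    for t, delivered in probes:
        if not delivered:
            if run_start is None:
                run_start = t  # first lost probe of a new run
            run_end = t
        else:
            # A delivered probe ends the current run, if any.
            if run_start is not None and run_end - run_start > threshold:
                failures.append((run_start, run_end))
            run_start = run_end = None
    # Handle a run that is still open at the end of the log.
    if run_start is not None and run_end - run_start > threshold:
        failures.append((run_start, run_end))
    return failures


# Hypothetical probe log: one long loss run and one short one.
example = [
    (0, True), (30, False), (60, False), (90, False), (120, False),
    (150, False), (180, False), (210, True), (240, True),
    (300, False), (330, False), (360, True),
]
long_failures = find_failures(example)
```

In this example only the first run (from 30 s to 180 s) spans more than 120 seconds; the second run is too short to count and would not trigger a traceroute under this assumed rule.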
Our method does not pin down where a failure occurs; rather,
it captures the location where a failure appears. The IP rout-
ing substrate reacts to faults it detects by sending routing up-
dates that alter the flow of traffic. Because of this response, the
location where a traceroute observes a failure may not be the
place where the actual failure occurred and may change with
time. Because reactive routing systems must be able to route
around where IP routing failures appear rather than where the
faults may have originally occurred, this analysis is appropriate
for reactive routing systems.
To discover where failures appear, we present techniques for
assigning failures to a particular router, and assigning that router
to an AS. Using these techniques, we examine where and how
long failures appear. Finally, we investigate correlations be-
tween path failures and routing instability, as observed from co-
located BGP route monitors. Table 1 summarizes the major re-
sults in this paper.
In Section 2 we discuss our data collection methods. Sec-
tion 3 discusses two algorithms that are central to the data anal-
ysis: alias resolution and AS assignment, and failure detection
and assignment. Section 4 discusses failure location and dura-
tion and their effects on reactive routing techniques. Section 5
presents observations of temporal correlations between path fail-
ures and BGP messages and suggests how BGP messages could
be used to detect and predict path failures. Section 6 surveys
relevant related work, and Section 7 concludes.
2. Data collection
Table 1: Summary of major results.
Result | Location
While a few paths are much more failure-prone than others, failures appear spread out over many different links, not just a few “bad” links. | Fig. 5 and 6
Failures appear more often inside AS’s than on links between them. | Table 7
90% of failures last less than 15 minutes, and 70% of failures last less than 5 minutes. | Fig. 7
BGP messages coincide with only half of the failures that reactive routing could potentially avoid, suggesting that these were failures that not even a “perfect” BGP could avoid. | Table 8
Reactive routing is potentially more effective at correcting failures for hosts with multiple Internet connections. | Sec. 4.3
BGP traffic is a good indicator that a failure has recently occurred or is about to occur. When BGP messages and failures coincide, BGP messages most often follow failures by 4 minutes. | Fig. 13 and 11

Figure 2: At each collection host, we initiate active probes
and collect BGP messages from the network’s border router.
The figure shows the configuration for MIT, which obtains
upstream connectivity from Genuity (AS 1) and the Northeast
Exchange (via AS 10578).
[Diagram labels: Genuity, AS 1, AS 10578, Internet 2, AS 3 (MIT), Border Router, iBGP, Collection Host.]

We performed measurements between February 2002 and
March 2003 on 31 NTP-synchronized nodes in the RON
testbed [19], listed in Table 2; during this period, we collected
about 60 GB of active probes, BGP messages, and failure-triggered
traceroutes. The topology generated by the pairwise
paths between these nodes, as measured by traceroute, covers 71
AS’s. The testbed topology contains paths that traverse most of
the “large” AS’s in the Internet. To rank the AS’s by size, we
counted the degree of each of the 15,040 nodes in the AS graph
from the Routeviews table dump of March 13, 2003 at midnight
PST (this technique is a commonly accepted way of approximating
the size of an AS [10]). On this date, the paths in our
testbed topology traversed 9 of the 11 AS’s that have an AS
degree larger than 500 and nearly one-half of the 54 AS’s that
have a degree larger than 100.
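The degree-based ranking described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the table dump has already been parsed into AS paths given as lists of integer AS numbers. The function names as_degrees and coverage, the input format, and the toy AS paths are assumptions for illustration, not the tooling used for this study.

```python
from collections import defaultdict


def as_degrees(as_paths):
    """Return {asn: degree}, where degree is the number of distinct
    neighboring AS's seen adjacent to asn in any AS path."""
    neighbors = defaultdict(set)
    for path in as_paths:
        # Collapse AS-path prepending (consecutive repeats of one AS).
        hops = [asn for i, asn in enumerate(path)
                if i == 0 or asn != path[i - 1]]
        for a, b in zip(hops, hops[1:]):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {asn: len(peers) for asn, peers in neighbors.items()}


def coverage(degrees, traversed, threshold):
    """Return (covered, total): how many AS's have degree > threshold,
    and how many of those appear in the traversed set."""
    large = {asn for asn, d in degrees.items() if d > threshold}
    return len(large & set(traversed)), len(large)


# Toy AS paths with arbitrary AS numbers (not real Routeviews data).
paths = [
    [3, 1, 701, 7018],
    [3, 1, 1, 3356],      # the repeated 1 models AS-path prepending
    [3, 10578, 11537],
    [7018, 701, 3356],
]
degrees = as_degrees(paths)
covered, total = coverage(degrees, traversed=[1, 3, 10578], threshold=2)
```

With real data, calling coverage with thresholds of 500 and 100 would produce counts of the kind quoted in the text (for example, 9 of 11), provided the traversed set is built from the AS's seen in the testbed traceroutes.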
We collect data (1) to measure the end-to-end connectivity
between hosts using active probes, (2) to determine the loca-
tion of observed failures using traceroutes to locations found un-
reachable by the active probes, and (3) to correlate BGP routing
changes with failures observed by active probes.
2.1 Active probing
An active probe consists of a request packet from the ini-
tiator to the target and, if the request gets through, a reply
packet from target to initiator. Each probe has a 32-bit sequence
number, which the hosts log along with the times at which packets
were sent and received. This approach allows us to compute
the one-way reachability between the hosts. A central monitor-
ing machine periodically collects and aggregates these logs as
described in Section 3.1. Our post-processing finds all probes
received within 60 minutes of when they were sent; this margin
accounts for clock skew of up to one hour if time synchroniza-
tion fails.
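The log-matching step described above might look roughly like the following Python sketch, which pairs sent and received log entries for one direction by sequence number and accepts a match only if the two timestamps are within 60 minutes of each other. The log format, the function name match_probes, and the use of an absolute time difference (so that a receiver clock running behind the sender is still tolerated) are assumptions for illustration; the actual post-processing is not specified at this level of detail.

```python
MAX_SKEW_SECONDS = 60 * 60  # the 60-minute matching margin from the text


def match_probes(sent_log, recv_log, max_skew=MAX_SKEW_SECONDS):
    """Classify probes in one direction as delivered or lost.

    sent_log: iterable of (seq, send_timestamp) logged by the sender.
    recv_log: iterable of (seq, recv_timestamp) logged by the receiver.
    Returns (delivered, lost) as lists of sequence numbers, in send order.
    """
    # A sequence number may recur (for example after wraparound), so keep
    # every receive time seen for it and let the time window disambiguate.
    recv_times = {}
    for seq, t_recv in recv_log:
        recv_times.setdefault(seq, []).append(t_recv)

    delivered, lost = [], []
    for seq, t_sent in sent_log:
        candidates = recv_times.get(seq, [])
        if any(abs(t_recv - t_sent) <= max_skew for t_recv in candidates):
            delivered.append(seq)
        else:
            lost.append(seq)
    return delivered, lost


# Hypothetical logs: probe 2 has no receive entry inside the window.
sent = [(1, 1000.0), (2, 1015.0), (3, 1030.0)]
recv = [(1, 1000.04), (3, 1030.05), (2, 9000.0)]
delivered, lost = match_probes(sent, recv)
```

Running the same matching separately on the request direction and the reply direction gives one-way reachability for each direction of a path.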