Answering what-if deployment and configuration questions with wise
ACM SIGCOMM Computer Communication Review (2008)
- ISSN: 01464833
- ISBN: 9781605581750
- DOI: 10.1145/1402946.1402971
Answering “What-If” Deployment and Configuration
Questions with WISE∗
Mukarram Bin Tariq‡†, Amgad Zeitoun§, Vytautas Valancius‡, Nick Feamster‡, Mostafa Ammar‡
mtariq@cc.gatech.edu, amgad@google.com, {valas,feamster,ammar}@cc.gatech.edu
‡ School of Computer Science, Georgia Tech. Atlanta, GA § Google Inc. Mountain View, CA
Abstract
Designers of content distribution networks often need to determine
how changes to infrastructure deployment and configuration affect
service response times when they deploy a new data center, change
ISP peering, or change the mapping of clients to servers. Today, the
designers use coarse, back-of-the-envelope calculations, or costly
field deployments; they need better ways to evaluate the effects
of such hypothetical “what-if” questions before the actual deploy-
ments. This paper presents What-If Scenario Evaluator (WISE),
a tool that predicts the effects of possible configuration and de-
ployment changes in content distribution networks. WISE makes
three contributions: (1) an algorithm that uses traces from exist-
ing deployments to learn causality among factors that affect service
response-time distributions; (2) an algorithm that uses the learned
causal structure to estimate a dataset that is representative of the
hypothetical scenario that a designer may wish to evaluate, and
uses these datasets to predict future response-time distributions;
(3) a scenario specification language that allows a network designer
to easily express hypothetical deployment scenarios without being
cognizant of the dependencies between variables that affect service
response times. Our evaluation, both in a controlled setting and
in a real-world field deployment at a large, global CDN, shows that
WISE can quickly and accurately predict service response-time dis-
tributions for many practical what-if scenarios.
Categories and Subject Descriptors: C.2.3 [Computer Commu-
nication Networks]: Network Operations, Network Management
General Terms: Algorithms, Design, Management, Performance
Keywords: What-if Scenario Evaluation, Content Distribution
Networks, Performance Modeling
∗This work is supported in part by NSF Awards CNS-0643974,
CNS-0721581, and CNS-0721559.
†Work performed while the author was visiting Google Inc.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGCOMM’08, August 17–22, 2008, Seattle, Washington, USA.
Copyright 2008 ACM 978-1-60558-175-0/08/08 ...$5.00.
1. INTRODUCTION
Content distribution networks (CDNs) for Web-based services
comprise hundreds to thousands of distributed servers and data
centers [1, 3, 9]. Operators of these networks continually strive to
improve the response times for their services. To perform this task,
they must be able to predict how service response-time distribution
changes in various hypothetical what-if scenarios, such as changes
to network conditions and deployments of new infrastructure. In
many cases, they must also be able to reason about the detailed ef-
fects of these changes (e.g., what fraction of the users will see at
least a 10% improvement in performance because of this change?),
as opposed to just coarse-grained point estimates or averages.
Various factors on both short and long timescales affect a CDN’s
service response time. On short timescales, response time can be
affected by routing instability or changes in server load. Occasion-
ally, the network operators may “drain” a data center for mainte-
nance and divert the client requests to an alternative location. In
the longer term, service providers may upgrade their existing facil-
ities, move services to different facilities or deploy new data centers
to address demands and application requirements, or change peer-
ing and customer relationships with neighboring ISPs. These in-
stances require significant planning and investment; some of these
decisions are hard to implement and even more difficult to reverse.
Unfortunately, reasoning about the effects of any of these
changes is extremely challenging in practice. Content distribution
networks are complex systems, and the response time perceived by
a user can be affected by a variety of inter-dependent and correlated
factors. Such factors are difficult to model or reason about
accurately, and back-of-the-envelope calculations are not precise.
This paper presents the design, implementation, and evaluation
of What-If Scenario Evaluator (WISE), a tool that estimates the ef-
fects of possible changes to network configuration and deployment
scenarios on the service response time. WISE uses statistical learn-
ing techniques to provide a largely automated way of interpreting
the what-if questions as statistical interventions. WISE takes as in-
put packet traces from Web transactions to model factors that af-
fect service response-time prediction. Using this model, WISE also
transforms the existing datasets to produce new datasets that are
representative of the what-if scenarios and are also faithful to the
working of the system, and finally uses these to estimate the sys-
tem response time distribution.
Although function estimation using passive datasets is a common
application in the field of machine learning, using these techniques
is not straightforward because they can only predict the response-
time distribution for a what-if scenario accurately if the estimated
function receives an input distribution that is representative of the
what-if scenario. Providing this input distribution presents difficul-
ties at several levels, and is the key problem that WISE solves.
WISE tackles the following specific challenges. First, WISE
must allow the network designers to easily specify what-if sce-
narios. A designer might specify a what-if scenario to change the
value of some network features relative to their values in an existing
or “baseline” deployment. The designer may not know that such a
change might also affect other features (or how the features are
related). WISE’s interface shields the designers from this complex-
ity. WISE provides a scenario specification language that allows
network designers to succinctly specify hypothetical scenarios for
arbitrary subsets of existing networks and to specify what-if val-
ues for different features. WISE’s specification language is simple:
evaluating a hypothetical deployment of a new proxy server for a
subset of users can be specified in only 2 to 3 lines of code.
Second, because the designer can specify a what-if scenario
without being aware of these dependencies, WISE must automat-
ically produce an accurate dataset that is both representative of
the what-if scenario the designer specifies and consistent with the
underlying dependencies. WISE uses a causal dependency discov-
ery algorithm to discover the dependencies among variables and
a statistical intervention evaluation technique to transform the ob-
served dataset to a representative and consistent dataset. WISE then
uses a non-parametric regression method to estimate the response
time as a piece-wise smooth function for this dataset. We have
used WISE to predict service response times in both controlled set-
tings on the Emulab testbed and for Google’s global CDN for its
Web-search service. Our evaluation shows that WISE’s predictions
of response-time distribution are very accurate, yielding a median
error between 8% and 11% for cross-validation with existing de-
ployments and only 9% maximum cumulative distribution differ-
ence compared to ground-truth response time distribution for what-
if scenarios on a real deployment as well as controlled experiments
on Emulab.
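The accuracy figure above compares a predicted and a ground-truth response-time distribution by the largest gap between their cumulative distributions. The following is a minimal sketch of one plausible reading of that metric, computed on empirical CDFs; the function names and the sample values are illustrative assumptions, not taken from the WISE implementation.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F, where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def max_cdf_difference(predicted, observed):
    """Largest vertical gap between the two empirical CDFs."""
    f_pred, f_obs = ecdf(predicted), ecdf(observed)
    # Both CDFs are step functions, so the gap can only change at a sample point.
    points = set(predicted) | set(observed)
    return max(abs(f_pred(x) - f_obs(x)) for x in points)


# Hypothetical response times in milliseconds (illustrative values only).
predicted_ms = [100, 120, 150, 200, 260]
observed_ms = [110, 120, 160, 210, 250]
gap = max_cdf_difference(predicted_ms, observed_ms)
print(f"maximum CDF difference: {gap:.2f}")
```

A value of 0 means the two empirical distributions coincide at every sample point; a value of 1 means they do not overlap at all.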
Finally, WISE must be fast, so that it can be used for short-term
and frequently arising questions. Because the methods relying on
statistical inference are often computationally intensive, we have
tailored WISE for parallel computation and implemented it using
the Map-Reduce [16] framework, which allows us to process large
datasets comprising hundreds of millions of records quickly and
produce accurate predictions for response-time distributions.
The paper proceeds as follows. Section 2 describes the problem
scope and motivation. Section 3 makes the case for using statistical
learning for the problem of what-if scenario evaluation. Section 4
provides an overview of WISE, and Section 5 describes WISE’s al-
gorithms in detail. We discuss the implementation in Section 6.
In Section 7, we evaluate WISE for response-time estimation for
existing deployments as well as for a what-if scenario based on a
real operational event. In Section 8, we evaluate WISE for what-if
scenarios for a small-scale network built on the Emulab testbed. In
Section 9, we discuss various properties of the WISE system and
how it relates to other areas in networking. We review related work
in Section 10, and conclude in Section 11.
2. PROBLEM CONTEXT AND SCOPE
This section describes common “what-if” questions that network
designers pose when evaluating potential configuration or
deployment changes to an existing content distribution network
deployment.
Content Distribution Networks: Most CDNs conform to a two-
tier architecture. The first tier comprises a set of globally dis-
tributed front-end (FE) servers that, depending on the specific im-
plementation, provide caching, content assembly, pipelining, re-
quest redirection, and proxy functions. The second tier comprises
backend (BE) servers that implement the application logic, and
which might also be replicated and globally distributed. The FE
and BE servers may belong to a single administrative entity (as is
the case with Google [3]) or to different administrative entities, as
with commercial content distribution networking service providers,
such as Akamai [1]. The network path between the FE and BE
servers may be over a public network or a private network, or a
LAN when the two are co-located. CDNs typically use DNS
redirection or URL-rewriting [13] to direct the users to the appropriate
FE and BE servers; this redirection may be based on the user’s
proximity, geography, availability, and relative server load.
Figure 1: Network configuration for customers in India. (a) Before the maintenance. (b) During the maintenance.
An Example “What-if” Scenario: The network designers may
want to ask a variety of what-if questions about the CDN configu-
ration. For example, the network designers may want to determine
the effects of deploying new FE or BE servers, changing the serving
FE or BE servers for a subset of users, changing the size of typical
responses, increasing capacity, or changing network connectivity,
on the service response time. Following is a real what-if scenario
from Google’s CDN for the Web-search service.
Figure 1 shows an example of a change in network deployment
that could affect server response time. Google has an FE data cen-
ter in India that serves users in India and surrounding regions. This
FE data center uses BE servers located elsewhere in the world, in-
cluding the ones located in Taiwan. On July 16, 2007, the FE data
center in India was temporarily “drained” for maintenance reasons,
and the traffic was diverted to an FE data center that is co-located
with the BE in Taiwan, resulting in a change in latency for the users in
India. This change in the network configuration can be described
as a what-if scenario in terms of change of the assigned FE, or more
explicitly as changes in delays between FE and clients that occur
due to the new configuration. WISE aims to predict the response-
time distribution for reconfigurations before they are deployed in
practice.
3. A CASE FOR MACHINE LEARNING
In this section, we present two aspects of what-if scenario evalua-
tion that make the problem well-suited for machine learning: (1) an
underlying model that is difficult to derive from first principles but
provides a wealth of data; (2) a need to predict outcomes based on
data that may not directly represent the desired what-if scenario.
The system is complex, but observable variables are driven by
fundamental properties of the system. Unfortunately, in large
complex distributed systems, such as CDNs, the parameters that
govern the system performance, the relationships between these
variables, as well as the functions that govern the response-time
distribution of the system, are often complex and are character-
ized by randomness and variability that are difficult to model as
simple readily evaluatable formulas. Fortunately, the underlying
fundamental properties and dependencies that determine a CDN’s
response time can be observed as correlations and joint probabil-
ity distributions of the variables that define the system, including
the service response time. By observing these joint distributions
(e.g., response times observed under various conditions), machine
learning algorithms can infer the underlying function that affects
the response time. Because most production CDNs collect compre-
hensive datasets for their services as part of everyday operational
and monitoring needs, the requisite datasets are typically readily
available.
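As a minimal sketch of how a response-time function might be inferred from observed data, the following uses Nadaraya-Watson kernel regression with a Gaussian kernel on a single feature. The choice of estimator, the single round-trip-time feature, the bandwidth, and the sample values are all assumptions made for illustration; they are not a description of the estimator WISE uses.

```python
import math


def kernel_regression(train_x, train_y, query_x, bandwidth):
    """Nadaraya-Watson estimate: a Gaussian-weighted average of train_y."""
    weights = [
        math.exp(-0.5 * ((query_x - x) / bandwidth) ** 2) for x in train_x
    ]
    total = sum(weights)
    # total is positive as long as query_x is within a few bandwidths of the data.
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Hypothetical observations: round-trip time (ms) -> response time (ms).
rtt_ms = [10, 20, 30, 40, 50]
response_ms = [120, 140, 160, 180, 200]

estimate = kernel_regression(rtt_ms, response_ms, query_x=30, bandwidth=10.0)
print(f"estimated response time at 30 ms RTT: {estimate:.1f} ms")
```

The estimate at a query point is dominated by observations whose feature values lie close to it, which is why the estimated function is only reliable where the input data has reasonable density.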
Obtaining datasets that directly represent the what-if scenario
is challenging. Once the response-time function is learned, evalu-
ating a what-if scenario requires providing this function with input
data that is representative of the what-if scenario. Unfortunately,
data collected from an existing network deployment only represents
the current setup, and the system complexities make it difficult for
a designer to manually “transform” the data to represent the new
scenario. Fortunately, depending on the extent of the dataset that is
collected and the nature of what-if scenario, machine learning al-
gorithms can reveal the dependencies among the variables and use
the dependency structure to intelligently re-weigh and re-sample
the different parts of the existing dataset to perform this transfor-
mation. In particular, if the what-if scenario is expressed in terms
of the changes to values of the variables that are observed in the
dataset and the changed values or similar values of these variables
are observed in the dataset even with small densities in the origi-
nal dataset, then we can transform the original dataset to one that
is representative of the what-if scenario as well as the underlying
principles of the system, while requiring minimal input from the
network designer.
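The re-weighting idea can be sketched for a single variable as follows. Each record receives a weight equal to the ratio of the target share of its value under the what-if scenario to the share observed in the dataset. The record fields, the front-end names, and the response times are hypothetical, and this sketch deliberately ignores the dependencies among variables that WISE accounts for through the learned dependency structure.

```python
from collections import Counter


def scenario_weights(records, variable, target_dist):
    """Weight each record by target share / observed share of its value."""
    counts = Counter(r[variable] for r in records)
    n = len(records)
    return [
        target_dist.get(r[variable], 0.0) / (counts[r[variable]] / n)
        for r in records
    ]


def weighted_mean(records, weights, field):
    """Mean of `field` over the re-weighted dataset."""
    return sum(w * r[field] for r, w in zip(records, weights)) / sum(weights)


# Hypothetical trace records (illustrative values only).
records = [
    {"fe": "india", "response_ms": 180},
    {"fe": "india", "response_ms": 200},
    {"fe": "india", "response_ms": 220},
    {"fe": "taiwan", "response_ms": 380},
    {"fe": "taiwan", "response_ms": 420},
]

# What-if: every request is served by the Taiwan front end.
weights = scenario_weights(records, "fe", {"taiwan": 1.0})
predicted_mean = weighted_mean(records, weights, "response_ms")
print(f"re-weighted mean response time: {predicted_mean:.0f} ms")
```

The sketch only works because the changed value ("taiwan") already appears in the dataset, which mirrors the condition stated above; a value that was never observed would receive no records to re-weight.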
4. WISE: HIGH-LEVEL DESIGN
WISE entails four steps: (1) identifying features in the dataset
that affect response time; (2) constraining the inputs to “valid” sce-
narios based on existing dependencies; (3) specifying the what-if
scenario; (4) estimating the response-time function and distribu-
tion. Each of these tasks raises a number of challenges, some of
which are general problems with applying statistical learning in
practice, and others are specific to what-if scenario evaluation. This
section provides an overview and necessary background for these
steps. Section 5 discusses the mechanisms in more depth; the
technical report [24] provides additional details and background.
1. Identifying Relevant Features: The main input to WISE is a
comprehensive dataset that covers many combinations of variables.
Most CDNs have existing network monitoring infrastructure that
can typically provide such a dataset. This dataset, however, may
contain variables that are not relevant to the response-time function.
WISE extracts the set of relevant variables from the dataset and dis-
cards the rest of the variables. WISE can also identify whether there
are missing or latent variables that may hamper scenario evaluation
(Sections 5.1 and 5.2 provide more details).
The nature of what-if scenarios that WISE can evaluate is limited
by the input dataset. Careful choice of the variables that the
monitoring infrastructure collects from a CDN can therefore enhance
the utility of the dataset for evaluating what-if scenarios; choosing
such variables is outside the scope of the WISE system.
Figure 2: Main steps in the WISE approach.
2. Preparing Dataset to Represent the What-if Scenario:
Evaluating a what-if scenario requires values for input variables that
“make sense.” Specifically, an accurate prediction of the response-
time distribution for a what-if scenario requires a joint distribution
of the input variables that is representative of the scenario and is
also consistent with the dependencies that are inherent to the
system itself. For instance, the distribution of the number of packets
that are transmitted in the duration of a service session depends on
the distribution of the size of content that the server returns in reply
to a request; if the distribution of content size changes, then the dis-
tribution for the number of packets that are transmitted must also
change in a way that is inherent to the system, e.g., the path-MTU
might determine the number of packets. Further, the change might
cascade to other variables that in turn depend on the number of
packets. To enforce such consistency, WISE learns the dependency
structure among the variables and represents these relationships as
a Causal Bayesian Network (CBN) [20]. We provide a brief
background of CBNs in this section and explain the algorithm for
learning the CBN in Section 5.2.
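The content-size example can be made concrete with a small worked calculation. The sketch below assumes that each packet carries at most the path MTU minus 40 bytes of IPv4 and TCP headers (no options); the 1500-byte MTU, the header size, and the content sizes are illustrative assumptions.

```python
def packets_for(content_bytes, path_mtu=1500, header_bytes=40):
    """Number of packets needed to carry content_bytes of payload."""
    payload_per_packet = path_mtu - header_bytes
    # Ceiling division, with at least one packet per response.
    return max(1, -(-content_bytes // payload_per_packet))


# Hypothetical content sizes in bytes.
baseline_sizes = [1000, 3000, 15000]
# What-if: every response doubles in size.
whatif_sizes = [2 * size for size in baseline_sizes]

baseline_packets = [packets_for(size) for size in baseline_sizes]
whatif_packets = [packets_for(size) for size in whatif_sizes]
print(baseline_packets, whatif_packets)
```

Doubling the content size does not simply double the packet counts, which illustrates why the dependent variable has to be re-derived in a way that is consistent with the system rather than scaled by hand.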
A CBN represents the variables in the dataset as a Directed
Acyclic Graph (DAG). The nodes represent the variables and the
edges indicate whether there are dependencies among the variables.
A variable has a “causal” relationship with another variable if a
change in the value of the first variable causes a change in the
value of the latter. When conditioned on its parent variables, a variable
xi in a CBN is independent of all other variables in the DAG except
its descendants; an optimal DAG for a dataset is one where we find
the minimal parents for each node that satisfy the above property.
[Figure: example dependency graph over variables x1, x2, x3, x4, x5, and y.]
As an example of how the causal structure may facilitate scenario
specification and evaluation, consider a dataset with five input
variables (x1 . . . x5) and a target variable y. Suppose that we
discover a dependency structure among them as shown in the
accompanying figure. If WISE is presented with a what-if scenario
that requires changes in the value of variable x2, then the
distributions for variables x1 and x5 remain unchanged in the input
distribution, and WISE needs to update only the distribution of the
descendants of x2 to maintain consistency. WISE constrains the input
distribution by intelligently re-sampling and re-weighing the dataset
using the causal structure as a guideline (see Section 5.4).
In general, correlation does not imply causation. Causal in-
terpretation of association or correlation requires that the dataset
is independent of the outcome variable (see the Counterfactual
Model [20, 25]). A biased dataset can result in false or missing
causal assertions; for example, we could falsely infer that a treat-
ment is effective if, by coincidence, the dataset is such that more pa-
tients that are treated are healthy that the ones that are not treated.
We can make the correct inference if we assign the patients ran-
domly to the treatment because then the dataset would be indepen-
dent of the outcome. Fortunately, because many computer network-
ing phenomena are fundamentally similar throughout the Internet,
we can assume that the datasets are unbiased. Still, frivolous rela-
tionships might arise; we address this further in Section 5.2.
3. Facilitating Scenario Specification: WISE presents the net-
work designers with an easy-to-use interface in the form of a
scenario specification language called WISE-Scenario Language
101
ity distributions of the variables that define the system, including
the service response time. By observing these joint distributions
(e.g., response times observed under various conditions), machine
learning algorithms can infer the underlying function that affects
the response time. Because most production CDNs collect compre-
hensive datasets for their services as part of everyday operational
and monitoring needs, the requisite datasets are typically readily
available.
Obtaining datasets that directly represent the what-if scenario
is challenging. Once the response-time function is learned, evalu-
ating a what-if scenario requires providing this function with input
data that is representative of the what-if scenario. Unfortunately,
data collected from an existing network deployment only represents
the current setup, and the system complexities make it difficult for
a designer to manually “transform” the data to represent the new
scenario. Fortunately, depending on the extent of the dataset that is
collected and the nature of what-if scenario, machine learning al-
gorithms can reveal the dependencies among the variables and use
the dependency structure to intelligently re-weigh and re-sample
the different parts of the existing dataset to perform this transfor-
mation. In particular, if the what-if scenario is expressed in terms
of the changes to values of the variables that are observed in the
dataset and the changed values or similar values of these variables
are observed in the dataset even with small densities in the origi-
nal dataset, then we can transform the original dataset to one that
is representative of the what-if scenario as well as the underlying
principles of the system, while requiring minimal input from the
network designer.
4. WISE: HIGH-LEVEL DESIGN
WISE entails four steps: (1) identifying features in the dataset
that affect response time; (2) constraining the inputs to “valid” sce-
narios based on existing dependencies; (3) specifying the what-if
scenario; (4) estimating the response-time function and distribu-
tion. Each of these tasks raises a number of challenges, some of
which are general problems with applying statistical learning in
practice, and others are specific to what-if scenario evaluation. This
section provides an overview and necessary background for these
steps. Section 5 discusses the mechanisms in more depth; the technical
report [24] provides additional details and background.
1. Identifying Relevant Features: The main input to WISE is a
comprehensive dataset that covers many combinations of variables.
Most CDNs have existing network monitoring infrastructure that
can typically provide such a dataset. This dataset, however, may
contain variables that are not relevant to the response-time function.
WISE extracts the set of relevant variables from the dataset and dis-
cards the rest of the variables. WISE can also identify whether there
are missing or latent variables that may hamper scenario evaluation
(Sections 5.1 and 5.2 provide more details).
The nature of what-if scenarios that WISE can evaluate is limited
by the input dataset. Careful choice of the variables that the
monitoring infrastructure collects from a CDN can therefore enhance the
utility of the dataset for evaluating what-if scenarios; choosing such
variables is outside the scope of the WISE system.
2. Preparing Dataset to Represent the What-if Scenario: Evaluating
a what-if scenario requires values for input variables that
“make sense.” Specifically, an accurate prediction of the response-time
distribution for a what-if scenario requires a joint distribution
of the input variables that is representative of the scenario and is
also consistent with the dependencies that are inherent to the system
itself. For instance, the distribution of the number of packets
that are transmitted in the duration of a service session depends on
the distribution of the size of content that the server returns in reply
to a request; if the distribution of content size changes, then the
distribution for the number of packets that are transmitted must also
change in a way that is inherent to the system, e.g., the path-MTU
might determine the number of packets. Further, the change might
cascade to other variables that in turn depend on the number of
packets. To enforce such consistency, WISE learns the dependency
structure among the variables and represents these relationships as
a Causal Bayesian Network (CBN) [20]. We provide a brief background
of CBN in this section and explain the algorithm for learning
the CBN in Section 5.2.
Figure 2: Main steps in the WISE approach.
A CBN represents the variables in the dataset as a Directed
Acyclic Graph (DAG). The nodes represent the variables and the
edges indicate whether there are dependencies among the variables.
A variable has a “causal” relationship with another variable if a
change in the value of the first variable causes a change in the values
of the latter. When conditioned on its parent variables, a variable
xi in a CBN is independent of all other variables in the DAG except
its descendants; an optimal DAG for a dataset is one where we find
the minimal parents for each node that satisfy the above property.
[Figure: example dependency structure (a DAG) over the input
variables x1 . . . x5 and the target variable y.]
As an example of how the causal structure may facilitate scenario
specification and evaluation, consider a dataset with five input
variables (x1 . . . x5), and target variable y. Suppose that we
discover a dependency structure among them as shown in the
accompanying figure. If WISE is presented with a what-if scenario that
requires changes in the value of variable x2, then the distributions for
variables x1 and x5 remain unchanged in the input distribution, and
WISE needs to update only the distribution of the descendants of
x2 to maintain consistency. WISE constrains the input distribution
by intelligently re-sampling and re-weighing the dataset using the
causal structure as a guideline (see Section 5.4).
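The bookkeeping in this example can be sketched in a few lines of
Python. The edge set below is an assumption made purely for
illustration (it is not taken from the figure); it is chosen so that
x1 and x5 are not descendants of x2, as the text requires.

```python
from collections import deque

# Hypothetical edges (parent -> children); NOT the edges of the paper's figure.
EXAMPLE_DAG = {
    "x1": ["x2"],
    "x2": ["x3", "x4"],
    "x3": ["y"],
    "x4": ["y"],
    "x5": ["y"],
    "y": [],
}


def descendants(dag, node):
    """Return every variable reachable from `node` along directed edges."""
    seen = set()
    queue = deque(dag[node])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(dag[current])
    return seen


# Variables whose distributions must be refreshed after changing x2,
# and variables that can be left as they are.
to_update = descendants(EXAMPLE_DAG, "x2")
unchanged = set(EXAMPLE_DAG) - to_update - {"x2"}
print(sorted(to_update))  # ['x3', 'x4', 'y']
print(sorted(unchanged))  # ['x1', 'x5']
```

Under this assumed structure only x3, x4, and y need new
distributions; x1 and x5 keep the distributions observed in the
baseline dataset.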
In general, correlation does not imply causation. Causal in-
terpretation of association or correlation requires that the dataset
is independent of the outcome variable (see the Counterfactual
Model [20, 25]). A biased dataset can result in false or missing
causal assertions; for example, we could falsely infer that a treatment
is effective if, by coincidence, the dataset is such that more patients
that are treated are healthy than the ones that are not treated.
We can make the correct inference if we assign the patients ran-
domly to the treatment because then the dataset would be indepen-
dent of the outcome. Fortunately, because many computer network-
ing phenomena are fundamentally similar throughout the Internet,
we can assume that the datasets are unbiased. Still, spurious
relationships might arise; we address this further in Section 5.2.
3. Facilitating Scenario Specification: WISE presents the net-
work designers with an easy-to-use interface in the form of a
scenario specification language called WISE-Scenario Language
(WSL). The designers can typically specify the baseline setup as
well as the hypothetical values for the scenario in 3-4 lines of WSL.
WSL allows the designers to evaluate a scenario for an arbitrary
subset of customers. WSL also provides a useful set of built-in
operators that facilitate scenario specification as relative changes to
the existing values of variables or as new values from scratch. With
WSL, the designers are completely shielded from the complexity
of dependencies among the variables, because WISE automatically
updates the dependent variables. We detail WSL and the process of
scenario specification and evaluation in Sections 5.3 and 5.4.
4. Estimating Response-Time Distribution: Datasets for typi-
cal CDN deployments and what-if scenarios span a large multi-
dimensional space. While non-parametric function estimation is a
standard application in the machine learning literature, the compu-
tational requirements for accurately estimating a function spanning
such a large space can be astronomical. To address this, WISE esti-
mates the function in a piece-wise manner, and also structures the
processing so that it is amenable to parallel processing. WISE also
uses the dependency structure to reduce the number of variables
that form the input to the regression function. Sections 5.5 and 5.6
provide more detail.
5. WISE SYSTEM
5.1 Feature Selection
Traditional machine-learning applications use various model selection
criteria, e.g., Akaike Information Criterion (AIC), Mallows’
Cp statistic, or k-fold cross-validation [25], for determining an
appropriate subset of covariates for a learning problem. WISE forgoes the
traditional model selection techniques in favor of simple pair-wise
independence testing, because at times these techniques can ignore
variables that might have interpretive value for the designer.
WISE uses simple pair-wise independence tests on all the vari-
ables in the dataset with the response-time variable, and discards all
variables that it deems independent of the response-time variable.
For each categorical variable (a variable that does not have a numeric
meaning) in the dataset, such as the country of origin of a request or
the AS number, WISE obtains the conditional distributions of response
time for each categorical value, and discards the variable if all the
conditional distributions of response time are statistically similar.
To test this, we use the two-sample Kolmogorov-Smirnov (KS)
goodness-of-fit test with a significance level of 10%.
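A minimal sketch of this categorical test follows, in pure Python.
It is an illustration and not the WISE implementation. The pairwise
comparison of categories, the large-sample p-value approximation, and
the function names and sample data are all assumptions.

```python
import math
from itertools import combinations


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series below is numerically unreliable this close to zero
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))


def is_relevant_categorical(response_by_category, alpha=0.10):
    """Keep the variable if any two per-category response-time samples differ."""
    samples = sorted(response_by_category.items())
    for (_, a), (_, b) in combinations(samples, 2):
        if ks_pvalue(ks_statistic(a, b), len(a), len(b)) < alpha:
            return True
    return False


# Made-up response times (seconds) keyed by a hypothetical "country" variable.
fast = [0.10 + 0.001 * i for i in range(200)]
slow = [0.50 + 0.001 * i for i in range(200)]
print(is_relevant_categorical({"US": fast, "IN": slow}))        # True
print(is_relevant_categorical({"US": fast, "DE": list(fast)}))  # False
```

A variable is retained as soon as one pair of categories yields
distinguishable response-time distributions at the 10% level; it is
discarded only when every pair looks statistically similar.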
For real-valued variables, WISE first tests for correlation with
the response-time variable, and retains a variable if the correlation
coefficient is greater than 10%. Unfortunately, for continuous vari-
ables, lack of correlation does not imply independence, so we can-
not outright discard a variable if we observe small correlation. A
typical example of such a variable in a dataset is the timestamp of
the Web transaction, where the correlation may cancel out over a
diurnal cycle. For such cases, we divide the range of the variable
in question into small buckets and treat each bucket as a category.
We then apply the same techniques as we do for the categorical
variables to determine whether the variable is independent. There
is still a possibility that we may discard a variable that is relevant,
but this outcome is less likely if sufficiently small buckets are used.
The bucket size depends on the variable in question; for instance,
we use one-hour buckets for the time-stamp variable in the datasets.
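The timestamp case can be illustrated with synthetic data. The sketch
below uses a made-up diurnal response-time pattern, which is an
assumption for illustration and not measured data. It shows that the
linear correlation with the timestamp is close to zero even though
the per-hour buckets have clearly different response times. In WISE
the buckets would then be handed to the categorical test described
above; here we only compare bucket means.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def bucketize(values, responses, width):
    """Group responses by the bucket index of the paired value."""
    buckets = defaultdict(list)
    for v, r in zip(values, responses):
        buckets[int(v // width)].append(r)
    return dict(buckets)


# Synthetic data: two days of per-minute samples with a 24-hour cycle.
DAY = 86400
timestamps = [60 * minute for minute in range(2 * 24 * 60)]
responses = [1.0 + 0.5 * math.cos(2 * math.pi * t / DAY) for t in timestamps]

r = pearson(timestamps, responses)
by_hour = bucketize([t % DAY for t in timestamps], responses, 3600)
hourly_means = {h: sum(v) / len(v) for h, v in by_hour.items()}

print(f"correlation with timestamp: {r:.4f}")
print(f"spread of hourly means: "
      f"{max(hourly_means.values()) - min(hourly_means.values()):.2f}")
```

The correlation test alone would wrongly suggest dropping the
timestamp, whereas the one-hour buckets expose the dependence.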
5.2 Learning the Causal Structure
To learn the causal structure, WISE first learns the undirected
graph and then uses a set of rules to orient the edges.
Learning the Undirected Graph: Recall that in a Causal Bayesian
Network (CBN), a variable, when conditioned on its parents, is
independent of all other variables, except its descendants. Further,
an optimal CBN requires finding the smallest possible set of parents
for each node that satisfies this condition. Thus, by definition,
variables a and b in the CBN have no edge between them if and
only if there is a subset of separating variables, Sab, such that a
is independent of b given Sab. Finding such a subset, in the general
case, requires searching all the possible O(2^n) combinations of the
n variables in the dataset.
WCD (V, W0, ∆)
  /* Notation
     V: set of all variables
     W0: set of no-cause variables
     ∆: maximum allowable cardinality for separators
     a ⊥ b: variable a is independent of variable b */
  Make a complete graph on V
  Remove all edges (a, b) if a ⊥ b
  W = W0
  for c = 1 to ∆   /* prune in the order of increasing cardinality */
    LocalPrune (c)

LocalPrune (c)
  /* Try to separate neighbors of frontier variables W */
  ∀w ∈ W
    ∀z ∈ N(w)   /* neighbors of w */
      if ∃x ⊆ N(z)\w : |x| ≤ c, z ⊥ w | x
      then   /* found separator node(s) */
        Swz = x   /* assign the separating nodes */
        Remove the edge (w, z)
        Remove edges (w′, z), for all the nodes w′ ∈ W
          that are also on path from w to nodes in W0
        /* Update the new frontier variables */
        W = W ∪ x
Figure 3: WISE Causal Discovery (WCD) algorithm.
WISE-Causal Discovery Algorithm (WCD) (Figure 3) uses a
heuristic to guide the search of separating variables when we have
prior knowledge of a subset of variables that are “not caused” by
any other variables in the dataset, or that are determined by fac-
tors outside our system model (we refer to these variables as the
no-cause variables). Further, WCD does not perform exhaustive
search for separating variables, thus forgoing optimality for lower
complexity.
WCD starts with a fully connected undirected graph on the vari-
ables and removes the edges among variables that are clearly inde-
pendent. WCD then progressively finds separating nodes between
a restricted set of variables (that we call frontier variables), and
the rest of the variables in the dataset, in the order of increasing
cardinality of allowable separating variables. Initially the frontier
variables comprise only the no-cause variables. As WCD discovers
separating variables, it adds them to the set of frontier variables.
The algorithm terminates when it has explored separation sets up
to the maximum allowed cardinality ∆ ≤ n, resulting in a worst-case
complexity of O(2^∆). This termination condition means that
certain variables that are separable are not separated: this does not
result in false dependencies, but transitive dependencies
may be considered direct dependencies. This sub-optimality does
not affect the accuracy of the scenario datasets that WISE prepares,
but it reduces the efficiency because it leaves the graph denser
and the nodes with larger in-degree.
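The frontier-guided pruning of Figure 3 can be sketched in Python as
follows. This is a simplified illustration and not the authors'
implementation. The conditional-independence test is passed in as a
caller-supplied function `indep` (hypothetical), separator sets are
tried in increasing size up to c, and the additional edge removals
for other frontier nodes on the path back to W0 are omitted.

```python
from itertools import combinations


def wcd_skeleton(variables, no_cause, max_card, indep):
    """Simplified frontier-guided skeleton search in the spirit of Figure 3.

    indep(a, b, cond) must return True when a is independent of b given
    the frozenset `cond`; it is supplied by the caller.
    """
    adj = {v: set(variables) - {v} for v in variables}  # complete graph
    separators = {}

    def cut(a, b, sep):
        adj[a].discard(b)
        adj[b].discard(a)
        separators[frozenset((a, b))] = frozenset(sep)

    # Remove edges between marginally independent pairs.
    for a, b in combinations(sorted(variables), 2):
        if indep(a, b, frozenset()):
            cut(a, b, ())

    frontier = set(no_cause)
    for c in range(1, max_card + 1):  # increasing separator cardinality
        pending = sorted(frontier)
        while pending:
            w = pending.pop(0)
            for z in sorted(adj[w]):
                candidates = sorted(adj[z] - {w})
                sep = next(
                    (x for size in range(1, c + 1)
                     for x in combinations(candidates, size)
                     if indep(z, w, frozenset(x))),
                    None,
                )
                if sep is not None:
                    cut(w, z, sep)
                    for x in sep:  # separators join the frontier
                        if x not in frontier:
                            frontier.add(x)
                            pending.append(x)

    edges = {frozenset((a, b)) for a in adj for b in adj[a]}
    return edges, separators


# Toy oracle for a chain A -> B -> C: the only independence is A ⊥ C | B.
def chain_indep(a, b, cond):
    return {a, b} == {"A", "C"} and "B" in cond


edges, seps = wcd_skeleton(["A", "B", "C"], ["A"], 1, chain_indep)
print(sorted(sorted(e) for e in edges))  # [['A', 'B'], ['B', 'C']]
```

In the toy run, the search starts from the no-cause variable A,
separates A from C using B, and then adds B to the frontier.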
In the cases where the set of no-cause variables is unknown,
WISE relies on the PC-algorithm [23], which also performs search
for separating nodes in the order of increasing cardinality among
all pairs of variables, but without using the frontier variables.
Orienting the Edges: WISE orients the edges and attempts to de-
tect latent variables using the following simple rules, well known
in the literature; we reproduce the rules here for convenience and
refer the reader to [20] for further details.
1. Add outgoing edges from the no-cause variables.
2. If node c has nonadjacent neighbors a and b, and c ∉ Sab,
then orient edges a → c ← b (unmarked edges).
3. For all nonadjacent nodes a, b with a common neighbor c, if
there is an edge from a to c, but not from b to c, then add a
marked edge c ∗→ b.
4. If a and b are adjacent and there is a directed path of only
marked edges from a to b, then add a → b.
In the resulting graph, any unmarked, bi-directed, or undirected
edges signify possible latent variables and ambiguity in causal
structure. In particular, an unmarked edge a → b means either a
really causes b or there is a common latent cause L causing both a
and b. Similarly, a ↔ b signifies a definite common latent cause,
and an undirected edge between a and b implies either a causes b,
b causes a, or a common latent cause in the underlying model.
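As an illustration of the collider rule (rule 2), the sketch below
orients a → c ← b for nonadjacent a and b that share a neighbor c.
It assumes the standard convention of constraint-based discovery that
a collider is inferred when c is not in the separating set Sab. The
data structures and names are hypothetical, and the marked-edge rules
3 and 4 are not covered.

```python
from itertools import combinations


def orient_colliders(adj, separators):
    """Return directed edges (tail, head) implied by the collider rule.

    adj: dict mapping each node to the set of its neighbors (skeleton).
    separators: dict mapping frozenset({a, b}) to the separating set
                recorded for the nonadjacent pair a, b.
    """
    directed = set()
    for c in adj:
        for a, b in combinations(sorted(adj[c]), 2):
            if b in adj[a]:
                continue  # rule applies only to nonadjacent a and b
            sep = separators.get(frozenset((a, b)), frozenset())
            if c not in sep:
                directed.add((a, c))
                directed.add((b, c))
    return directed


# a and b are marginally independent (empty separator) and both touch c.
collider = orient_colliders(
    {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}},
    {frozenset(("a", "b")): frozenset()},
)
print(sorted(collider))  # [('a', 'c'), ('b', 'c')]

# In a chain A - B - C separated by B, no collider is inferred at B.
chain = orient_colliders(
    {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}},
    {frozenset(("A", "C")): frozenset({"B"})},
)
print(chain)  # set()
```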
Addressing False Causal Relationships: False or missing causal
relationships can occur if the population in the dataset is not in-
dependent of the outcome variables. Unfortunately, because WISE
relies on passive datasets this is a fundamental limitation that can-
not be avoided. However, we expect that because the basic prin-
ciples of computer networks are similar across the Internet, and
the service providers use essentially the same versions of software
throughout their networks, the bias in the dataset that would sig-
nificantly affect the causal interpretation is not common. If such
biases do exist, they will likely be among datasets from different
geographical deployment regions. To catch such biases, we rec-
ommend using a training dataset with WISE that is obtained from
different geographical locations. We can infer causal structure for
each geographical region separately; if the learned structures differ,
the differences must be carefully examined in light of the
knowledge of the system's internal workings.
Lastly, while WISE depends on the CBN for preparing the scenario
dataset, it is not necessary that the CBN be learned automatically
from the dataset; the CBN can be supplied, entirely or in part,
by a designer who is well versed in the system.
5.3 Specifying the “What-If” Scenarios
Figure 4 shows the grammar for WISE-Specification Language
(WSL). A scenario specification with WSL comprises a use-
statement, followed by optional scenario update-statements.
The use-statement specifies a condition that describes the subset
of the present network for which the designer is interested in
evaluating the scenario. This statement provides a powerful interface to
the designer for choosing the baseline scenario: depending on the
features available in the dataset, the designer can specify a subset of
the network based on the location of clients (such as country, network
address, or AS number), the location of servers, properties of service
sessions, or a combination of these attributes.
The update-statements allow the designer to specify what-if values
for various variables for the service session properties. Each
update-statement begins with either the INTERVENE or the ASSUME
keyword and allows conditional modification of exactly one
variable in the dataset.
When the statement begins with the INTERVENE keyword,
WISE first updates the value of the variable in question. WISE then
uses the causal dependency structure to make the dataset consistent
with the underlying dependencies. For this, WISE uses a process
called Statistical Intervention Effect Evaluation (Section 5.4).
scenario = use_stmt {update_stmt};
use_stmt = "USE" ("*" | condition_stmt)<EOL>;
update_stmt = ("ASSUME"|"INTERVENE") (set_directive |
              setdist_directive) [condition_stmt]<EOL>;
set_directive = "SET" ["RADIAL"* | "FIXED"]
                var set_op value;
setdist_directive = "SETDIST" feature
                    (dist_name([param]) | "FILE" filename);
condition_clause = "WHERE" condition;
condition = simple_cond | compound_cond;
simple_cond = compare_clause | (simple_cond);
compound_cond = (simple_cond ("AND"|"OR")
                (simple_cond|compound_cond));
compare_clause = (var rel_op value) | member_test;
member_test = feature "IN" (value {,value});
set_op = "+=" | "-=" | "*=" | "\=" | "=";
rel_op = "<=" | ">=" | "<>" | "==" | "<" | ">";
var = a variable from the dataset;
Figure 4: Grammar for WISE Specification Language (WSL).
Advanced designers can override the intelligent update behavior
by using the ASSUME keyword in the update statement. In this
case, WISE updates the distribution of the variable specified in the
statement but does not attempt to ensure that the distributions of the
dependent variables are correspondingly updated. WISE allows this
functionality for cases where the designers believe that the scenario
that they wish to evaluate involves changes to the underlying invari-
ant laws that govern the system. Examples of scenario specification
with WSL will follow in Section 7.
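To make the shape of an update-statement concrete, the sketch below
recognizes a small fragment of the grammar in Figure 4: a SET
directive with a numeric value and an optional WHERE clause. It is an
illustration only and not the WISE parser. The variable names `rtt`
and `region` are hypothetical, the condition is captured as text
without being evaluated, and only the "+=", "-=", "*=", and "="
operators are handled.

```python
import operator
import re

# Relative changes to existing values, or a new value from scratch.
SET_OPS = {
    "+=": operator.add,
    "-=": operator.sub,
    "*=": operator.mul,
    "=": lambda old, new: new,
}

UPDATE_RE = re.compile(
    r"^(?P<mode>ASSUME|INTERVENE)\s+SET\s+(?:(?P<kind>RADIAL|FIXED)\s+)?"
    r"(?P<var>\w+)\s*(?P<op>\+=|-=|\*=|=)\s*(?P<value>[-+]?\d+(?:\.\d+)?)"
    r"(?:\s+WHERE\s+(?P<cond>.+))?$"
)


def parse_update(stmt):
    """Split one update-statement into its parts; raise if unsupported."""
    match = UPDATE_RE.match(stmt.strip())
    if match is None:
        raise ValueError(f"not a supported update statement: {stmt!r}")
    fields = match.groupdict()
    fields["value"] = float(fields["value"])
    return fields


def apply_set(fields, old_value):
    """Apply the parsed set operator to an existing value."""
    return SET_OPS[fields["op"]](old_value, fields["value"])


# Hypothetical statement: halve a variable named rtt for one region.
stmt = parse_update("INTERVENE SET FIXED rtt *= 0.5 WHERE region == 3")
print(stmt["mode"], stmt["var"], stmt["op"], stmt["cond"])
print(apply_set(stmt, 80.0))  # 40.0
```

The parsed `mode` field is what would decide whether the dependent
variables are subsequently updated (INTERVENE) or left untouched
(ASSUME).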
5.4 Preparing Representative Distribution for
the “What-If” Scenarios
This section describes how WISE uses the dataset, the causal
structure, and the scenario specification from the designer to pre-
pare a meaningful dataset for the what-if scenario.
WISE first filters the global dataset for the entries that match the
conditions specified in the use-statement of the scenario specifica-
tion to create the baseline dataset. WISE then executes the update-
statements, one statement at a time, to change the baseline dataset.
To ensure consistency among variables after every INTERVENE
update statement, WISE employs a process called Statistical Inter-
vention Effect Evaluation; the process is described below.
Let us denote the action requested on a variable xi in the update-
statement as set(xi). We refer to xi as the intervened variable. Let
us also denote the set of variables that are children of xi in the CBN
for the dataset as C(xi). Then the statistical intervention effect
evaluation process states that the new distribution of the children of
xi is given as Pr{C(xi) | set(xi)}. The intuition is that because the
parent node in a CBN has a causal effect on its descendant nodes, we
expect that a change in the value of the parent variable must cause
a change in the value of the children. Further, the new distribution
of the children variables would be one that we would expect to observe
under the changed values of the parent variable.
To apply this process, WISE conditions the global dataset on the
new value of the intervened variable, set(xi), and the existing values
of all the other parents of the children of the intervened
variable, P(C(xi)), in the baseline dataset to obtain an empirical
distribution. WISE then assigns the children a random value from
this distribution. WISE thus obtains a subset of the global dataset in
which the distribution of C(xi) is consistent with the action set(xi)
as well as the underlying dependencies.
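One level of this process can be sketched as follows. This is a
simplified illustration under stated assumptions and not the WISE
implementation: rows are plain dictionaries, conditioning uses exact
value matches rather than similar values, each child is resampled
separately, a child is left unchanged when no matching row exists,
and the variable names (`size`, `region`, `npkts`) are hypothetical.

```python
import random


def intervene(baseline, global_rows, var, new_value, children, parents, rng):
    """Set `var` in each baseline row, then resample its children.

    children: child variables of `var` in the CBN.
    parents:  dict mapping each child to the list of all its parents.
    """
    updated = []
    for row in baseline:
        new_row = dict(row)
        new_row[var] = new_value
        for child in children:
            # Values of the child in global rows that agree with the new
            # value of `var` and with this row's other parent values.
            pool = [g[child] for g in global_rows
                    if all(g[p] == new_row[p] for p in parents[child])]
            if pool:
                new_row[child] = rng.choice(pool)
        updated.append(new_row)
    return updated


# Made-up global dataset; assumed CBN: size -> npkts <- region.
global_rows = [
    {"size": 10, "region": "A", "npkts": 7},
    {"size": 10, "region": "A", "npkts": 8},
    {"size": 20, "region": "A", "npkts": 14},
    {"size": 20, "region": "B", "npkts": 15},
]
baseline = [r for r in global_rows if r["size"] == 10]

scenario = intervene(
    baseline, global_rows, "size", 20,
    children=["npkts"], parents={"npkts": ["size", "region"]},
    rng=random.Random(0),
)
print(scenario)
```

Both baseline rows are in region A, so after the intervention their
packet counts are drawn from the only global row with size 20 in
region A.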
Because the causal effect cascades to all the descendants of xi,
WISE repeats this process recursively, considering C(xi) as the in-
103
all pair of variables, but not using the frontier variables.
Orienting the Edges: WISE orients the edges and attempts to de-
tect latent variables using the following simple rules, well known
in the literature; we reproduce the rules here for convenience and
refer the reader to [20] for further details.
1. Add outgoing edges from the no-cause variables.
2. If node c has nonadjacent neighbors a and b, and c ∈ Sab,
then orient edges a→ c← b (unmarked edges).
3. For all nonadjacent nodes, a, b with a common neighbor c, if
there is an edge from a to c, but not from b to c, then add a
marked edge c ∗→ b.
4. If a and b are adjacent and there is directed path of only
marked edges from a to b, then add a→ b
In the resulting graph, any unmarked, bi-directed, or undirected
edges signify possible latent variables and ambiguity in causal
structure. In particular, a → b means either a really causes b or
there is a common latent cause L causing both a and b. Similarly,
a ↔ b, signifies a definite common latent cause, and undirected
edge between a and b implies either a causes b, b causes a, or a
common latent cause in the underlying model.
Addressing False Causal Relationships: False or missing causal
relationships can occur if the population in the dataset is not in-
dependent of the outcome variables. Unfortunately, because WISE
relies on passive datasets this is a fundamental limitation that can-
not be avoided. However, we expect that because the basic prin-
ciples of computer networks are similar across the Internet, and
the service providers use essentially the same versions of software
throughout their networks, the bias in the dataset that would sig-
nificantly affect the causal interpretation is not common. If such
biases do exist, they will likely be among datasets from different
geographical deployment regions. To catch such biases, we rec-
ommend using a training dataset with WISE that is obtained from
different geographical locations. We can infer causal structure for
each geographical region separately; if the learned structure is dif-
ferent, the differences must be carefully examined in light of the
knowledge of systems internal working.
Lastly, while WISE depends on the CBN for preparing the sce-
nario dataset, it is not necessary that the CBN is learned automati-
cally from the dataset; the CBN can be supplied, entirely, or in part
by a designer who is well-versed with the system.
5.3 Specifying the “What-If” Scenarios
Figure 4 shows the grammar for WISE-Specification Language
(WSL). A scenario specification with WSL comprises a use-
statement, followed by optional scenario update-statements.
The use-statement specifies a condition that describes the subset
of the present network for which the designer is interested in
evaluating the scenario. This statement provides a powerful interface
to the designer for choosing the baseline scenario: depending on the
features available in the dataset, the designer can specify a subset of
the network based on the location of clients (such as country, network
address, or AS number), the location of servers, properties of service
sessions, or a combination of these attributes.
The update-statements allow the designer to specify what-if values
for variables that describe the service session properties. Each
update-statement begins with either the INTERVENE or the ASSUME
keyword and allows conditional modification of exactly one
variable in the dataset.
When the statement begins with the INTERVENE keyword,
WISE first updates the value of the variable in question. WISE then
uses the causal dependency structure to make the dataset consistent
with the underlying dependencies. For this, WISE uses a process
called Statistical Intervention Effect Evaluation (Section 5.4).
Advanced designers can override the intelligent update behavior
by using the ASSUME keyword in the update statement. In this
case, WISE updates the distribution of the variable specified in the
statement but does not attempt to ensure that the distributions of the
dependent variables are correspondingly updated. WISE allows this
functionality for cases where the designers believe that the scenario
that they wish to evaluate involves changes to the underlying
invariant laws that govern the system. Examples of scenario
specification with WSL follow in Section 7.
scenario = use_stmt {update_stmt};
use_stmt = "USE" ("*" | condition_stmt)<EOL>;
update_stmt = ("ASSUME"|"INTERVENE") (set_directive |
              setdist_directive) [condition_stmt]<EOL>;
set_directive = "SET" ["RADIAL"* | "FIXED"] var set_op value;
setdist_directive = "SETDIST" feature
                    (dist_name([param]) | "FILE" filename);
condition_clause = "WHERE" condition;
condition = simple_cond | compound_cond;
simple_cond = compare_clause | (simple_cond);
compound_cond = (simple_cond ("AND"|"OR")
                (simple_cond|compound_cond));
compare_clause = (var rel_op value) | member_test;
member_test = feature "IN" (value {,value});
set_op = "+=" | "-=" | "*=" | "\=" | "=";
rel_op = "<=" | ">=" | "<>" | "==" | "<" | ">";
var = a variable from the dataset;
Figure 4: Grammar for WISE Specification Language (WSL).
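The effect of a use-statement followed by one conditional
update-statement can be sketched in Python as follows. This is only an
illustration of the semantics described above, not the WISE
implementation: the list-of-dictionaries dataset, the field names
(region, febe_rtt), and the helper names use and update are assumptions
made for the example. The sketch performs only the value update itself,
which corresponds to the ASSUME behavior; it omits the consistency step
that INTERVENE adds. The "\=" operator from the grammar is left out
because its meaning is not stated in this section.

```python
import operator

# Set operators from the WSL grammar, mapped to Python functions.
SET_OPS = {
    "+=": operator.add,
    "-=": operator.sub,
    "*=": operator.mul,
    "=": lambda old, new: new,
}

def use(dataset, condition=None):
    """Return the baseline dataset: copies of rows matching the condition."""
    if condition is None:  # corresponds to USE *
        return [dict(row) for row in dataset]
    return [dict(row) for row in dataset if condition(row)]

def update(dataset, var, set_op, value, condition=None):
    """Apply `var set_op value` to rows satisfying the optional condition."""
    apply_op = SET_OPS[set_op]
    out = []
    for row in dataset:
        row = dict(row)  # never modify the input dataset in place
        if condition is None or condition(row):
            row[var] = apply_op(row[var], value)
        out.append(row)
    return out

# Toy dataset with hypothetical fields.
global_dataset = [
    {"region": "eu", "febe_rtt": 40.0},
    {"region": "eu", "febe_rtt": 10.0},
    {"region": "as", "febe_rtt": 80.0},
]

baseline = use(global_dataset, lambda r: r["region"] == "eu")
scenario = update(baseline, "febe_rtt", "*=", 0.5,
                  lambda r: r["febe_rtt"] > 20)
```

Representing conditions as Python predicates sidesteps parsing; a full
WSL front-end would instead build them from the condition grammar in
Figure 4.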
5.4 Preparing Representative Distribution for
the “What-If” Scenarios
This section describes how WISE uses the dataset, the causal
structure, and the scenario specification from the designer to pre-
pare a meaningful dataset for the what-if scenario.
WISE first filters the global dataset for the entries that match the
conditions specified in the use-statement of the scenario specifica-
tion to create the baseline dataset. WISE then executes the update-
statements, one statement at a time, to change the baseline dataset.
To ensure consistency among variables after every INTERVENE
update statement, WISE employs a process called Statistical Inter-
vention Effect Evaluation; the process is described below.
Let us denote the action requested on a variable x_i in the
update-statement as set(x_i). We refer to x_i as the intervened
variable. Let us also denote the set of variables that are children of
x_i in the CBN for the dataset as C(x_i). Then the statistical
intervention effect evaluation process states that the new distribution
of the children of x_i is given as Pr{C(x_i) | set(x_i)}. The intuition
is that because the parent node in a CBN has a causal effect on its
descendant nodes, we expect that a change in the value of the parent
variable must cause a change in the value of the children. Further, the
new distribution of the children variables would be one that we would
expect to observe under the changed values of the parent variable.
To apply this process, WISE conditions the global dataset on the
new value of the intervened variable, set(x_i), and the existing
values of all the other parents of the children of the intervened
variable, P(C(x_i)), in the baseline dataset to obtain an empirical
distribution. WISE then assigns the children a random value from
this distribution. WISE thus obtains a subset of the global dataset in
which the distribution of C(x_i) is consistent with the action set(x_i)
as well as the underlying dependencies.
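One step of this process, for a single child variable, might look like
the following Python sketch. It assumes a small discrete-valued dataset
held as a list of dictionaries and conditions by exact match on the
parent values; the function name resample_child and the fields fe,
region, and rtt are hypothetical. With continuous variables, the
conditioning would need some form of binning, such as the tiles of
Section 5.6.

```python
import random

def resample_child(global_data, baseline, child, parents, rng):
    """Draw a new value of `child` for each baseline row from the empirical
    distribution Pr{child | parents} estimated on the global dataset.

    `parents` lists all parents of `child` in the CBN, including the
    intervened variable, whose new value is already stored in `baseline`.
    """
    updated = []
    for row in baseline:
        key = tuple(row[p] for p in parents)
        # Condition the global dataset on this row's parent values.
        matches = [g[child] for g in global_data
                   if tuple(g[p] for p in parents) == key]
        new_row = dict(row)
        if matches:  # keep the old value when there is no supporting data
            new_row[child] = rng.choice(matches)
        updated.append(new_row)
    return updated

# Hypothetical toy data: rtt depends on fe (intervened) and region.
global_data = [
    {"fe": "A", "region": "eu", "rtt": 10},
    {"fe": "A", "region": "eu", "rtt": 12},
    {"fe": "B", "region": "eu", "rtt": 50},
    {"fe": "B", "region": "eu", "rtt": 55},
]
# Baseline rows after set(fe) = "A" was applied to clients formerly on "B".
baseline = [
    {"fe": "A", "region": "eu", "rtt": 50},
    {"fe": "A", "region": "eu", "rtt": 55},
]

result = resample_child(global_data, baseline, "rtt", ["fe", "region"],
                        random.Random(0))
```

After the call, every rtt value in result is one that was actually
observed for clients of front end "A" in the same region, which is the
consistency property described above.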
Because the causal effect cascades to all the descendants of x_i,
WISE repeats this process recursively, considering C(x_i) as the
intervened variables and updating the distributions of C(C(x_i)), and
so on, until all the descendants of x_i (except the target variable) are
updated. WISE cannot update the distribution of a descendant of x_i
until the distributions of all of its ancestors that are descendants of
x_i have been updated. WISE thus carefully orders the sequence of
the updates by traversing the CBN DAG breadth-first, beginning at
node x_i.
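The ordering constraint can be illustrated with the following Python
sketch. It collects the descendants of the intervened node
breadth-first and then orders them with Kahn's topological sort over
the descendant subgraph. Using a topological sort is a choice made for
this example, since it directly guarantees that a node is updated only
after all of its relevant ancestors; the section does not describe how
WISE implements the traversal in detail. The adjacency-list
representation and the node names are hypothetical.

```python
from collections import deque

def update_order(children, start):
    """Return the descendants of `start` so that each node appears after
    all of its ancestors that are themselves descendants of `start`."""
    # Collect all descendants of `start` with a breadth-first search.
    descendants, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in descendants:
                descendants.add(child)
                queue.append(child)

    # For each descendant, count incoming edges from other descendants.
    indegree = {d: 0 for d in descendants}
    for node in descendants:
        for child in children.get(node, []):
            indegree[child] += 1

    # Kahn's algorithm: emit a node once all its counted parents are done.
    ready = deque(sorted(d for d in descendants if indegree[d] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children.get(node, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order

# Hypothetical CBN: x -> a, x -> b, a -> b, b -> c.
cbn = {"x": ["a", "b"], "a": ["b"], "b": ["c"]}
order = update_order(cbn, "x")
```

In this example b is a child of both x and a, so it must wait for a even
though it is directly adjacent to x. A caller would drop the target
variable from the returned order, since its distribution is estimated
separately (Section 5.5).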
WISE sequentially repeats this process for each statement in the
scenario specification. The updated dataset produced after each
statement serves as the input dataset for the next statement. Once
all the statements are executed, the dataset represents the joint
distribution of the variables for the entire what-if scenario.
When the causal structure has ambiguities, WISE proceeds as
follows. When the edge between two variables is undirected, WISE
maintains consistency by always updating the distribution of
one if the distribution of the other is updated. For the case of a
latent variable affecting two variables a and b, WISE assumes an
imaginary variable with directed edges to a and b, and uses the
resulting structure to traverse the graph while preparing the input
distribution.
5.5 Estimating Response Time Distribution
Finding the new distribution of response time is also a case of the
intervention effect evaluation process. We use a non-parametric
regression method to estimate the expected response-time distribu-
tion, instead of assigning a random value from the constrained em-
pirical distribution as in the previous section, because the designers
are interested in the expected values of the response time for each
request. In particular, we use a standard Kernel Regression (KR)
method, with a radial basis Kernel function (see [24, 26] for de-
tails) to estimate the response time for each request in the dataset.
To address the computational complexity, WISE applies the KR in
a piece-wise manner; the details follow in the next section.
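As an illustration, the following Python sketch shows one common form
of kernel regression, the Nadaraya-Watson estimator, with a Gaussian
radial basis kernel. The section does not state which form WISE uses,
so the choice of estimator, the bandwidth parameter, and the
single-feature toy data are assumptions made for the example. The
inputs would be the parents of the response-time variable, and the
output is the expected response time for one request.

```python
import math

def rbf_kernel(x, y, bandwidth):
    """Gaussian radial basis kernel between two equal-length points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def kernel_regress(train_x, train_y, query, bandwidth=1.0):
    """Nadaraya-Watson estimate: kernel-weighted mean of training responses."""
    weights = [rbf_kernel(x, query, bandwidth) for x in train_x]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from all training points")
    return sum(w * y for w, y in zip(weights, train_y)) / total

# Hypothetical training data: one feature (e.g. an RTT in ms) per request
# and the observed response time.
train_x = [(10.0,), (20.0,), (30.0,)]
train_y = [100.0, 200.0, 300.0]

estimate_mid = kernel_regress(train_x, train_y, (20.0,), bandwidth=5.0)
estimate_low = kernel_regress(train_x, train_y, (10.0,), bandwidth=5.0)
```

Training points close to the query dominate the weighted mean, so the
estimate follows local structure in the data without assuming a
parametric form.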
5.6 Addressing the Computational Scalability
Because CDNs are complex systems, the response time may de-
pend on a large number of variables, and the dataset might com-
prise hundreds of millions of requests spanning a large multi-
dimensional space. To efficiently evaluate the what-if scenarios,
WISE must address how to efficiently organize and utilize the
dataset. In this section, we discuss our approach to these problems.
1. Curse of Dimensionality: As the number of dimensions (variables
in the dataset) grows, exponentially more data is needed for
similar accuracy of estimation. WISE uses the CBN to mitigate this
problem. In particular, because when conditioned on its parents,
a variable is independent of all variables except its descendants,
we can use only the parents of the target variable in the regression
function. Because the cardinality of the parent-set would typically
be less than the total number of variables in the dataset, the accu-
racy of the regression function is significantly improved for a given
amount of data. Due to this, WISE can afford to use fewer training
data points with the regression function and still get good accuracy.
Also, because the time complexity for the KR method is O(kn^3),
with k variables and n points in the training dataset, WISE’s
technique results in significant computational speedup.
2. Overfitting: The density of the dataset from a real deployment
can be highly irregular; usually there are many points for combi-
nations of variable values that represent the normal network oper-
ation, while the density of dataset is sparser for combinations that
represent the fringe cases. Unfortunately, because the underlying
principle of most regression techniques is to find parameters that
minimize the errors on the training data, we can end up with param-
eters that minimize the error for high density regions of the dataset
but give poor results in the fringes; this problem is called
overfitting. The usual solution to this problem is to introduce a
smoothness penalty in the objective function of the regression method,
but finding the right penalty function requires cross-validation, which
is usually at least quadratic in the size of the global dataset.¹ In the
case of CDNs, even one day of data may contain entries for millions
of requests, which makes the quadratic complexity of these
algorithms inherently unscalable.
WISE uses piece-wise regression to address this problem. WISE
divides the dataset into small pieces, which we refer to as tiles, and
performs regression independently for each tile. WISE further
prunes the global dataset to produce a training dataset so that the
density of training points is more or less even across all the tiles.
To decompose the dataset, WISE uses fixed-size buckets for each
dimension in the dataset for most of the variable value space. If the
bucket sizes are sufficiently small, having more data points beyond
a certain threshold does not contribute appreciably to the
response-time prediction accuracy. With this in mind, WISE uses two
thresholds, n_min and n_max, and determines tile boundaries as follows.
WISE decides on a bucket width b_i along each dimension i, and
forms boundaries in the space of the dataset at integer multiples of
b_i along each dimension; for categorical variables, WISE uses each
category as a separate bucket. For each tile, WISE obtains a uniform
random subset of at most n_max points that lie within the tile
boundaries from the global dataset and adds them to the training
dataset. If the dataset has fewer than n_min data points for the tile,
the tile boundaries are folded to merge it with a neighboring tile. This
process is repeated until the tile has at least n_min points.
Ultimately, most of the tiles have regular boundaries, but for some
tiles, especially those on the fringes, the boundaries can be irregular.
Once the preparation of training data is complete, we use
cross-validation to derive regression parameters for each tile; the
complexity is now only O(n_max^2) for each tile.
3. Retrieval of Data: With large datasets, even mundane tasks,
such as retrieving training and test data during the input distribution
preparation and response-time estimation phases, are challenging.
Quick data retrieval and processing is imperative here because both
of these stages are online, in the sense that they are evaluated when
the designer specifies the scenario.
WISE expedites this process by intelligently indexing the training
data off-line and the test data as it is created. Tiles are used here as
well: Each tile is assigned a tile-id, which is simply a string formed
by concatenating the tile’s boundaries in each dimension. All the
data points that lie in the tile boundaries are assigned the tile-id as a
key that is used for indexing. For the data preparation stage, WISE
performs the tile-id assignment and indexing along the dimensions
comprising the parents of most commonly used variables, and for
the regression phase, the tiling and indexing is performed for the
dimensions comprising the parents of the target variable. Because
the tile-ids use fixed length bins for most of the space, mapping of
a point to its tile can be performed in constant time for most of the
data-points using simple arithmetic operations.
4. Parallelization and Batching: We have carefully designed the
various stages in WISE to support parallelization and batching of
jobs that use similar or the same data. In the training data preparation
stage, each entry in the dataset can be independently assigned its
tile-id based key because WISE uses regular sized tiles. Similarly,
the regression parameters for each tile can be learned independently
and in parallel. In the input data preparation stage, WISE batches
the test and training data that belong in a tile and fetches the data
from the training data for all of these together. Finally, because WISE
uses piece-wise regression to evaluate the effects of intervention,
it can batch the test and training data for each tile; further, because
the piece-wise computations are independent, they can take place in
parallel.
¹ Techniques such as in [10] can reduce the complexity for such
N-body problems but are still considerably more complex than the
approximations that WISE uses.
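The tiling scheme described in this section can be sketched in Python
as follows. The sketch covers the constant-time mapping of a point to
its tile-id and the capping of each tile at n_max uniformly sampled
points; it only reports tiles that fall below n_min and does not
implement the folding of sparse tiles into their neighbors. The "|"
separator in the tile-id, the function names, and the toy (rtt, region)
points are assumptions made for the example.

```python
import math
import random
from collections import defaultdict

def tile_id(point, widths):
    """Map a point to its tile-id using simple arithmetic.

    Numeric dimensions are bucketed at integer multiples of the bucket
    width; a width of None marks a categorical dimension, where each
    category is its own bucket.
    """
    parts = []
    for value, width in zip(point, widths):
        if width is None:
            parts.append(str(value))
        else:
            parts.append(str(math.floor(value / width) * width))
    return "|".join(parts)

def build_training_set(points, widths, n_min, n_max, rng):
    """Keep at most n_max uniformly sampled points per tile and list the
    tiles with fewer than n_min points (candidates for merging)."""
    tiles = defaultdict(list)
    for p in points:
        tiles[tile_id(p, widths)].append(p)
    training, sparse = {}, []
    for tid, members in tiles.items():
        if len(members) > n_max:
            members = rng.sample(members, n_max)
        if len(members) < n_min:
            sparse.append(tid)
        training[tid] = members
    return training, sparse

# Hypothetical points: (rtt in ms, region), with 10 ms buckets for rtt.
points = [(10.0 + i, "eu") for i in range(5)] + [(95.0, "as")]
training, sparse = build_training_set(points, (10, None), n_min=2,
                                      n_max=3, rng=random.Random(0))
```

Because each point's tile-id depends only on that point, the assignment
step can run independently on every entry, which is what makes it
suitable for the parallel processing described in item 4.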
6. IMPLEMENTATION
We have implemented WISE with the Map-Reduce framework [16]
using the Sawzall logs processing language [22] and
Python Map-Reduce libraries. We chose this framework to best
exploit the parallelization and batching opportunities offered by the
WISE design.² We have also implemented a fully functional prototype
of WISE using a Python front-end and a MySQL backend that
can be used for small-scale datasets. We provide a brief overview
of the Map-Reduce based implementation here.
Most steps in WISE are implemented using a combination of one
or more of the four Map-Reduce patterns shown in Figure 5. WISE
uses the filter pattern to obtain conditional subsets of the dataset for
various stages. WISE uses the Tile-id Assignment pattern for
preparing the training data. We set the n_min and n_max thresholds to
20 and 50, respectively, to achieve 2-5% confidence intervals. In
the input data preparation phase, the use-statement is implemented
using the filter pattern. The update-statements use the update pattern
for applying the new values to the variable in the statement. If the
update-statement uses the INTERVENE keyword, then WISE uses
the Training & Test Data Collation pattern to bring together the
relevant test and training data and update the distribution of the test
data in a batched manner. Each update-statement is immediately
followed by the Tile-id Assignment pattern because the changes in
the value of the data may necessitate re-assignment of the tile-id.
Finally, WISE uses the Training & Test Data Collation pattern for
piece-wise regression. Our Map-Reduce based implementation can
evaluate typical scenarios in about 5 minutes on a cluster of 50 PCs
while using nearly 500 GB of training data.
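The following single-process Python sketch gives a rough idea of how a
collation step keyed on the tile-id could be expressed in map-reduce
style. It is a guess at the general shape of the Training & Test Data
Collation pattern based on the description above, not a reproduction of
Figure 5 or of the Sawzall implementation. The record fields, the
function names, and the use of the tile mean as a stand-in for the
per-tile regression are all assumptions.

```python
from collections import defaultdict

def map_phase(records, source):
    """Emit (tile_id, (source, record)) pairs; tile_id is the shuffle key."""
    for rec in records:
        yield rec["tile_id"], (source, rec)

def shuffle(pairs):
    """Group mapped values by key, as a framework does before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(tile, values):
    """Process one tile: predict each test record from the tile's training
    data. The tile mean stands in for the per-tile regression."""
    train = [rec["rt"] for src, rec in values if src == "train"]
    tests = [rec for src, rec in values if src == "test"]
    if not train:
        return []
    mean_rt = sum(train) / len(train)
    return [(rec["id"], mean_rt) for rec in tests]

# Hypothetical records that already carry a tile-id.
training = [
    {"tile_id": "t1", "rt": 100.0},
    {"tile_id": "t1", "rt": 120.0},
    {"tile_id": "t2", "rt": 300.0},
]
test = [{"tile_id": "t1", "id": "q1"}, {"tile_id": "t2", "id": "q2"}]

pairs = list(map_phase(training, "train")) + list(map_phase(test, "test"))
predictions = {}
for tile, values in shuffle(pairs).items():
    predictions.update(reduce_phase(tile, values))
```

Each call to reduce_phase touches only the records of one tile, so the
calls are independent and could run on different machines.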
7. EVALUATING WISE FOR A REAL CDN
In this section, we describe our experience applying WISE to a
large dataset obtained from Google’s global CDN for Web-search
service. We start by briefly describing the CDN and the service
architecture. We then describe the dataset from this CDN and the
causal structure discovered using WCD. Finally, we evaluate WISE’s
ability to predict the response-time distribution for the what-if
scenarios.
7.1 Web-Search Service Architecture
Figure 6(a) shows Google’s Web-search service architecture.
The service comprises a system of globally distributed HTTP reverse
proxies, referred to as Front Ends (FEs), and a system of globally
distributed clusters that house the Web servers and other core
services (the Back End, or BE). A DNS based request redirection
system redirects the user’s queries to one of the FEs in the CDN.
The FE process forwards the queries to the BE servers, which
generate dynamic content based on the query. The FE caches static
portions of a typical reply, and starts transmitting that part to the
requesting user as it waits for a reply from the BE. Once the BE
replies, the dynamic content is also transmitted to the user. The FE
servers may or may not be co-located in the same data center with the
BE servers. If they are co-located, they can be considered to be on the
same local area network and the round-trip latency between them is
only a few milliseconds. Otherwise, the connectivity between the
FE and the BE is typically on a well-provisioned connection on the
public Internet. In this case the latency between the FE and BE can
be several hundred milliseconds.
² Hadoop [11] provides an open-source Map-Reduce library. Modern
data-warehousing appliances, such as the ones by Netezza [18],
can also exploit the parallelization in the WISE design.
Figure 5: Map-Reduce patterns used in the WISE implementation.
The server response time for a request is the time between the
instant when the user issues the HTTP request and the instant
when the last byte of the response is received by the user. We
estimate this value as the sum of the round-trip time estimate obtained
from the TCP three-way handshake and the time between the instant
when the request is received at the FE and the instant when the last
byte of the response is sent by the FE to the user. The key contributors
to server response time are: (i) the transfer latency of the request
from the user to the FE; (ii) the transfer latency of the request to the
BE and the transfer latency of sending the response from the BE to the
FE; (iii) processing time at the BE; (iv) TCP transfer latency of the
response from the FE to the client; and (v) any latency induced by
loss and retransmission of TCP segments.
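The estimate described above is a simple sum, shown here as a small
Python function. The function and parameter names are hypothetical, and
the timestamps in the example are invented values in seconds.

```python
def estimate_response_time(handshake_rtt, request_received_at,
                           last_byte_sent_at):
    """Estimated server response time in seconds: the handshake RTT plus
    the FE-side interval from request arrival to the last byte sent."""
    if last_byte_sent_at < request_received_at:
        raise ValueError("last byte cannot be sent before the request arrives")
    return handshake_rtt + (last_byte_sent_at - request_received_at)

# Example: 80 ms handshake RTT, 350 ms between request arrival at the FE
# and the last response byte leaving the FE.
rt = estimate_response_time(0.080, 12.300, 12.650)
```

Both time intervals are measured at the FE, so the estimate needs no
instrumentation on the user's side.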
Figure 6(b) shows the process by which a user’s Web search
query is serviced. This message exchange has three features that
affect service response time in subtle ways, making it hard to make
accurate “back-of-the-envelope calculations” in the general case:
1. Asynchronous transfer of content to the user. Once the TCP
handshake is complete, the user’s browser sends an HTTP request
containing the query to the FE. While the FE waits on a reply from
the BE, it sends some static content to the user; this content—
essentially a “head start” on the transfer—is typically brief and
constitutes only a couple of IP packets. Once the FE receives the
response from the BE, it sends the response to the client and com-
pletes the request. A client may use the same TCP connection for
subsequent HTTP requests.
2. Spliced TCP connections. FE processes maintain several TCP
connections with the BE servers and reuse these connections for
forwarding user requests to the BE. The FE also supports HTTP
pipelining, allowing the user to have multiple pending HTTP requests on
the same TCP connection.
3. Spurious retransmissions and timeouts. Because most Web
requests are short TCP transfers, the duration of the connection is
not sufficient to estimate a good value for the TCP retransmit timer,
and many Web servers use default values for retransmits or estimate
the timeout value from the initial TCP handshake round-trip
time. This causes spurious retransmits for users with slow access
links and high serialization delays for MTU-sized packets.
Figure 6: Google’s Web-search service architecture and message
exchange for a request on a fresh TCP connection. (a) Google’s
Web-search Service Architecture. (b) Message Exchange.
7.2 Data
We use data from an existing network monitoring infrastructure
in Google’s network. Each FE cluster has network-level sniffers,
located between the FE and the load-balancers, that capture traffic
and export streams in tcpdump format. A similar monitoring infras-
tructure captures traffic in the BE. Although the FE and BE servers
use NTP for time synchronization, it is difficult to collate the traces
from the two locations using only the timestamps. Instead, we use
a hash of each client’s IP address, port, and part of the query, along
with the timestamp, to collate requests between the FE and the BE. WISE
then applies the relevance tests (see Section 5.1) to the features in the
dataset collected in this manner. Table 1 describes the variables that
WISE found to be relevant to the service response-time variable.
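A minimal sketch of such hash-based collation follows. It is an
illustration under stated assumptions, not the authors' tooling: the
record fields (ip, port, query, ts), the 16-character query prefix, the
SHA-1 hash, and the one-second skew window are all hypothetical
choices, and the sketch assumes that both traces expose the client
address and port.

```python
import hashlib
from collections import defaultdict


def request_key(client_ip: str, client_port: int, query: str,
                prefix_len: int = 16) -> str:
    """Hash of the client's IP, port, and a prefix of the query string."""
    material = f"{client_ip}|{client_port}|{query[:prefix_len]}"
    return hashlib.sha1(material.encode("utf-8")).hexdigest()


def collate(fe_records, be_records, max_skew: float = 1.0):
    """Pair each FE record with the BE record that shares its key and has
    the closest timestamp within max_skew seconds.

    Returns a list of (fe_record, be_record) pairs. FE records with no
    candidate are dropped. A BE record may be reused by several FE
    records; a stricter matcher would enforce one-to-one pairing.
    """
    be_by_key = defaultdict(list)
    for rec in be_records:
        be_by_key[request_key(rec["ip"], rec["port"], rec["query"])].append(rec)

    pairs = []
    for fe in fe_records:
        key = request_key(fe["ip"], fe["port"], fe["query"])
        candidates = [be for be in be_by_key.get(key, [])
                      if abs(be["ts"] - fe["ts"]) <= max_skew]
        if candidates:
            best = min(candidates, key=lambda be: abs(be["ts"] - fe["ts"]))
            pairs.append((fe, best))
    return pairs


# Made-up example records (documentation-range addresses).
fe_trace = [
    {"ip": "192.0.2.1", "port": 40000, "query": "wise sigcomm", "ts": 10.00},
    {"ip": "192.0.2.1", "port": 40000, "query": "wise sigcomm", "ts": 50.00},
    {"ip": "192.0.2.9", "port": 1234, "query": "other", "ts": 10.00},
]
be_trace = [
    {"ip": "192.0.2.1", "port": 40000, "query": "wise sigcomm", "ts": 10.02},
    {"ip": "192.0.2.1", "port": 40000, "query": "wise sigcomm", "ts": 50.03},
]
pairs = collate(fe_trace, be_trace)
```

The key alone cannot separate repeated queries on a reused connection,
which is why the timestamp is still needed to pick among candidates
that share a key.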
7.3 Causal Structure in the Dataset
To obtain the causal structure, we use a small sampled subset of the
data collected on June 19, 2007, from several data center locations.
This dataset has roughly 25 million requests, from clients in 12,877
unique ASes.
We seed the WCD algorithm with the region and ts variables as
the no-cause variables. Figure 7 shows the causal structure that
WCD produces. Most of the causal relationships in Figure 7 are
straightforward and make intuitive sense in the context of networking,
but a few relationships are quite surprising. WCD detects a
relationship between the region and sB attribute (the size of the
result page); we found that this relationship exists due to the
differences in the sizes of search response pages in different languages
and regions. Another unexpected relationship is between the region,
cP and sP attributes; we found that this relationship exists due to
different MTU sizes in different parts of the world. Our dataset,
unfortunately, did not have load, utilization, or data center capacity
variables that could have allowed us to model the be_time variable.
All we observed was that the be_time distribution varied somewhat
among the data centers. Overall, we find that the WCD algorithm
not only discovers relationships that are faithful to how networks
operate but also discovers relationships that might escape trained
network engineers.
Figure 7: Inferred causal structure in the dataset. A → B
means A causes B. (Graph nodes: region, ts, tod, fe, be, rtt,
febe_rtt, be_time, sB, sP, cP, crP, srP, rt.)
Crucially, note that many variables are not direct children of the
region, ts, fe or be variables. This means that, when conditioned
on their respective parents, these variables are independent of the
region, time, and choice of FE and BE, so we can use training data from
the past, from different regions, and from different FE and BE data
centers to estimate the distributions for these features. Further, while
most of the variables in the dataset are correlated, the in-degree of
each variable is smaller than the total number of variables. This
reduces the number of dimensions that WISE must consider when
estimating the values of the variables during scenario evaluation,
allowing WISE to produce accurate estimates more quickly and with
less data.
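The structural argument above can be checked mechanically on any
parent map. The sketch below uses a hypothetical toy graph with made-up
variables a to d; its edges are not the ones in Figure 7, and the
function name is an assumption. It lists the variables that have no
direct parent among a set of context variables, which are the ones
whose conditional distributions could be learned from data pooled
across those contexts, and it reports the largest in-degree.

```python
def poolable_variables(parents, context_vars):
    """Return variables that have no direct parent among context_vars.

    parents maps each variable to the list of its direct causes.
    """
    context = set(context_vars)
    return sorted(
        var for var, causes in parents.items()
        if var not in context and not context.intersection(causes)
    )


# Hypothetical toy graph; these edges are NOT the ones in Figure 7.
toy_parents = {
    "region": [],
    "ts": [],
    "a": ["region"],     # direct child of region: not poolable
    "b": ["a"],          # depends on region only through a
    "c": ["a", "b"],
    "d": ["ts", "c"],    # direct child of ts: not poolable
}

pooled = poolable_variables(toy_parents, context_vars=["region", "ts"])
max_in_degree = max(len(causes) for causes in toy_parents.values())
print(pooled, max_in_degree)
```

In this toy graph, estimating any one variable requires conditioning on
at most two parents rather than on all five other variables.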
7.4 Response-Time Estimation Accuracy
Our primary metric for evaluation is prediction accuracy. There
are two sources of error in response-time prediction: (i) error in
the response-time estimation function (Section 5.5) and (ii) inaccurate
input, or error in estimating a valid input distribution that is repre-
sentative of the scenario (Section 5.4). To isolate these errors, we
first evaluate the estimation accuracy alone and later consider the
overall accuracy for a complete scenario in Section 7.5.
To evaluate the accuracy of the piece-wise regression method in
isolation, we can evaluate the scenario: “What if I make no
changes to the network?” This scenario is easy to specify with
WSL by not including any optional scenario update statements.
For example, a scenario specification with the following line:
USE WHERE country==deu
would produce an input distribution for the response-time estimation
function that is representative of users in Germany without any
error, so any inaccuracies that arise would be due to the regression
method. To demonstrate the prediction accuracy, we present results
for three such scenarios:
(a) USE WHERE country==deu
(b) USE WHERE country==zaf
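The idea behind a no-change scenario can be sketched in a few lines:
select the observed requests that match the USE WHERE condition, feed
their unmodified features to an estimator, and compare the predictions
with the observed response times. Everything in the sketch is an
assumption made for illustration: the row fields, the stand-in
estimator, and the use of median relative error as the metric. It does
not reproduce WSL or the paper's regression method.

```python
from statistics import median


def use_where(rows, **conditions):
    """Keep the rows whose fields equal every given condition."""
    return [r for r in rows
            if all(r.get(k) == v for k, v in conditions.items())]


def no_change_error(rows, estimator):
    """Median relative error of estimator(row) against the observed rt."""
    errors = [abs(estimator(r) - r["rt"]) / r["rt"] for r in rows]
    return median(errors)


# Made-up observations: round-trip time and response time in seconds.
observations = [
    {"country": "deu", "rtt": 0.05, "rt": 0.20},
    {"country": "deu", "rtt": 0.10, "rt": 0.25},
    {"country": "deu", "rtt": 0.02, "rt": 0.16},
    {"country": "zaf", "rtt": 0.30, "rt": 0.90},
]


def toy_estimator(row):
    """Hypothetical stand-in for a learned response-time function."""
    return 2 * row["rtt"] + 0.1


deu_rows = use_where(observations, country="deu")
deu_error = no_change_error(deu_rows, toy_estimator)
print(len(deu_rows), deu_error)
```

Because the inputs are the observed rows themselves, any error measured
this way comes from the estimator and not from the input distribution.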
Page 12
the complexity and using past behavior to predict TCP through-
put. He et al. [12] evaluate predictability using short-term history,
and Mirza et al. [17] use machine-learning techniques to estimate
TCP throughput; these techniques tend to be more accurate. We
also use machine learning and statistical inference in our work, but
the techniques of [17] are not directly applicable because they rely on
estimating path properties immediately before making a prediction.
Further, they do not provide a framework for evaluating what-if
scenarios. The parametric techniques, as we show in Section 7.4,
are unfortunately not very accurate for predicting response time.
A recent body of work has explored the use of Bayesian inference
for fault and root-cause diagnosis. SCORE [15] uses spatial correlation
and shared risk group techniques to find the best possible
explanation for observed faults in the network. Shrink [14]
extends this model to a probabilistic setting, because the dependencies
among the nodes may not be deterministic due to incomplete
information or noisy measurements. Sherlock [4] additionally
finds causes for poor performance and also models fail-over and
load-balancing dependencies. Rish et al. [21] combine dependency
graphs with active probing for fault diagnosis. None of these works,
however, addresses the evaluation of what-if scenarios for networks.
11. CONCLUSION
Network designers must routinely answer questions about how
specific deployment scenarios affect the response time of a service.
Without a rigorous method for evaluating such scenarios, the net-
work designers must rely on ad hoc methods or resort to costly field
deployments to test their ideas. This paper has presented WISE, a
tool for specifying and accurately evaluating what-if deployment
scenarios for content distribution networks. To our knowledge,
WISE is the first tool to automatically derive causal relationships
from Web traces and apply statistical intervention to predict networked
service performance. Our evaluation demonstrates that
WISE is both fast and accurate: it can predict response-time distributions
in what-if scenarios to within an 11% error margin. WISE
is also easy to use: its scenario specification language makes it easy
to specify complex configurations in just a few lines of code.
In the future, we plan to use similar techniques to explore how
causal inference can help network designers better understand
dependencies in their networks that go beyond performance-related
issues. WISE represents an interesting point in the
design space because it leverages almost no domain knowledge to
derive causal dependencies; perhaps what-if scenario evaluators in
other domains that rely almost exclusively on domain knowledge
(e.g., [8]) could also leverage statistical techniques to improve
accuracy and efficiency.
Acknowledgments
We would like to thank Andre Broido and Ben Helsley at Google,
and the anonymous reviewers for the valuable feedback that helped
improve several aspects of our work. We would also like to thank Jeff
Mogul for sharing source code for the methods in [2].
12. REFERENCES
[1] Akamai Technologies. www.akamai.com
[2] M. Arlitt, B. Krishnamurthy, J. Mogul. Predicting
short-transfer latency from TCP arcana: A trace-based
validation. IMC’2005.
[3] L.A. Barroso, J. Dean, U. Hölzle. Web Search for a Planet:
The Google Cluster Architecture. IEEE Micro. Vol. 23, No.
2. pp. 22–28
[4] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. Maltz, M.
Zhang. Towards Highly Reliable Enterprise Network
Services via Inference of Multi-level Dependencies. ACM
SIGCOMM 2007.
[5] N. Cardwell, S. Savage, T. Anderson. Modeling TCP
Latency. IEEE INFOCOM 2000.
[6] G. Cooper. A Simple Constraint-Based Algorithm for
Efficiently Mining Observational Databases for Causal
Relationships. Data Mining and Knowledge Discovery 1,
203-224. 1997.
[7] Emulab Network Testbed. http://www.emulab.net
[8] N. Feamster and J. Rexford. Network-Wide Prediction of
BGP Routes. IEEE/ACM Transactions on Networking. Vol.
15. pp. 253–266
[9] M. Freedman, E. Freudenthal, D. Mazieres. Democratizing
Content Publication with Coral. USENIX NSDI 2004.
[10] A. Gray, A. Moore. ‘N-Body’ Problems in Statistical
Learning. Advances in Neural Information Processing
Systems 13. 2000.
[11] Lucene Hadoop. http://lucene.apache.org/hadoop/
[12] Q. He, C. Dovrolis, M. Ammar. On the Predictability of
Large Transfer TCP Throughput. ACM SIGCOMM 2006.
[13] A. Barbir, et al. Known Content Network Request Routing
Mechanisms. IETF RFC 3568. July 2003.
[14] S. Kandula, D. Katabi, J. Vasseur. Shrink: A Tool for Failure
Diagnosis in IP Networks. MineNet Workshop SIGCOMM
2005.
[15] R. Kompella, J. Yates, A. Greenberg, A. Snoeren. IP Fault
Localization Via Risk Modeling. USENIX NSDI 2005.
[16] J. Dean and S. Ghemawat. MapReduce: Simplified Data
Processing on Large Clusters. USENIX OSDI 2004.
[17] M. Mirza, J. Sommers, P. Barford, X. Zhu. A Machine
Learning Approach to TCP Throughput Prediction. ACM
SIGMETRICS 2007.
[18] Netezza. http://www.netezza.com/
[19] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling
TCP Throughput: A Simple Model and its Empirical
Validation. IEEE/ACM Transactions on Networking. Vol. 8.
pp. 135–145
[20] J. Pearl. Causality: Models, Reasoning, and Inference.
Cambridge University Press. 2003.
[21] I. Rish, M. Brodie, S. Ma. Efficient Fault Diagnosis Using
Probing. AAAI Spring Symposium on DMDP. 2002.
[22] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan.
Interpreting the Data: Parallel Analysis with Sawzall.
Scientific Programming Journal. Vol. 13. pp. 227–298.
[23] P. Spirtes, C. Glymour. An Algorithm for fast recovery of
sparse causal graphs. Social Science Computer Review 9.
USENIX Symposium on Internet Technologies and Systems.
1997.
[24] M. Tariq, A. Zeitoun, V. Valancius, N. Feamster, M. Ammar.
Answering “What-if” Deployment and Configuration
Questions with WISE. Georgia Tech Technical Report
GT-CS-08-02. February 2008.
[25] L. Wasserman. All of Statistics: A Concise Course in
Statistical Inference. Springer Texts in Statistics. 2003.
[26] J. Wolberg. Data Analysis Using the Method of Least
Squares. Springer. Feb 2006.