Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring
- ISBN: 9781450308007
- DOI: 10.1145/2018602.2018615
Abstract
Content distribution networks (CDNs) need to make decisions, such as server selection and routing, to improve performance for their clients. The performance may be limited by various factors such as packet loss in the network, a small receive buffer at the client, or constrained server CPU and disk resources. Conventional measurement techniques are not effective for distinguishing these performance problems: application-layer logs are too coarse-grained, while network-level traces are too expensive to collect all the time. We argue that passively monitoring the transport-level statistics in the server's network stack is a better approach. This paper presents a tool for monitoring and analyzing TCP statistics, and an analysis of a CoralCDN node in PlanetLab for six weeks. Our analysis shows that more than 10% of connections are server-limited at least 40% of the time, and many connections are limited by the congestion window despite no packet loss. Still, we see that clients in 377 Autonomous Systems (ASes) experience persistent packet loss. By separating network congestion from other performance problems, our analysis provides a much more accurate view of the performance of the network paths than what is possible with server logs alone.
Peng Sun, Minlan Yu, Michael J. Freedman, Jennifer Rexford
Princeton University
{pengsun, minlanyu, mfreed, jrex}@cs.princeton.edu
1. INTRODUCTION
Content distribution networks (CDNs) run replicated
servers to deliver content to a large number of clients.
To optimize data-transfer performance, CDNs need to
pick the right server for each group of clients, and se-
lect routing paths that traverse the Internet quickly [14].
To make better decisions, CDNs should first identify the
performance bottlenecks. CDNs can then optimize their
software or upgrade their machines if the servers are
limiting the performance, work with their ISPs if net-
work performance is spotty, or notify clients if a small
receive buffer is the bottleneck (e.g., through an em-
bedded message in the Web page, or a modified HTTP
header).
To diagnose performance problems, CDNs need to
collect and analyze measurement data. However, ex-
isting measurement techniques are either too coarse-
grained for diagnosing problems, or too expensive for
achieving good coverage.
Application-layer logs [5, 6], while useful for cap-
turing download times and average throughput, are too
coarse-grained to uncover why clients experience perfor-
mance problems. Server logs cannot distinguish whether
a connection is limited by network congestion, slow ser-
ver writes, a small receive buffer, or a long round-trip
time (RTT). This also makes it difficult to diagnose con-
gestion by correlating performance across connections
[12], because many connections are limited by other fac-
tors.
Network-layer packet traces [16, 17] are useful
for measuring packet loss and inferring TCP connec-
tion state (e.g., slow start and receive window size), but
they are too expensive to capture all the time. In ad-
dition, real-time inference of internal TCP state (e.g.,
number of bytes in the send buffer, whether a trans-
fer is congestion-window limited, etc.) from in-network
traces is challenging, especially if the variant of TCP
running on each end point is not known in advance.
Active probing is effective for inferring the proper-
ties of network paths [13, 11, 14, 10]. However, these
techniques rely on a representative set of nodes to launch
probe traffic, and introduce extra load on the network.
In addition, active probing does not provide direct vis-
ibility into bottlenecks on the server machines.
Instead of, or in addition to, these techniques, we believe
CDNs should capitalize on transport-layer statistics [15]
readily available in the server network stack [1].
These statistics directly reveal whether a connection is
constrained by the server (e.g., too little data in the
send buffer), the congestion window (e.g., a small win-
dow during slow start), network congestion (e.g., packet
loss), or the client buffer (e.g., a small receive window).
We built a measurement tool to collect and analyze the
TCP statistics, and applied our tool to CoralCDN [8]
running on PlanetLab. In addition to characterizing
performance bottlenecks, we identified congested net-
work paths by correlating packet losses across connec-
tions to clients in the same AS. This is more accurate
than the analysis based on throughput (e.g., from server
logs), since many connections are not limited by the network.
The remainder of the paper is structured as follows.
We first discuss the design and implementation of our
in-stack monitoring tool in Section 2. Then, Section 3
presents the results from applying our tool to Coral-
CDN. These specific results do not necessarily apply to
other CDNs, since PlanetLab servers are resource con-
strained. Still, we believe our techniques are broadly
useful for other CDNs and cloud services in understand-
ing and improving user-perceived performance. In Sec-
tion 4 we demonstrate how to identify congested net-
work paths by correlating packet losses across connec-
tions to clients in the same AS. We conclude the paper
in Section 5.
2. MEASUREMENT FRAMEWORK
In this section, we first describe how we collect in-
stack information. Then we show how to classify the
performance problems for each connection, and use two
example traces to illustrate different classes of performance
bottlenecks. Our tool further correlates network performance
problems across clients in the same AS, as discussed in
Section 4.
2.1 Measuring TCP Statistics
Our measurement tool polls the network stack to collect
connection-level statistics via the Web100 tool [3]. Web100
is a kernel patch that exposes the TCP statistics the
kernel already maintains, and offers an API to access
their values in user space. The TCP-level statistics we
collect are shown in Table 1. They directly reveal the
performance problems of each connection, with less
overhead than packet traces and more detail than CDN logs.
The data include two types of statistics: (i) instan-
taneous snapshots (e.g., Cwin, the current size of the
congestion window) and (ii) cumulative counters (e.g.,
RwinLimitTime, the total time the connection has been
receive-window limited). Figure 1 shows two connec-
tions from our measurements of CoralCDN, and they
illustrate how the TCP statistics evolve over the life of
the connection. Furthermore, our tool is lightweight: it
polls these statistics every 50ms and generates less
than 200MB of data per server per day in our
measurement of CoralCDN.
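The distinction between snapshots and cumulative counters matters when the statistics are polled: a snapshot can be used as-is, while a cumulative counter is only meaningful per interval once consecutive polls are subtracted. The following is a minimal Python sketch of that step, not the authors' tool. The statistic names follow Table 1; the `read_stats` callback, the `d` prefix on differenced counters, and the dictionary layout are assumptions made for illustration and are not part of the Web100 API.

```python
import time
from typing import Callable, Dict, Iterator

Stats = Dict[str, int]

# Names follow Table 1. Grouping them this way is our own convention.
SNAPSHOT = ("Cwin", "Rwin", "BytesInFlight", "BytesInSndBuf", "SmoothedRTT")
CUMULATIVE = ("BytesWritten", "BytesSent", "PktsRetrans",
              "RwinLimitTime", "CwinLimitTime")


def interval_view(prev: Stats, curr: Stats) -> Stats:
    """Turn two consecutive polls into one per-interval record.

    Snapshots are copied from the newer poll; cumulative counters are
    differenced so each value describes only the last interval.
    """
    view = {name: curr[name] for name in SNAPSHOT}
    for name in CUMULATIVE:
        view["d" + name] = curr[name] - prev[name]
    return view


def poll(read_stats: Callable[[], Stats],
         period_s: float = 0.05,          # 50ms, as in the text
         samples: int = 3,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[Stats]:
    """Yield `samples` per-interval records for one connection.

    `read_stats` is a hypothetical callback standing in for whatever
    reads the connection's statistics from the network stack.
    """
    prev = read_stats()
    for _ in range(samples):
        sleep(period_s)
        curr = read_stats()
        yield interval_view(prev, curr)
        prev = curr
```

A record whose `dPktsRetrans` is positive then indicates that at least one retransmission happened during that 50ms interval, which is the signal the classifier in Section 2.2 relies on.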
2.2 TCP Performance Classifier
With the TCP-level statistics, we then characterize
the performance limitations for each connection. There
are four categories of performance limitations: the net-
work path, the server network stack, the clients, and
the CDN server applications. Since we have direct ac-
cess to the TCP stack, we can easily distinguish these
performance limitations using the TCP-level statistics
as summarized in Table 2.
Statistics      Definition
Cwin            Current congestion window
Rwin            Current receive window
BytesInFlight   # of bytes sent but not ACKed
BytesInSndBuf   # of bytes written but not ACKed
SmoothedRTT     Smoothed RTT computed by TCP
BytesWritten    Cumulative # of bytes written by app
BytesSent       Cumulative # of bytes sent
PktsRetrans     Cumulative # of pkts retransmitted
RwinLimitTime   Cumulative time that a connection
                is limited by receive window
CwinLimitTime   Cumulative time that a connection
                is limited by congestion window
Table 1: Key TCP Statistics in our Tool
(i) Network Path: Network congestion can limit a
connection's performance by shrinking the congestion
window. Our tool detects congestion when (a) the
connection is limited by the congestion window and (b)
a packet loss has occurred (i.e., the number of
retransmissions has increased). When the connection is
no longer limited by the congestion window, or the
congestion window returns to its size before the loss,
we mark the end of the impact of the loss. In Figure 1(a),
a packet loss happens at around 12 seconds, causing the
congestion window to decrease. The congestion window
takes the next five seconds to recover. During this
period, the key performance limitation is the reduced
congestion window size after the packet loss.
(ii) Server Network Stack: The congestion win-
dow can limit performance even when no losses occur,
particularly at the beginning of a connection (i.e., dur-
ing slow start). Our tool detects this situation when
(a) the congestion window limits performance and (b)
no loss has occurred or the impact of a previous loss
has ended. The period from 0s to 8s in Figure 1(a) and
the period from 0s to 2s in Figure 1(b) show examples.
During these periods, the small congestion window size
is the dominant performance limitation.
(iii) Clients: A small receive window size can also
limit the performance. We detect this situation when
the receive window is the bottleneck of the connection.
The connection in Figure 1(a) is receive-window limited
from 8s to 12s; the receive-window size drops greatly at
8s, limiting the data the server can send.
(iv) CDN Server Applications: The CDN server
may limit the performance because a slow CPU or scarce
disk resources constrain the amount of data written
into the TCP send buffer. We classify a connection
as application limited when the server has less data in
the send buffer than the congestion and receive win-
dow sizes (i.e., the connection is neither congestion-
window limited nor receive-window limited). Starting
from about 3s in Figure 1(b), the application does not
have enough data to send. The BytesInSndBuf (marker
+) and BytesInFlight (marker △, overlapping) are
smaller than the congestion and receive window sizes.
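The four rules above can be sketched as a small per-connection state machine. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the `Interval` record, the class and label names, and the use of per-interval increases in CwinLimitTime and RwinLimitTime as the "limited by" test are all our own choices. It also assigns a single label per interval, picking the window with the larger limit time when both are non-zero, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional

NETWORK_PATH = "network path"
SERVER_STACK = "server network stack"
CLIENT = "client"
APPLICATION = "CDN server application"


@dataclass
class Interval:
    """One polling interval for one connection (hypothetical layout)."""
    cwin: int            # Cwin snapshot at the end of the interval
    d_pkts_retrans: int  # increase in PktsRetrans during the interval
    d_cwin_limit: int    # increase in CwinLimitTime during the interval
    d_rwin_limit: int    # increase in RwinLimitTime during the interval


class Classifier:
    """Labels each interval with its dominant performance limitation."""

    def __init__(self) -> None:
        self.prev_cwin: Optional[int] = None
        # Congestion window size before the loss; set only while the
        # impact of a loss is still considered ongoing.
        self.pre_loss_cwin: Optional[int] = None

    def classify(self, iv: Interval) -> str:
        cwin_limited = (iv.d_cwin_limit > 0
                        and iv.d_cwin_limit >= iv.d_rwin_limit)
        rwin_limited = iv.d_rwin_limit > 0 and not cwin_limited

        if not cwin_limited:
            # No longer congestion-window limited: loss impact ends.
            self.pre_loss_cwin = None
        elif iv.d_pkts_retrans > 0:
            # New loss while congestion-window limited: remember the
            # window size from before the loss (first loss wins).
            if self.pre_loss_cwin is None:
                self.pre_loss_cwin = (self.prev_cwin
                                      if self.prev_cwin is not None
                                      else iv.cwin)
        elif (self.pre_loss_cwin is not None
              and iv.cwin >= self.pre_loss_cwin):
            # Window has recovered to its pre-loss size: impact ends.
            self.pre_loss_cwin = None
        self.prev_cwin = iv.cwin

        if cwin_limited:
            if self.pre_loss_cwin is not None:
                return NETWORK_PATH
            return SERVER_STACK
        if rwin_limited:
            return CLIENT
        return APPLICATION
```

Feeding the classifier one `Interval` per poll, in order, yields a label sequence for the connection; the fraction of intervals carrying each label then gives a rough per-connection breakdown of where time was lost.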
During each 50ms interval, a connection may have