Characterizing Application Runtime Behavior from System Logs and Metrics
Abstract
Large-scale systems are heavily shared resource environments where a mix of applications run concurrently and compete for network and storage resources. It is essential to characterize the runtime behavior of these applications in order to provision system resources and understand the impact of resource contention on an application’s performance. In this paper, we study the use of zero- and low-overhead system logs and other system metric data for characterizing the runtime behavior of several applications. We present our preliminary work on estimating an application’s I/O demands by observing its file system usage patterns over multiple runs, and on estimating an application’s network utilization by observing link-layer error logs. We also present preliminary findings on using such information in making context-sensitive scheduling decisions that minimize potentially negative interactions between applications competing for shared resources. Our analysis is based on four months of system log data collected on one of the world’s largest supercomputing facilities, the Jaguar XT5 petaflop system at Oak Ridge National Laboratory.
Raghul Gunasekaran, David Dillow,
Galen Shipman
Oak Ridge National Laboratory
{gunasekaranr,dillowda,gshipman}@ornl.gov
Richard Vuduc, Edmond Chow
Georgia Institute of Technology
{richie,echow}@cc.gatech.edu
Categories and Subject Descriptors
C.4 [Performance of Systems]: Performance Attributes
Keywords
Network, File system, Monitoring, Performance
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CACHES ’11, June 4th, 2011, Tucson, Arizona USA
Copyright 2011 ACM 978-1-4503-0760-4/11/06 ...$10.00.
1. INTRODUCTION
System logs and other collections of system metric data are
a rich source of information on the health of large-scale
computing systems. In this paper, we study the use of this type
of information for characterizing application runtime behavior,
such as an application’s network and I/O utilization.
Traditionally, application behavior is best characterized and
benchmarked using fine-grained performance tools on a quiescent
machine, an effort undertaken by application developers
in conjunction with performance experts. This study is
from the point of view of system administrators who have a
wealth of system log data and who are interested in maximizing
total application throughput. This approach is complementary
to the traditional one and has the following merits:
• characterization of an application in situ, that is, ap-
plications running with the specific inputs and param-
eters of actual users
• an ability to capture effects on an application due to its
interaction with others running on the same machine
• characterization of all applications with significant run-
time on a machine, rather than only those that have
been benchmarked individually.
System log events and metrics, collected regularly over
time and aggregated over the entire machine, are not identified
with specific applications, except for certain error messages.
This data, however, can be combined with scheduler logs
so that it is known which applications were running at the
time a specific measurement was taken. With enough data,
it may be possible to tease apart the behavior of individual
applications. Large quantities of data are critical in order
to identify the one or more usual ways that an application
is run. The data is very noisy in the sense that runs of
the same application with different parameters may lead to
very different network and I/O demands, as well as large
variations in other behaviors. We lessen the effect of these
variations by associating each application run with its user,
with the assumption that an application’s behavior under
a given user is more uniform in comparison to applications
considered independent of users. We also group application
runs into classes (such as large and small jobs), with the
similar assumption that application behavior within a class
is more uniform.
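The combination of machine-wide measurements with scheduler records described above can be sketched as follows. This is a minimal illustration, not the tooling used in this study: the `JobRun` record, the `size_class` threshold of 1000 nodes, and the sample data are all hypothetical. Each machine-wide sample is credited to every user-application that was running when it was taken, which is why many runs are needed before an individual profile emerges from the noise.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRun:
    """One scheduler-log record (hypothetical layout)."""
    user: str
    app: str
    nodes: int
    start: float  # seconds
    end: float    # seconds


def size_class(nodes, large_threshold=1000):
    """Hypothetical two-way split into 'large' and 'small' jobs."""
    return "large" if nodes >= large_threshold else "small"


def attribute_samples(samples, runs):
    """Group machine-wide samples by the user-applications active at sample time.

    samples: iterable of (timestamp, value) pairs aggregated over the machine.
    runs: iterable of JobRun records.
    Returns {(user, app, size_class): [values, ...]}. A sample is credited to
    every run active at that time, so each list also contains contributions
    from co-running applications.
    """
    grouped = defaultdict(list)
    for ts, value in samples:
        for run in runs:
            if run.start <= ts < run.end:
                key = (run.user, run.app, size_class(run.nodes))
                grouped[key].append(value)
    return dict(grouped)


runs = [
    JobRun("alice", "appA", 11000, 0, 100),
    JobRun("bob", "appB", 200, 50, 150),
]
samples = [(10, 5.0), (60, 9.0), (120, 2.0)]
profile = attribute_samples(samples, runs)
```

Keying on the (user, application, size class) triple mirrors the assumption that behavior is more uniform within a user and within a job class than across them.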
It is important to characterize an application in situ be-
cause an application’s runtime characteristics are dependent
on the specific user and task. Users, based on their science
needs, define various parameters at runtime, such as the
number of compute nodes, dimensions of the compute grid,
I/O usage (frequency of checkpointing), all of which deter-
mine the true runtime characteristic of an application. In
our observation of the jobs submitted to the Jaguar XT5,
every user has one or more fixed job allocation models. Monitoring
an individual user’s behavior over multiple runs provides
insight into the compute and resource utilization characteristics
of the specific “user-application.” It is these runtime
characteristics that are critical towards provisioning of system
resources and efficient scheduling of jobs.
We emphasize that we use zero- and negligible-overhead
system metrics so that application and system performance
are not impacted. In traditional benchmarking approaches, a
number of common tools [6, 2, 1, 11] are used to characterize
applications in terms of compute performance, communication
patterns, and I/O access patterns. These tools provide
fine-grained details of application behavior, which are much
needed for optimizing application libraries for specific compute
platforms. These tools, however, are not suited for continuous
monitoring of applications on the compute platform
because their compute overhead and bandwidth requirements
are significant. The compute overhead ranges between 2% and
8% for tools that scale on current petaflop systems. Bandwidth
and I/O requirements for collecting application trace
information are also large. As we move towards exa-scale,
even a 2% compute overhead is significant; bandwidth is an
even scarcer resource.
In summary, we propose to profile application runtime
characteristics using system logs and metrics that have zero
or negligible overhead. In comparison to traditional tools,
our approach does not provide fine-grained details but can
provide some insight into application behavior to support
runtime decisions to increase the overall throughput of the
machine. There are two primary contributions in this paper.
First, we identify a few metrics that can be used to characterize
an application’s runtime behavior. Second, we present, through
a few examples, our preliminary findings and approach towards
using application characteristics to inform scheduling decisions
and anomaly detection. The work presented in this
paper is a step towards understanding the feasibility of using
low-level system metrics to estimate application character-
istics. Our analysis is based on four months of system log
data collected on one of the world’s largest supercomput-
ing facilities, the Jaguar XT5 petaflop system at Oak Ridge
National Laboratory.
The long-term goal of our work is to use low-overhead sys-
tem metric data to understand and model the behavior of
applications when they are run simultaneously on large-scale
shared-resource machines. Large-scale systems are a heavily
shared resource environment, where a number of applica-
tions run concurrently and compete for the same resources,
such as network and I/O bandwidth. The plot in Fig. 1
illustrates the number of scientific applications running con-
currently on the Jaguar petaflop system (over 18k compute
nodes), observed during the year 2010. It is usual for 10
or more applications to be running simultaneously, where
typically one or more applications run on a large number
of nodes, and several other applications run on a smaller
number of nodes. Although every application is allocated
a set of dedicated compute nodes, the underlying 3D torus
interconnect and the file system are shared resources. In
this situation, it is not unusual for applications to be im-
pacted by other applications running concurrently, resulting
in longer runtimes.
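A concurrency profile of this kind can be derived from the start and end times in the scheduler log with a simple sweep over events. The sketch below is an illustration under stated assumptions: the function name and the toy intervals are hypothetical, and binning the counts into ranges such as 1-10 or 11-20 is left out.

```python
from collections import defaultdict


def concurrency_time_share(intervals):
    """Return {n_running: fraction_of_observed_time} from (start, end) job intervals."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal timestamps, -1 sorts before +1, so job ends are applied first.
    events.sort()

    time_at_level = defaultdict(float)
    running = 0
    prev_t = None
    for t, delta in events:
        if prev_t is not None and t > prev_t:
            # `running` jobs were active during the whole span [prev_t, t).
            time_at_level[running] += t - prev_t
        running += delta
        prev_t = t

    total = sum(time_at_level.values())
    if total == 0:
        return {}
    return {n: dt / total for n, dt in sorted(time_at_level.items())}


# Three toy jobs: one runs alone for 5 time units, then overlaps with two others.
share = concurrency_time_share([(0, 10), (5, 15), (5, 20)])
```

Any gap between jobs shows up as time at level 0, so idle periods are visible in the result rather than silently dropped.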
This understanding of application runtime behavior can
lead to better design of context-aware schedulers that minimize
contention for the platform’s shared resources and
thus improve overall system throughput. Models of the
interaction of applications could also lead to improved design
of hardware and software stacks for future throughput machines,
including heterogeneous architectures.
[Figure 1 plot, titled "% of Time Apps Running Concurrently". x-axis: # of Apps
Sharing Compute Platform (bins 1-10, 11-20, 21-30, 31-40, 41-50, 51-60, >60);
y-axis: % of Compute Time (year: 2010), scale 0 to 30.]
Figure 1: Number of applications running concurrently
on Jaguar XT5
2. THE COMPUTE INFRASTRUCTURE
Jaguar, a Cray XT5 system with a peak performance
of 2.3 petaflops, is the primary compute platform
at the Oak Ridge Leadership Computing Facility (OLCF).
The system comprises 18,688 compute nodes, where
each compute node consists of two 2.6 GHz hex-core AMD
Opteron processors and 16 GB of memory. Nodes are connected
via SeaStar2+ routers in a 3D torus topology. Each node
has a SeaStar router, and each router
provides six network links connected to six neighbor nodes.
The peak bidirectional bandwidth of each link is 9.6 GB/s
with a sustained bandwidth of over 6 GB/s. Apart from
the compute nodes, there are two types of service nodes: I/O
nodes, which connect to the storage system, and login nodes,
for compiling and launching jobs. In Jaguar XT5, there are
192 I/O nodes connected to a center-wide shared file system
named Spider.
Spider [10] is a Lustre-based storage cluster of 96 DDN
S2A9900 RAID controllers with an aggregate bandwidth of
240 GB/s and over 10 petabytes of storage from 13,440 1-
terabyte SATA drives. Each controller has 10 SAS channels
through which the backend disk drives are connected, and
the drives are RAID 6 formatted in an 8+2 configuration.
Each controller has two dual-port 4x DDR IB HCAs for host-side
connectivity. Access to the RAID storage is through the
192 Lustre Object Storage Servers (OSS) connected to the
controllers over InfiniBand. Each OSS is a Dell dual-socket
quad-core server with 16 GB of memory. The compute platforms
connect to the storage infrastructure over a multistage
InfiniBand network, referred to as SION (Scalable I/O Network),
which connects all OLCF compute resources.
For our study, we analyzed the logs from Jaguar XT5.
Apart from the general syslog, the Cray XT5 generates the
console, netwatch, and consumer logs, which are collectively
called the RAS (Reliability, Availability, and Serviceability)
logs. For this preliminary study, we focus on the netwatch
log, which captures errors on the 3D torus interconnect.
We also present a few error messages from the console log
that are representative of application behavior. The job
scheduler’s log, referred to as the apsched log, provides in-
formation on applications that are currently scheduled to
ate points. The frequency of checkpointing usually drives
the I/O demand of an application. The utilization will also
vary depending on how the user performs parallel I/O, e.g.,
whether or not files are shared between processes. Projecting
individual users’ peak bandwidth (or I/O operations per
second) and frequency of file system usage will help provision
system resources more efficiently.
Figure 2: Plot of file system usage observed on a given
day.
[Figure 3 plot. x-axis: Bandwidth GB/s (5 to 50); y-axis: Distribution
P(X<x) (0 to 1).]
Figure 3: CDF of write bandwidth usage derived from
multiple runs of the application (App-1).
In Figure 2, we present typical usage of the Spider file
system over a day. The figure has four subplots, each
presenting the write bandwidth usage observed during
a period of six hours. The write bandwidth usage is sampled
at two-second intervals from every RAID controller
using the manufacturer-provided API, and the aggregate
value across all controllers gives the total file system usage.
From the scheduler’s log, our application of interest (App-1)
was running during the following time periods: 0:25
to 2:42 hrs, 9:30 to 9:56 hrs, 14:53 to 17:31 hrs, and 19:45
to 23:18 hrs. During these periods we see a clear pattern of
high write bandwidth. Though other applications might be
running and using the file system at the same time, it is
possible to estimate the I/O usage of the application
by observing multiple runs of the same application.
The bandwidth metrics captured at the controllers are not
per application, as that would require extensive application
trace information, adding considerable overhead. The objective
is to obtain an estimate of an application’s bandwidth
utilization rather than an exact value, and we consider
only those user-applications that are I/O intensive, in our
case those with a peak write bandwidth of at least 5 GB/s.
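The aggregation and windowing steps just described can be sketched as below. This is only an illustration: the controller API is not modeled, the function names and the toy numbers are hypothetical, and the only values taken from the text are the two-second sampling interval and the 5 GB/s threshold.

```python
def total_bandwidth(per_controller):
    """Sum per-controller samples into a file-system-wide series.

    per_controller: {controller_id: [GB/s at tick 0, tick 1, ...]}, with all
    lists the same length and one tick per sampling interval.
    """
    return [sum(tick) for tick in zip(*per_controller.values())]


def samples_during_runs(series, windows, interval_s=2):
    """Keep samples whose tick time falls inside any [start_s, end_s) run window."""
    return [
        bw
        for i, bw in enumerate(series)
        if any(start <= i * interval_s < end for start, end in windows)
    ]


def is_io_intensive(run_samples, threshold_gbs=5.0):
    """True if the peak over the run windows reaches the threshold."""
    return bool(run_samples) and max(run_samples) >= threshold_gbs


# Two toy controllers, four ticks (t = 0, 2, 4, 6 seconds).
controllers = {"c0": [1.0, 3.0, 4.0, 0.5], "c1": [0.5, 2.5, 3.0, 0.5]}
series = total_bandwidth(controllers)
# The application of interest ran from t = 2 s up to (not including) t = 6 s.
app_samples = samples_during_runs(series, [(2, 6)])
io_heavy = is_io_intensive(app_samples)
```

Because the series is machine-wide, `app_samples` still includes traffic from co-running applications; repeating the selection over many runs is what makes the estimate usable.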
Using the time series data, we plot the Cumulative Distribution
Function (CDF) of the write bandwidth usage by the
application (App-1), as shown in Figure 3. From the plot we
can infer that for more than 20% of the total application runtime
the user is writing at a rate greater than 32 GB/s, with
peaks around 42 GB/s. The CDF plot provides us with an
estimate of the user-application’s I/O needs. A similar pattern
was observable with two other applications we studied,
as shown in Table 3. The table summarizes the I/O usage for
three applications: the peak write bandwidth observed for each
application, and the percentage of the total runtime during
which the application operates at more than 80% and 50% of
the peak write bandwidth. The application (App-1) is a
short-duration routine that generally runs for a few tens of
minutes. However, for longer-running jobs, say 3 hours or more,
it is difficult to capture such I/O behavior by directly
observing the file system usage. In general, users of
long-running applications checkpoint frequently, and
checkpointing is one of the most I/O-consuming tasks. An
autocorrelation over the runtime statistics of
the application will give us the periodic I/O usage pattern,
or checkpointing pattern, of the application. This is evident
in Figure 2: between 10:00 and 14:30 hrs, we see a 10 GB/s
peak repeating approximately every 40 minutes, and a 20 GB/s
peak at a different frequency.
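As a rough illustration of how an autocorrelation exposes a checkpointing period, the sketch below computes the sample autocorrelation of a bandwidth series and picks the lag with the highest value. The series is synthetic and the function names are hypothetical. With two-second samples, a 40-minute period would correspond to a lag of 1200 samples; the toy series uses a period of 40 samples to stay small.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var


def dominant_period(series, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the highest autocorrelation."""
    return max(range(min_lag, max_lag + 1), key=lambda k: autocorrelation(series, k))


# Synthetic series: a 10 GB/s burst every 40 samples on a 1 GB/s baseline.
series = [10.0 if i % 40 == 0 else 1.0 for i in range(400)]
period = dominant_period(series, min_lag=2, max_lag=100)
```

A second periodic component, such as the 20 GB/s peak at a different frequency, would appear as a further local maximum at its own lag.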
Applications            App-1     App-2     App-3
Observed peak (GB/s)    38-42     12-15     22-35
Runtime >80% of peak    18-20%    6-5%      4-5%
Runtime >50% of peak    38-42%    12-16%    20-25%
Table 3: Application I/O usage
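The rows of Table 3 can be computed from the bandwidth samples attributed to an application. The following sketch shows one way to do so, assuming a flat list of samples in GB/s; the function names and the sample values are synthetic and are not the measured data behind the table.

```python
def empirical_cdf(samples, x):
    """P(X <= x) estimated from the samples."""
    return sum(1 for s in samples if s <= x) / len(samples)


def runtime_share_above(samples, fraction_of_peak):
    """Share of samples exceeding the given fraction of the observed peak."""
    threshold = fraction_of_peak * max(samples)
    return sum(1 for s in samples if s > threshold) / len(samples)


# Synthetic write-bandwidth samples in GB/s (not measured data).
samples = [2, 4, 6, 10, 20, 22, 30, 36, 38, 40]
above_80 = runtime_share_above(samples, 0.8)  # share of runtime above 32 GB/s
above_50 = runtime_share_above(samples, 0.5)  # share of runtime above 20 GB/s
cdf_32 = empirical_cdf(samples, 32)
```

Since samples are taken at a fixed interval, the share of samples above a threshold is also the share of runtime above it.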
In our analysis of 10 applications, we were not able to
determine specific read characteristics of applications. The
read bandwidth of the file system generally tends to be
bursty, reaching peaks anywhere between 30 GB/s
and 90 GB/s and then dropping down to 3-4 GB/s. Also,
most of the applications on Jaguar XT5 tend to be com-
frequent reads from the storage system.
3.3 Other Log Messages
Table 1 lists two console log messages. These messages
are indicative of extended application runtime, as a request
was not acknowledged with a response within a time period.
The BEER (Basic End-to-End Reliability) protocol, which is
specific to the Cray XT, provides a means for reliable
communication between a pair of NIDs (Network Interface
Devices). Interpreting the first error message in Table 1, the
first field is the timestamp, followed by the source node’s ID,
c21-4c1s4n2. The message reads that cpu 9 of the source has
not received a response from cpu 0 of compute node 1848, the
timeout period being 240 seconds. These messages are generally
observed in MPI-intensive jobs.
The second console log message listed is the file system
timeout message. The first field is the timestamp, followed
by the source ID. A request sent by a Lustre client (the source
node) to a specific OST (Object Storage Target) via a NID
has timed out after 600 seconds. The NID, represented as an
IP address, is an InfiniBand router that connects one of the
Jaguar I/O nodes to the InfiniBand network for communication
with the backend storage system. The message implies
that a request sent by the compute node has not been completed
within the maximum timeout period of 600 seconds
and that the request has to be resent. Lustre timeout messages
can occur for multiple reasons, such as a fault at the backend
storage servers or network congestion.
Both the BEER and file system timeout messages are
indicative of extended runtime of the application. These
messages can be used as indicators for understanding the impact
on an application when it runs with a mix of other applications.
These errors are usually non-fatal, and the application
recovers from them. However, an occurrence of these
log events can be indicative of contention for resources within
the compute platform.
4. IMPACT & EXAMPLES
Quantifying the performance of an application is a difficult
task due to its dependence upon the user’s code and
the input deck, as well as outside influences such as network
and I/O congestion. While identifying hardware failures and
correlating their impact to unexpected performance degradation
is reasonably straightforward, it is quite often the
case that a deeper understanding of the application behavior
and system characteristics is required to deduce the reason(s)
for poor performance. For example, if an application
is exhibiting unusually poor performance during its checkpoint
operations, a quick inspection of the storage system
may reveal likely causes, such as a failed RAID
controller or unusually high load on a file system server.
Should no obvious cause be present, one must be able to exploit
knowledge of the application(s) running on the system
at the time of the observed behavior to determine the root
cause.
Our characterization effort provides insight by building
a profile for a user’s application based on system resource
utilization and log events, which can highlight performance
deviation from a ’typical’ run. We present a few preliminary
findings to demonstrate how our approach can be used to
detect performance anomalies and can be extended to inform
job scheduling decisions.
4.1 Scheduling Jobs
Characterizing applications by resource utilization provides
us with an understanding of an individual application’s
demands. However, it is a bigger challenge to quantify how
such demands translate into conflicts over resources when
applications are running on a shared compute platform. To
understand such impacts, we present a use case observed on
Jaguar XT5. During one of the sessions, an application (A1)
was running on more than 11k compute nodes and another,
I/O-intensive application (A2) was running on 2k compute
nodes, among other applications. A1 was heavily using the
interconnect network, as observed in the netwatch
log, and a few deadlock timeouts were also observed. A2 was
not MPI intensive but had a few file system timeout messages,
which were generated by compute nodes running A2.
As explained in the previous section, a file system timeout
message occurs when a node fails to receive a response
within 600 seconds, a delay that by itself should have affected
the runtime of A2. File system timeout messages can also
occur if there is a problem with the backend storage system;
however, no problems were reported during the specific runtime.
The nodes generating the file system messages
were complaining about an I/O router that was in the midst of
the nodes allocated to A1. As mentioned earlier, the Jaguar
XT5 has 192 I/O nodes that are spread uniformly across the
3D torus and connect to the backend object storage servers.
For a uniform load distribution, nodes performing I/O
send and receive requests to and from the backend storage
servers using all 192 routers in a round-robin fashion. So A2
nodes need to reach the routers via the 3D torus interconnect
network. A2, though only performing I/O, is affected by A1
because both depend on the 3D torus interconnect, which
carries both MPI and I/O communication. The example shows
how two applications sharing the same compute platform
might affect each other’s performance. It
was difficult to quantify the impact of A1 on A2, and it
is also hard to estimate the impact of A2 on A1, if any existed.
However, such analysis and reasoning should help us decide
how to schedule jobs so that they have little or no impact on
one another.
In our data set of four months, one other frequent anomaly
was observed in the netwatch log. Applications that normally
have a relatively low rate of packet squashes reported a higher
rate of packet squashes when they were scheduled to run
concurrently with large jobs that are MPI intensive and heavily
utilize the 3D torus interconnect network. In general, the
affected applications are much smaller jobs, running on
a few hundred nodes or fewer, and their nodes are allocated
in close proximity to the network-intensive
application in the 3D torus. This can be attributed to the
scheduling policy, as smaller jobs are in general given a
lower priority and sometimes get a non-contiguous assignment
of nodes on the 3D torus network.
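One way to surface this pattern from per-run records is to compare each run's packet squash rate against a baseline taken from runs without a large co-running job. The sketch below is hypothetical throughout: the per-node-hour normalization, the record fields `rate` and `with_large_job`, and the factor of 2 are illustrative choices, not definitions taken from the netwatch log.

```python
from statistics import mean


def squash_rate(squash_count, runtime_s, nodes):
    """Packet squashes per node-hour (a hypothetical normalization)."""
    return squash_count / (nodes * runtime_s / 3600.0)


def elevated_runs(runs, factor=2.0):
    """Return runs whose rate exceeds `factor` times the baseline.

    runs: list of dicts with keys 'rate' and 'with_large_job' (bool).
    The baseline is the mean rate over runs without a large co-running job.
    """
    baseline = mean(r["rate"] for r in runs if not r["with_large_job"])
    return [r for r in runs if r["rate"] > factor * baseline]


# Toy records for one small user-application.
runs = [
    {"rate": 1.0, "with_large_job": False},
    {"rate": 1.2, "with_large_job": False},
    {"rate": 4.0, "with_large_job": True},
]
flagged = elevated_runs(runs)
```

Normalizing by node-hours keeps small and large allocations of the same application comparable.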
From the above examples, we learn that applications are
indeed affected by other applications that share the same
compute platform. Deciding which applications to schedule
alongside one another is a problem in itself.
4.2 Anomaly Detection
We define an anomaly as any deviation from the expected
behavior of an application. Anomalous behavior does not
mean faulty behavior. For example, for the application
(App-1) we have estimated the I/O usage, as demonstrated
sive and when the user job is running we expect to observe
I/O activity in the system. However, when such behavior
is not observed from the user, it should be classified as
abnormal behavior. This behavior of the application might
not affect other applications; however, the system resources
dedicated to this application are not being used appropriately,
which affects the system throughput.
In our observation, scientific users have a similar usage
pattern across multiple runs, and any drastic deviation from
previously observed behavior can be classified as an anomaly.
Such anomalous behavior can be the application’s fault or
can be caused by other applications running on the compute
platform. Our characterization approach can provide indicators
of poor or undesirable application performance, which can act
as triggers for further investigation.
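A deviation check of this kind could, for instance, compare the peak write bandwidth of a new run against the peaks recorded for earlier runs of the same user-application. The sketch below assumes a relative tolerance around the historical median; the function name, the tolerance of 0.5, and the history values are hypothetical, and other statistics could serve equally well.

```python
from statistics import median


def is_anomalous(history_peaks, observed_peak, tolerance=0.5):
    """True if the observed peak deviates from the historical median by more
    than `tolerance`, expressed as a fraction of the median."""
    typical = median(history_peaks)
    if typical == 0:
        return observed_peak != 0
    return abs(observed_peak - typical) / typical > tolerance


# Hypothetical peak write bandwidths (GB/s) from earlier runs.
history = [38.0, 40.0, 42.0, 39.0, 41.0]
quiet_run = is_anomalous(history, 3.0)    # an I/O-intensive job that barely writes
normal_run = is_anomalous(history, 37.0)  # within the usual range
```

A flagged run is only a trigger for further investigation, since the deviation may stem from the application itself or from contention with other jobs.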
5. RELATED WORK
Modeling hardware performance and application demands
has been critical towards designing both applications and
next-generation systems, and a number of tools have been
proposed to enable such studies. PAPI (Performance API) [11]
provides an interface for accessing, in real time, the hardware
performance counters available across modern microprocessors.
Monitoring these events helps correlate software performance
to processor events. Correlating software to the
underlying architecture provides a capability for hand tuning,
compiler optimization, debugging, benchmarking, monitoring,
and performance modeling of software stacks. The
operating overhead for PAPI is in the range of 2-4%. CrayPat [2]
is a library for instrumenting and tracing code. The
developers acknowledge that the tool has very large overhead
and should not be used even for moderately large applications,
as the trace files generated are very
large. HPCToolkit [6] provides an integrated set of tools for
analyzing the performance of programs on various system
architectures. At an overhead of 1-5%, the tools scale to
large-scale systems by statistical sampling of hardware
performance counters and timers. As mentioned earlier, these
tools provide fine-grained details at the cost of compute cycles
and bandwidth. Our objective has been to gather such
application characteristics using low-overhead system logs
and metrics on a continuous basis.
Conventionally, log messages have been associated with
temporal data mining techniques, where the frequency and
sequence of events are of interest. Recent studies on HPC
logs have focused on machine learning and statistical methods
for analyzing and detecting system failures. In the
machine-learning paradigm described in [12], the log message
structures are parsed from the source code and a feature
vector is constructed as a sequence of log messages. Using
principal component analysis, deviations of the
run-time log from the predefined vectors are identified, and
any variance is defined as an anomaly. SLCT (Simple Logfile
Clustering Tool) [9] is a data-clustering paradigm for mining
event patterns in log data. It is an apriori algorithm: the first
stage generates a count of all unique words in the log, after
which log messages containing words whose counts exceed a
threshold value are identified and clustered. This method is
based on the assumption that events of interest occur in
bursts and ignores errors with low frequency.
Nodeinfo [4], an entropy-based anomaly detection system,
tags every log message to quantify its importance. Then,
the entropy of every node in the system is quantified based
on the number of occurrences of alerts within a given time
period. It is presumed that all nodes operate similarly, so
the entropy of every node should be the same. Any variance
in entropy is categorized as an alert/anomaly. Similarly,
the models proposed in [5] and [3] group log messages
and use predictive techniques under the assumption that the
logs carry all event information and that events occur in bursts.
Recent works [7, 8] have suggested using system logs to
understand component-level interaction in large-scale systems
and to model the system state by leveraging machine learning
techniques. First, log events or anomalies are correlated to
specific hardware, and then the successive events are mapped
to other components in the system that are affected by the
specific event. This helps in understanding how system
components are interdependent and the cascading effect of
system events.
One of the principal assumptions in analyzing system logs
is that the system supports logging of all events, which may
be impractical for large systems like Jaguar. In petascale
systems, logs in general capture only failure events, which
by themselves generate a few gigabytes of data per day. Our
approach towards using logs for profiling application
characteristics relies on a thorough understanding of the system
architecture and applications. This helps in correlating events
to errors and in understanding the impact of such errors on
system and application performance.
6. CONCLUSION & FUTURE WORK
It has been projected that an exa-scale heterogeneous system
would have around 15-18k compute nodes, and we can
expect that a mix of applications will run concurrently on the
system, as on Jaguar. Fundamentally, large-scale systems are
shared resource environments, and one application will impact
the performance of another application running on the
same platform. We will need to understand how a mix of
applications affects overall application and system throughput.
From our preliminary study, first, we emphasize that
it is user-application characteristics, and not just application
characteristics, that are required for maximizing system
throughput. Second, we suggest using low-overhead performance
logs and metrics for continuous monitoring and profiling
of application characteristics. As a proof of concept,
we have presented evidence of how such low-level metrics
can be used for profiling application behavior. We have also
presented examples where one application affects the
performance of other applications. Extending our work further,
we shall profile a larger number of user-applications
and define methods to capture and profile applications on a
real-time basis.
7. REFERENCES
[1] Totalview.
<http://www.roguewave.com/support/product-
documentation/index.html.aspx>.
[2] Using cray performance analysis tools. Document
S-2474-51, Cray User Documents
<http://docs.cray.com>, 2009.
[3] A. Makanju, A.N. Zincir-Heywood, E. E. Milios.
Clustering event logs using iterative partitioning. In
ACM SIGKDD International conference on
Knowledge discovery and data mining, 2009.
in System Logs. In Proceedings of the International
Conference on Data Mining (ICDM), 2008.
[5] Y. Liang, Y. Zhang, H. Xiong, R. Sahoo. Failure
Prediction in IBM BlueGene/L Event Logs. In
Proceedings of the International Conference on Data
Mining (ICDM), 2007.
[6] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel,
G. Marini, J. Mellor-Crummey, and N. R. Tallent.
HPCToolkit: Tools for performance analysis of
optimized parallel programs. Concurrency and
Computation: Practice and Experience, 22(6):685–701,
2010.
[7] J. Oliner, A. Aiken. A query language for
understanding component interactions in production
systems. In Proceedings of the International
Conference on Supercomputing (ICS), 2010.
[8] J. Oliner, A. Aiken. Online detection of
multi-component interactions in production systems.
In Proceedings of the International Conference on
Dependable Systems and Networks (DSN), 2011.
[9] Risto Vaarandi. A data clustering algorithm for
mining patterns from event logs. In IEEE IPOM’03
Proceedings, 2003.
[10] G. M. Shipman, D. A. Dillow, S. Oral, and F. Wang.
The spider center wide file systems; from concept to
reality. In Cray User Group Conference, 2009.
[11] D. Terpstra, H. Jagode, H. You, and J. Dongarra.
Collecting performance data with PAPI-C. In 3rd
Parallel Tools Workshop, 2009.
[12] W. Xu, L. Huang, A. Fox, D. Patterson, M. Jordan.
Mining Console Logs for Large-Scale System Problem
Detection. In 3rd Workshop on Tackling System
Problems with Machine Learning Techniques (SysML),
2008.



