FACT: a Framework for Adaptive Contention-aware
Thread Migrations
Kishore Kumar Pusukuri
University of California, Riverside, USA.
kishore@cs.ucr.edu

David Vengerov
Oracle Corporation, Menlo Park, USA.
david.vengerov@oracle.com

Alexandra Fedorova
Simon Fraser University, Vancouver, Canada.
fedorova@cs.sfu.ca

Vana Kalogeraki
Athens University of Economics and Business, Greece.
vana@aueb.gr

ABSTRACT
Thread scheduling in multi-core systems is a challenging problem
because cores on a single chip usually share parts of the memory
hierarchy, such as last-level caches, prefetchers and memory
controllers, so threads running on different cores interfere with
each other while competing for these resources. Data center service
providers are interested in compressing the workload onto as few
computing units as possible so as to utilize their resources most
efficiently and conserve power. However, because memory hierarchy
interference between threads is not managed by commercial operating
systems, data center operators still prefer running threads on
different chips so as to avoid possible performance degradation due
to interference.
In this work, we improved the system’s throughput by minimizing
inter-workload contention for memory hierarchy resources. We
achieved this by implementing FACT, a Framework for Adaptive
Contention-aware Thread migrations, which measures the relevant
performance monitoring events online, learns to predict the effects
of interference on the performance of workloads, and then makes
optimal thread scheduling decisions. We found that when instantiated
with a fuzzy rule-based (FRB) predictive model, FACT achieves on
average a 74% prediction accuracy on new data. In experiments
conducted on a quad-core machine running OpenSolaris™,
SPECcpu2006 workloads under FACT-FRB ran up to 11.6% faster than
under the default OpenSolaris scheduler. FACT-FRB was also able
to find the best combination of workloads more consistently than
the state-of-the-art algorithms that aim to minimize contention for
memory resources on each chip. Unlike these algorithms, which are
based on fixed heuristics, FACT can be easily adapted to consider
other performance factors so as to accommodate changes in
architectural features and performance bottlenecks in future systems.
The research of Vana Kalogeraki has been supported by a gift from
SUN, by the European Union through the Marie-Curie RTD (IRG-231038)
project, and by AUEB through a PEVE project.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CF’11, May 3–5, 2011, Ischia, Italy.
Copyright 2011 ACM 978-1-4503-0698-0/11/05 ...$10.00.
Categories and Subject Descriptors
D.4.1 [Process Management]: Scheduling
General Terms
Performance, Measurement, Algorithms
Keywords
Multicore, Scheduling, Operating Systems, Supervised Learning
1 Introduction
In recent years, the demand for computational power has increased,
driven by modern user needs and by complex scientific problems such
as genome analysis, molecular analysis, weather prediction, and
video encoding. In the quest for greater computational power,
hardware engineers are building multi-core systems because
multi-core processors offer significantly greater parallelism and
performance than uniprocessor systems. However, multi-core
architectures bring new challenges to the system software
(compilers, OS). For example, applications running on different
cores may require efficient inter-process communication mechanisms,
a shared-memory data infrastructure, and synchronization primitives
to protect shared resources. Efficient code migration is also a
challenging problem in such systems [7].
Modern operating systems (OS) do not effectively account for the
fact that cores often share resources within the memory hierarchy,
such as caches, prefetchers, the memory bus, and memory controllers.
As a result, the OS may fail to fully exploit the capabilities of
the system. Operating systems must therefore evolve in order to
fully support multi-core systems with appropriate process
scheduling and memory management techniques. There are two
important challenges [3] that need to be addressed: 1) the OS needs
to understand how different workloads utilize resources within the
memory hierarchy and the effect of resource sharing on the
performance of all workloads; 2) the OS needs an efficient way to
migrate workloads between the cores so that workloads do not hurt
each other’s performance when sharing such resources.
As a step toward addressing these challenges, we developed
a Framework for Adaptive Contention-aware Thread migrations
(FACT), which is based on machine learning (more specifically,
supervised learning) techniques. FACT trains a statistical model
to predict contention for memory resources between threads based
on performance monitoring events recorded by the OS (see [28,
2] for a list of possible events). The trained model is then used
to dynamically schedule together threads that interfere with each
other’s performance as little as possible. This paper explains the
development and evaluation of FACT.
Instructions per cycle (IPC) is a standard measure of the
performance of a workload. The basic idea behind FACT is as
follows: it considers some number of possible thread migrations, and
for each migration it predicts the change in the IPC of all threads
affected by the migration. Predictions are made using a performance
model that is learned online from hardware performance counter
measurements taken in real time. If some migrations are predicted
to increase the average IPC of all affected threads, then the
migration resulting in the largest IPC increase is performed;
otherwise, no migration takes place in this cycle, and the next
cycle of observing the relevant performance monitoring events begins.
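
The decision step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the FACT implementation: the function name `choose_migration`, the callback `predict_avg_ipc_gain`, and the toy gain values are all hypothetical, and the learned performance model is replaced by a lookup table.

```python
from typing import Callable, Hashable, Iterable, Optional


def choose_migration(
    candidates: Iterable[Hashable],
    predict_avg_ipc_gain: Callable[[Hashable], float],
) -> Optional[Hashable]:
    """Return the candidate with the largest predicted IPC gain.

    Returns None when no candidate is predicted to raise the average
    IPC of the affected threads, i.e. no migration in this cycle.
    """
    best, best_gain = None, 0.0
    for migration in candidates:
        # Hypothetical model call: predicted change in average IPC.
        gain = predict_avg_ipc_gain(migration)
        if gain > best_gain:
            best, best_gain = migration, gain
    return best


# Toy stand-in for the learned model (made-up numbers).
toy_gains = {
    "swap T1<->T2": 0.02,
    "swap T2<->T3": 0.05,
    "swap T3<->T4": -0.03,
}
decision = choose_migration(toy_gains, toy_gains.__getitem__)
```

With these toy values the second swap is selected; if every predicted gain were zero or negative, the function would return `None` and the observation cycle would simply start again.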
We implemented FACT on a quad-core Intel x86 based server
machine running OpenSolaris (2009.06). OpenSolaris libcpc (3LIB) [29]
and libpctx (3LIB) [30] interfaces were used to read and program
performance monitoring counters. We developed a variety of
statistical models for predicting the IPC of co-scheduled workloads:
linear regression (LR) models [33], fuzzy rule-based (FRB) models
[21], decision tree (Rpart) models [40], and k-nearest neighbor
(k-nn) models [27]. The FRB model had the highest prediction
accuracy, 74%, in our experiments and was chosen as the preferred
statistical model to be used inside the FACT framework. Our
experimental results demonstrate that the FACT-FRB combination
resulted in a speedup of up to 11.6% on the considered
SPECcpu2006 [38] benchmarks.
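
To make the idea of predicting IPC from performance monitoring events concrete, the following is a minimal k-nearest neighbor regression sketch, the simplest of the model families listed above. It is not the authors' model (the preferred model is FRB, which is not shown here). The two-element feature vectors and all numbers are made-up placeholders, and the name `knn_predict_ipc` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (counter features, observed IPC)


def knn_predict_ipc(train: List[Sample], query: Sequence[float], k: int = 3) -> float:
    """Predict IPC as the mean IPC of the k training samples nearest to query."""
    if not train:
        raise ValueError("training set is empty")
    # Rank training samples by Euclidean distance to the query features.
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return sum(ipc for _, ipc in nearest) / len(nearest)


# Made-up samples: two hypothetical counter-derived features per sample.
toy_train = [
    ((0.0, 0.0), 1.0),
    ((1.0, 1.0), 2.0),
    ((10.0, 10.0), 5.0),
]
predicted = knn_predict_ipc(toy_train, (0.1, 0.1), k=2)
```

In practice the features would be scaled before computing distances, since raw counter values can differ by orders of magnitude.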
When comparing FACT to another state-of-the-art contention
management scheduling algorithm [4], we found that FACT
outperforms it by finding the best combination of workloads more
consistently. Moreover, that algorithm is based on a fixed
heuristic that works well only when the whole complexity of
contention for shared resources can be captured with a single value,
while FACT can dynamically learn relationships between any number of
performance-affecting factors on any target architecture. Therefore,
we believe FACT will be better able to evolve with changes in
processor architecture and in the resulting performance bottlenecks.
In addition, the overhead of FACT is negligible.
This paper focuses on single-threaded workloads, and therefore
the term workload will always mean a single-threaded workload
unless explicitly stated otherwise.
The following are the main contributions of this work:
• Identified the best predictors (based on the cache usage data)
  for predicting the IPC of co-scheduled workloads by fully
  exploiting performance monitoring events.
• Developed several statistical models that use performance
  monitoring events as inputs and identified the best model for
  predicting the IPC.
• Developed adaptive migration techniques to reduce the total
  running times of workloads.
• Showed the usage of supervised learning techniques to capture
  the complexity of modern multi-core systems.
• Provided the possibility of extending FACT for predicting any
  thread resource usage characteristics (performance, power,
  etc.) by adding the relevant performance monitoring events
  (such as those reflecting the usage of CPU, main memory,
  disk, I/O, etc.).
The remainder of this paper is organized as follows: Section 2
gives the motivation and Section 3 gives a high-level description
of the FACT framework. The process of developing a statistical
model for predicting the IPC of co-scheduled workloads is described
in Section 4. Section 5 describes the implementation of FACT using
the OpenSolaris libraries libpctx and libcpc. Section 6 presents
evaluation results for the FACT framework and its comparison with
a state-of-the-art thread allocation algorithm. Finally, related
work is described in Section 7, and conclusions and future work
are discussed in Section 8.
2 Motivation and Background
Multi-core processors allow running multiple threads
simultaneously, but contention for caches and other shared
resources, such as memory controllers and prefetchers, becomes an
obstacle in achieving a proportional improvement in the system’s
throughput.
Previous work observed that on modern Intel and AMD systems
the degree of contention for shared resources can be explained by the
relative memory-intensity of threads that share resources [4]. In that
work, threads were classified as memory-intensive or CPU-intensive.
Memory-intensity was approximated by a thread’s relative misses
per instruction (MPI): memory-intensive threads have a higher MPI
than CPU-intensive threads. That work found that in order to
significantly reduce contention for shared resources, the scheduler
must run memory-intensive threads (those with high MPI) on
separate core groups¹, thus preventing them from competing for
shared resources.
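
The MPI heuristic can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the algorithm of [4]: the function names, the round-robin placement, and the counter values are all hypothetical. It ranks threads by MPI and deals them out across core groups so that the most memory-intensive threads land in different groups.

```python
from typing import Dict, List, Tuple


def misses_per_instruction(misses: int, instructions: int) -> float:
    """MPI = cache misses / retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return misses / instructions


def spread_by_mpi(counters: Dict[str, Tuple[int, int]], num_groups: int) -> List[List[str]]:
    """Deal threads round-robin over core groups in order of decreasing MPI."""
    ranked = sorted(
        counters,
        key=lambda t: misses_per_instruction(*counters[t]),
        reverse=True,
    )
    groups: List[List[str]] = [[] for _ in range(num_groups)]
    for i, thread in enumerate(ranked):
        groups[i % num_groups].append(thread)
    return groups


# Made-up (misses, instructions) samples: T1 and T3 are memory-intensive.
toy_counters = {
    "T1": (20_000, 1_000_000),
    "T2": (1_000, 1_000_000),
    "T3": (15_000, 1_000_000),
    "T4": (500, 1_000_000),
}
placement = spread_by_mpi(toy_counters, num_groups=2)
```

With these toy counters, T1 and T3 end up in different core groups, each paired with one CPU-intensive thread.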
Even though the MPI heuristic, used to determine memory-intensity,
does not directly capture how much a thread would suffer from cache
contention alone (for instance, a streaming application would have
a high MPI but no cache reuse and so it will not suffer from cache
interference), it still turned out to be a rather good heuristic for
predicting overall memory hierarchy resource contention, because on
modern systems the contention is dominated by memory controller
or memory bus contention, and MPI is a good approximation of how
intensely an application uses these resources [4].
As will be shown in Section 6, the degree of thread interference
cannot be predicted accurately by using MPI alone, and our
proposed FACT framework addresses the shortcomings of
single-heuristic algorithms that use only MPI by considering other
factors as well. However, for the sake of providing sufficient
background on the state of the art, we first show why simple
MPI-based algorithms do significantly better than naïve schedulers
that ignore contention for memory hierarchy resources.
Figure 1: A typical quad-core system. (Diagram: Core 0 and Core 2
share one L2 cache and run T1 and T3; Core 1 and Core 3 share the
other L2 cache and run T4 and T2.)
Consider the following experiment. Figure 1 shows a typical
quad-core system in which core-0 and core-2 share one last-level
cache and other associated memory resources (a prefetcher and a
memory controller on our experimental system), and core-1 and core-3
share another last-level cache and the associated set of memory
resources. Let’s assume that threads T1 and T3 are memory-intensive
and threads T2 and T4 are CPU-intensive. We will next show that
keeping the two memory-intensive threads (T1 and T3) on the same
core group results in a performance degradation for both T1 and T3,
and as a result degrades the overall system throughput.
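
For the setup above, four threads can be split over two two-core groups in only three distinct ways, provided the two groups are treated as interchangeable (an assumption of this sketch). The following Python sketch enumerates them and flags the one that co-locates the memory-intensive threads. The function name `core_group_pairings` is hypothetical.

```python
from typing import List, Sequence, Tuple

Pairing = Tuple[Tuple[str, str], Tuple[str, str]]


def core_group_pairings(threads: Sequence[str]) -> List[Pairing]:
    """All ways to split four distinct threads into two unordered pairs."""
    if len(set(threads)) != 4:
        raise ValueError("expected exactly four distinct threads")
    first = threads[0]
    pairings = []
    for partner in threads[1:]:
        # The two threads not chosen form the other core group.
        rest = tuple(t for t in threads if t not in (first, partner))
        pairings.append(((first, partner), (rest[0], rest[1])))
    return pairings


memory_intensive = {"T1", "T3"}  # assumption taken from the text above
all_pairings = core_group_pairings(["T1", "T2", "T3", "T4"])

# Pairings in which both memory-intensive threads share a core group.
contended = [
    p for p in all_pairings
    if any(set(group) == memory_intensive for group in p)
]
```

Only one of the three pairings, (T1, T3) with (T2, T4), places both memory-intensive threads in the same core group; the other two separate them.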
¹ A core group is a group of cores sharing various resources, such
as caches, prefetchers, memory controllers, system request queue
controllers, etc.
