ULCC: A User-Level Facility for Optimizing Shared Cache Performance on Multicores
- ISBN: 9781450301190
- DOI: 10.1145/1941553.1941568
Abstract
Scientific applications face serious performance challenges on multicore processors, one of which is caused by access contention in last level shared caches from multiple running threads. The contention increases the number of long latency memory accesses, and consequently increases application execution times. Optimizing shared cache performance is critical to significantly reduce execution times of multi-threaded programs on multicores. However, there are two unique problems to be solved before implementing cache optimization techniques on multicores at the user level. First, the available cache space for each running thread in a last level cache is difficult to predict due to access contention in the shared space, which makes cache conscious algorithms for single cores ineffective on multicores. Second, at the user level, programmers are not able to allocate cache space at will to running threads in the shared cache; thus data sets with strong locality may not be allocated sufficient cache space, and cache pollution can easily happen. To address these two critical issues, we have designed ULCC (User Level Cache Control), a software runtime library that enables programmers to explicitly manage and optimize last level cache usage by allocating proper cache space for different data sets of different threads. We have implemented ULCC at the user level based on a page-coloring technique for last level cache usage management. By means of multiple case studies on an Intel multicore processor, we show that with ULCC, scientific applications can achieve significant performance improvements by fully exploiting the benefit of cache optimization algorithms and by partitioning the cache space accordingly to protect frequently reused data sets and to avoid cache pollution. Our experiments with various applications show that ULCC can significantly improve application performance by nearly 40%.
Xiaoning Ding
The Ohio State University
dingxn@cse.ohio-state.edu
Kaibo Wang
The Ohio State University
wangka@cse.ohio-state.edu
Xiaodong Zhang
The Ohio State University
zhang@cse.ohio-state.edu
Categories and Subject Descriptors D.1.3 [Programming Techniques]:
Concurrent Programming—Parallel programming
General Terms Algorithms, Design, Performance
Keywords Multicore, Cache, Scientific Computing
PPoPP’11, February 12–16, 2011, San Antonio, Texas, USA.
Copyright © 2011 ACM 978-1-4503-0119-0/11/02 ... $10.00
Author footnote: Currently working at Intel Labs Pittsburgh.
1. Introduction
Multicore processors have been widely used in all kinds of computing
platforms from laptops to large supercomputers. According to the
recently announced Top 500 supercomputer list, by June 2010, 85% of
these supercomputers have been equipped with quad-core processors
and 5% use processors with six or more cores [27].
However, application programming is facing a new performance
challenge on multicore processors caused by the bottleneck of the
memory system (e.g. [19, 33]). On a multicore processor, the last
level cache space and the bandwidth to access memory are usually
shared and contended for among multiple computing cores. Contention
for the shared cache increases the number of accesses to off-chip
main memory, and contention for memory bandwidth increases the
queuing delay of memory accesses. When the shared cache is managed
inefficiently at runtime, the accumulated contention in the cache
and on the memory bus significantly lengthens execution time.
Optimizing the performance of shared last level caches reduces both
slow memory accesses and memory bandwidth demand, and has therefore
become a critical technique for addressing the performance issue in
multicore processors.
Cache optimization at the user level has been one of the most
effective methods to improve execution performance for scientific
applications on platforms with single-core processors. A large
number of research projects have restructured algorithms and
programs for cache optimization (e.g. [5, 8, 12, 20, 28–31]).
However, cache optimization on multicore processors faces two new
challenges due to architectural changes in the memory hierarchy.
One important factor for cache conscious programming is the
available cache size for given data sets in order to fully utilize the
cache with minimum misses [33]. However, on multicores, due to
access dynamics to shared caches, the available cache space for
each thread can hardly be predicted, particularly when an
application is programmed in an MPMD model (multiple programs co-run
on the same set of processors). Take a blocking algorithm for
linear systems as an example. In such an algorithm, the block size is
an important factor affecting performance, and an unsuitable block
size causes extra cache misses, leading to poor execution
performance. An optimal block size is usually a function of the
available cache space for blocking. On a single-core processor the
last level cache is not shared, and the available space for blocking
is determined by the cache size. However, on a multicore platform,
a last level cache is shared among multiple threads co-running on
multiple cores. How much cache space a thread can occupy is
determined by the dynamic access patterns of the thread and of the
other threads sharing the cache with it. Thus, it is difficult for a
programmer to determine the available cache space for each thread
and to choose an effective block size. In practice, a sub-optimal
block size may be selected, causing mediocre or even poor performance.
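The dependence of block size on available cache space can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: it assumes a blocked kernel that keeps three square tiles of 8-byte elements resident at once (as a tiled matrix multiply would), and the function name `square_block_dim` and the cache sizes are hypothetical.

```python
import math


def square_block_dim(cache_bytes: int, elem_bytes: int = 8, tiles: int = 3) -> int:
    """Largest b such that `tiles` b-by-b tiles of `elem_bytes`-byte
    elements fit in `cache_bytes` (assumption: all tiles are resident)."""
    if cache_bytes <= 0:
        return 0
    # b*b must not exceed the number of elements available per tile.
    return math.isqrt(cache_bytes // (tiles * elem_bytes))


# A thread that owns a whole 4 MiB cache versus one that, because of
# sharing, effectively holds only half of it (illustrative sizes).
whole_cache_dim = square_block_dim(4 * 1024 * 1024)
half_cache_dim = square_block_dim(2 * 1024 * 1024)
print(whole_cache_dim, half_cache_dim)
```

A block dimension tuned for the whole cache overflows the space a thread actually holds once a co-runner occupies part of it, which is the situation the paragraph above describes.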
Another major source of poor performance on a multicore processor
is last level cache pollution, which is a more serious problem than
on a single-core processor because multiple tasks are affected.
Cache pollution is incurred when a thread accesses a sizable data
set with weak locality (i.e. data with infrequent reuses or without
reuses), and consequently replaces data sets with strong locality
(i.e. data with frequent reuses in the cache). For threads running on
single-core processors, because of the private cache architecture, a
thread can only pollute the cache on one core (processor) and cannot
affect threads running on other cores (processors). However, on
a multicore processor, because of the shared cache architecture, a
thread can pollute the whole shared cache and affect all the other
threads sharing the cache. Furthermore, on a multicore processor,
multiple threads may access weak locality data sets simultaneously
and evict strong locality data sets very quickly.
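A toy simulation can illustrate the pollution effect described above. It is a sketch under simplifying assumptions, not a model of any real processor: the cache is fully associative with LRU replacement, capacities are counted in lines, and the names `LRUCache` and `hot_hits` are hypothetical. One thread loops over a small hot set while a second thread streams through lines that are never reused.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with LRU replacement (a simplification)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line) -> bool:
        """Return True on a hit; on a miss, insert the line and evict
        the least recently used line if the cache is full."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[line] = None
        return False


def hot_hits(hot_cache, stream_cache, hot_lines=8, rounds=10) -> int:
    """Interleave a thread looping over `hot_lines` lines with a thread
    streaming through never-reused lines; count the hot thread's hits."""
    hits = 0
    stream_addr = 0
    for _ in range(rounds):
        for h in range(hot_lines):
            hits += hot_cache.access(("hot", h))
            stream_cache.access(("stream", stream_addr))
            stream_addr += 1
    return hits


# Both threads share one 12-line cache: the stream evicts the hot set.
shared = LRUCache(12)
shared_hits = hot_hits(shared, shared)

# Same 12 lines split into an 8-line and a 4-line partition.
partitioned_hits = hot_hits(LRUCache(8), LRUCache(4))
print(shared_hits, partitioned_hits)
```

In this toy setting the hot set would fit in the shared cache on its own, yet it never hits because the streaming thread keeps pushing it out; confining the stream to a small partition restores the reuse.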
Due to the above concerns, it is highly desirable for programmers
to distinguish weak locality data sets from strong locality data
sets and to explicitly specify different space allocation priorities
for them in shared caches on multicore processors. In other words,
a strong locality data set should be protected by allocating it
sufficient space, while a weak locality data set should be confined
by giving it limited space. Unfortunately, programmers lack the
necessary system support to make effective allocation decisions,
even when they are very knowledgeable about the locality strength
of each data set.
To address these two critical issues, we present ULCC (User
Level Cache Control), a software runtime library that enables
programmers to explicitly manage space sharing and contention in last
level caches by making cache allocation decisions based on data
locality strengths. With the functions provided by ULCC, programmers
can hand-tune their programs to optimize the performance of last
level caches on multicores. Unlike database applications or server
applications, whose access patterns are dynamic and sometimes
determined by the distribution of their data or requests, most
scientific applications have regular and consistent access patterns.
Thus, scientific application programmers can determine the sizes
and locality strengths of data sets based on their algorithms in the
programming stage. With this locality information and the support
from ULCC, programmers can decide on and enforce the cache space
allocation their programs need, which facilitates cache optimization
in the programming stage and ensures that strong locality data stay
in the cache during execution.
We make three major contributions in this paper. First, we have
carefully designed ULCC as a runtime library to enable user level
cache control for application programming. We provide a set of
functions in ULCC to support different programming models, such
as MPI, OpenMP, and Pthreads, to tightly couple our ULCC
implementation with commonly used programming interfaces. With
these functions, ULCC allows programmers to manage cache space
allocation flexibly and effectively while hiding most of the
complexity of the cache structure on multicore architectures and
the ULCC implementation details. Thus, programmers can focus on
analyzing their algorithms and planning optimal cache space
allocation, while ULCC focuses on helping users make full use of
the cache space with the least overhead. Second, we have implemented
a prototype of ULCC at the user level based on operating system
support. Though ULCC relies on the page coloring technique [15], it
does not require OS kernel modifications. This makes ULCC highly
portable. Finally, we have tested ULCC with extensive experiments
as to its effectiveness in improving the performance of scientific
programs. We have also evaluated the overhead of ULCC. Our
experiments show that ULCC can effectively and significantly
improve execution performance with negligible overhead.
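For readers unfamiliar with page coloring, the following sketch shows the arithmetic the technique generally rests on. It is not ULCC's interface: the function names are hypothetical, and the example parameters (a 4 MiB, 16-way cache with 4 KiB pages) are assumptions chosen for illustration. It also assumes a physically indexed cache whose set-index bits overlap the page frame number.

```python
def num_page_colors(cache_bytes: int, associativity: int,
                    page_bytes: int = 4096) -> int:
    """Number of page colors: the size of one cache way divided by the
    page size (assumes a physically indexed, set-associative cache)."""
    way_bytes = cache_bytes // associativity
    return max(1, way_bytes // page_bytes)


def page_color(phys_addr: int, colors: int, page_bytes: int = 4096) -> int:
    """Color of the physical page containing `phys_addr`, taken from the
    low-order bits of its page frame number."""
    return (phys_addr // page_bytes) % colors


# Illustrative parameters, not measurements from the paper.
colors = num_page_colors(4 * 1024 * 1024, 16)
print(colors, page_color(0x12345000, colors))
```

Pages of the same color map to the same group of cache sets, so restricting a data set to pages of a few colors bounds the cache space it can occupy.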
The remainder of the paper is organized as follows. In Section 2,
we introduce the motivation and the background information of
ULCC. Then in Section 3 we present the overall structure of ULCC,
the design of its key components, and the implementation based
on the Linux system. We present our experience with ULCC in
Section 4. Finally, we discuss related work in Section 5, and
conclude the paper in Section 6.
2. Motivation and Background
In this section, with a motivating example, we illustrate the
challenges a programmer may encounter in cache optimization. We
show how ULCC works to help the programmer address the
challenges, and explain the underlying techniques ULCC relies on to
achieve this goal.
[Figure 1 diagram. Labels: Thread 0, Thread 1, Thread 2, Thread 3; stages: Bucketing, Sorting blocks in a bucket w/ merge sort, Multi-way merge.]
Figure 1. An application sorting a large array with multiple
threads
The example program sorts elements in a large array with four
threads in parallel on a quad-core processor. The program first
rearranges the elements into a number of buckets according to
the values of their keys in a way similar to bucket sort. Then it
sorts the elements in every bucket with merge sort. To improve
cache efficiency, optimizations including blocking and multi-way
merging are applied as suggested in [13]. For each bucket, a thread
first partitions the elements into blocks. Then it sorts the blocks
one by one with merge sort. When a thread sorts a block, it uses a
sorting buffer to store the intermediate results of merge sort. The
buffer has the same size as the block size, and is reused for sorting
different blocks. The block size should be adjusted to guarantee
that sorting each block does not incur extra memory accesses after
the block has been loaded into the last level cache. After all the
blocks have been sorted, the thread merges the sorted blocks in
one pass with a multi-way merge by constructing a full binary
tree structure. Therefore, after the buckets are ready, each thread
repeatedly selects an unsorted bucket, sorts the blocks in it, and
merges the sorted blocks, as illustrated in Figure 1.
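The per-bucket procedure described above can be sketched in a few lines. This is a single-threaded illustration under stated assumptions, not the program used in the paper: Python's built-in `sorted` stands in for merge sort with a sorting buffer, `heapq.merge` performs the one-pass multi-way merge with a heap rather than the full binary tree mentioned in the text, and the function names and parameters are hypothetical.

```python
import heapq


def sort_bucket(bucket, block_elems):
    """Sort one bucket: sort fixed-size blocks one by one, then merge
    all sorted blocks in a single multi-way pass."""
    if block_elems <= 0:
        raise ValueError("block_elems must be positive")
    blocks = [sorted(bucket[i:i + block_elems])
              for i in range(0, len(bucket), block_elems)]
    return list(heapq.merge(*blocks))


def bucket_sort(data, num_buckets, block_elems):
    """Distribute elements into buckets by key range, then sort each
    bucket with sort_bucket and concatenate the results."""
    if not data:
        return []
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_buckets or 1  # avoid zero width when all keys are equal
    buckets = [[] for _ in range(num_buckets)]
    for x in data:
        idx = min(int((x - lo) / width), num_buckets - 1)
        buckets[idx].append(x)
    result = []
    for bucket in buckets:
        result.extend(sort_bucket(bucket, block_elems))
    return result


data = [5, 3, 9, 1, 7, 2, 8, 6, 4, 0]
print(bucket_sort(data, num_buckets=3, block_elems=2))
```

In the parallel program each thread would run the per-bucket step on a different bucket, and `block_elems` is the parameter that has to be matched to the cache space the thread can actually use.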
When the program runs on a quad-core Intel Xeon X5355 processor,
in which there are two pairs of cores and the cores in each pair
share an L2 cache, the interference caused by cache contention and
cache pollution can significantly slow down its execution. Cache pollution
happens when one thread is sorting blocks and the other thread
sharing the same L2 cache with it is merging sorted blocks. Most of
the data accessed by merging, including the sorted blocks and the
buffer saving the final results, will not be reused. Accessing them
means loading them into the L2 cache and evicting the to-be-reused
data, e.g. the block being sorted and the sorting buffer.
Cache space contention happens when both threads sharing the
same L2 cache work on sorting blocks. For the threads sharing the
same L2 cache, if the aggregated size of their sorting buffers and
the blocks they are sorting exceeds the L2 cache size, severe cache
contention will occur. To quickly sort the elements in each block,
the total size of the blocks being sorted and sorting buffers should
fit into the last level cache. However, cache contention still happens
when a thread finishes sorting a block and starts to work on another