Cache Organization and Memory Management of the Intel Nehalem Computer Architecture
Available from rolfed.com
Trent Rolf
University of Utah Computer Engineering
CS 6810 Final Project
December 2009
Abstract - Intel is now shipping microprocessors using their
new architecture codenamed Nehalem as a successor to the
Core architecture. This design uses multiple cores like its
predecessor, but claims to improve the utilization and communication
between the individual cores. This is primarily accomplished
through better memory management and cache organization.
Some benchmarking and research has been performed on the
Nehalem architecture to analyze the cache and memory
improvements. In this paper I take a closer look at these studies to
determine if the performance gains are significant.
I. INTRODUCTION
The predecessor to Nehalem, Intel’s Core architecture, made
use of multiple cores on a single die to improve performance
over traditional single-core architectures. But as more cores
and processors were added to a high-performance system,
some serious weaknesses and bandwidth bottlenecks began to
appear.
After the initial generation of dual-core Core processors,
Intel introduced the Core 2 series, which amounted to little
more than combining two or more dual-core dies. The cores
communicated via system memory, which caused large delays
due to limited bandwidth on the processor bus [5]. Adding
more cores increased the burden on the processor and memory
buses, which diminished the performance gains that would
otherwise be possible with more cores.
The new Nehalem architecture sought to improve core-to-
core communication by establishing a point-to-point topology
in which microprocessor cores can communicate directly with
one another and have more direct access to system memory.
II. OVERVIEW OF NEHALEM
A. Architectural Approach
The approach to the Nehalem architecture is more modular
than the Core architecture, which makes it much more flexible
and customizable to the application. The architecture consists
of only a few basic building blocks. The main blocks
are a microprocessor core (with its own L2 cache), a shared
L3 cache, a Quick Path Interconnect (QPI) bus controller, an
integrated memory controller (IMC), and a graphics core.
With this flexible architecture, the blocks can be configured
to meet what the market demands. For example, the Bloomfield
model, which is intended for a performance desktop
application, has four cores, an L3 cache, one memory controller,
and one QPI bus controller. Server microprocessors like the
Beckton model can have eight cores and four QPI bus
controllers [5]. The architecture allows the cores to communicate
very effectively in either case. The specifics of the memory
organization are described in detail later.
Figure 1 is an example of an eight-core Nehalem processor
with two QPI bus controllers. This is the configuration of the
processor used in [1].
Fig. 1. Eight-core Nehalem Processor [1]
B. Branch Prediction
Another significant improvement in the Nehalem
microarchitecture involves branch prediction. For the Core
architecture, Intel designed what they call a Loop Stream Detector,
which detects loops in code execution and saves the
instructions in a special buffer so they do not need to be
continually fetched from cache. This increased branch prediction
success for loops in the code and improved performance. Intel
engineers took the concept even further with the Nehalem
architecture by placing the Loop Stream Detector after the
decode stage, which eliminates the instruction decode from a
loop iteration and saves CPU cycles.
C. Out-of-order Execution
Out-of-order execution also greatly increases the performance
of the Nehalem architecture. This feature allows the
processor to fill pipeline stalls with useful instructions so
the pipeline efficiency is maximized. Out-of-order execution
was present in the Core architecture, but in the Nehalem
architecture the reorder buffer has been greatly enlarged to
allow more instructions to be ready for immediate execution.
D. Instruction Set
Intel also added seven new instructions to the instruction set.
These are single-instruction, multiple-data (SIMD) instructions
that take advantage of data-level parallelism for today’s
data-intensive applications (like multimedia). Intel refers to the new
instructions as Applications Targeted Accelerators (ATA) due
to their specialized nature. For example, a few instructions
are used explicitly for efficient text processing such as XML
parsing. Another instruction is used just for calculating
checksums.
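To give a sense of the work that a dedicated checksum instruction replaces, the following is a minimal software sketch of a CRC-32C (Castagnoli) checksum, the polynomial that the SSE4.2 CRC32 instruction is documented to use. The function name crc32c is hypothetical, and the bit-at-a-time loop is only an illustration of the computation, not a description of how the hardware implements it.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli) in reflected form.

    Illustrative only: a hardware checksum instruction folds this
    per-bit loop into a single operation per data chunk.
    """
    poly = 0x82F63B78          # reflected Castagnoli polynomial
    crc = 0xFFFFFFFF           # standard initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):     # process one bit at a time
            if crc & 1:
                crc = (crc >> 1) ^ poly
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF    # standard final inversion


print(hex(crc32c(b"123456789")))  # standard CRC check string
```

The inner loop runs eight times per input byte, which is the kind of repetitive bit manipulation that motivates a specialized instruction.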
E. Power Management
For past architectures Intel has used a single power
management circuit to adjust voltage and clock frequencies even
on a die with multiple cores. With many cores, this strategy
becomes wasteful because the load across cores is rarely
uniform. Looking forward to a more scalable power management
strategy, Intel engineers decided to put yet another processing
unit on the die called the Power Control Unit (PCU).
The PCU firmware is much more flexible and capable
than the dedicated hardware circuit on previous architectures.
Figure 2 shows how the PCU interacts with the cores. It uses
sensors to read temperature, voltage, and current across all
cores in the system and adjusts the clock frequency and supply
voltage accordingly. This enables the cores to get exactly what
they need, including putting a core to sleep if it is not being
used at all.
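The per-core decision described above can be sketched as a small policy function. This is a toy model under stated assumptions, not Intel's firmware: the names CoreReading, P_STATES, and choose_state are hypothetical, the frequency and voltage pairs and thresholds are invented for illustration, and a utilization figure stands in for the sensor data the real PCU reads.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CoreReading:
    utilization: float    # 0.0 to 1.0, hypothetical load metric
    temperature_c: float  # hypothetical per-core temperature sensor


# Hypothetical operating points: (frequency in MHz, supply voltage in V).
P_STATES = [(1600, 0.85), (2133, 0.95), (2667, 1.05), (3200, 1.15)]
MAX_TEMP_C = 90.0        # assumed thermal limit
IDLE_THRESHOLD = 0.01    # assumed "not being used at all" cutoff


def choose_state(reading: CoreReading) -> Optional[Tuple[int, float]]:
    """Pick an operating point for one core; None means put it to sleep."""
    if reading.utilization < IDLE_THRESHOLD:
        return None
    # Scale the operating point with load, clamped to the table size.
    index = min(int(reading.utilization * len(P_STATES)), len(P_STATES) - 1)
    # Step down one operating point if the core is too hot.
    if reading.temperature_c >= MAX_TEMP_C and index > 0:
        index -= 1
    return P_STATES[index]


readings = [CoreReading(0.0, 40.0), CoreReading(0.3, 55.0), CoreReading(1.0, 95.0)]
for core_id, reading in enumerate(readings):
    print(core_id, choose_state(reading))
```

The point of the sketch is that each core is evaluated independently, so an idle core can sleep while a busy neighbor runs at a high operating point.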
While these and other features contribute to the performance
and efficiency of a Nehalem processor, the remainder of
this paper will focus on the cache organization, memory
architecture, and communication between cores.
III. CACHE AND MEMORY SPECIFICS
A. Translation Lookaside Buffer
The translation lookaside buffer (TLB) plays a critical role
in the cache performance. It is a high-speed buffer that holds
recent mappings from virtual addresses to physical addresses.
When the mapping for a page of memory is present in the TLB,
the address is translated quickly and the access can proceed to
the cache without delay. When the TLB is too small, misses occur
more frequently. The TLB in the Nehalem architecture is much
larger than in previous architectures, which allows many more
memory page references to remain in the TLB.
In addition, Intel made the TLB dual-level by adding an
L2 TLB. The second-level TLB is larger than the first level
and can store up to 512 entries [5]. The gains from the TLB
changes are significant, but the most dramatic improvements
come from the changes to the overall cache-memory layout.
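The dual-level lookup can be sketched as a small simulation. This is a minimal model under stated assumptions: the class names LruTable and TwoLevelTlb are hypothetical, the 64-entry first level, the 4KB page size, and the LRU replacement policy are assumptions made for illustration, and only the 512-entry second level comes from the text above.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4KB pages


class LruTable:
    """Fixed-capacity map from virtual page number to physical frame number."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)     # mark as most recently used
            return self.entries[vpn]
        return None

    def put(self, vpn, pfn):
        self.entries[vpn] = pfn
        self.entries.move_to_end(vpn)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


class TwoLevelTlb:
    def __init__(self, page_table, l1_entries=64, l2_entries=512):
        self.page_table = page_table          # stands in for the page walk
        self.l1 = LruTable(l1_entries)
        self.l2 = LruTable(l2_entries)
        self.stats = {"l1_hit": 0, "l2_hit": 0, "walk": 0}

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        pfn = self.l1.get(vpn)
        if pfn is not None:
            self.stats["l1_hit"] += 1
        else:
            pfn = self.l2.get(vpn)
            if pfn is not None:
                self.stats["l2_hit"] += 1     # missed L1, caught by L2
            else:
                self.stats["walk"] += 1       # slow path: consult page table
                pfn = self.page_table[vpn]
                self.l2.put(vpn, pfn)
            self.l1.put(vpn, pfn)
        return pfn * PAGE_SIZE + offset


page_table = {vpn: vpn + 1000 for vpn in range(1024)}
tlb = TwoLevelTlb(page_table)
for vaddr in (0, 4096, 8192, 5):
    tlb.translate(vaddr)
print(tlb.stats)
```

Shrinking l1_entries in this model shows the role of the second level: translations evicted from the small first level are still found in the larger one, so fewer accesses fall through to the slow page-table path.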
B. Cache and Cache Coherency
In the Core architecture, each pair of cores shared an L2
cache. This allowed the two cores to communicate efficiently
with each other, but as more cores were added it proved
difficult to implement efficient communication with more pairs
of cores. For the Nehalem architecture each core has its own
L2 cache of 256KB. Although this is smaller than the L2 cache
of the Core architecture, it has lower latency, allowing for faster
L2 cache performance.
Fig. 2. Power Control Unit (PCU) in a Multi-core Nehalem Architecture [5]
Nehalem does still have a shared cache, though, implemented
as the L3 cache. This cache is shared among all cores and is
relatively large. For example, a quad-core Nehalem processor
has an 8MB L3 cache. This cache is inclusive, meaning that
it duplicates all data stored in each individual L1 and L2 cache.
This duplication greatly adds to the inter-core communication
efficiency because any given core does not have to locate data
in another core’s cache. If the requested data is not found
in any level of the requesting core’s cache hierarchy, including
the shared L3, the core knows the data is also not present in
any other core’s cache.
To ensure coherency across all caches, the L3 cache has
additional flags that keep track of which core the data came
from. If the data is modified in L3 cache, then the L3 cache
knows if the data came from a different core than last time,
and that the data in the first core needs its L1/L2 values
updated with the new data. This greatly reduces the amount
of traditional snooping coherency traffic between cores.
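The effect of inclusion and the per-line core flags on snoop traffic can be sketched as follows. This is a simplified model, not the actual hardware design: the class InclusiveL3 and its methods are hypothetical names, the flags are modeled as a set of core IDs per line address, and a write is assumed to leave the writing core as the only holder.

```python
class InclusiveL3:
    """Toy model of an inclusive shared cache with per-line core flags."""

    def __init__(self):
        # line address -> set of core IDs whose L1/L2 may hold the line
        self.lines = {}

    def fill(self, address, core):
        """A core loads a line into its private caches; the L3 records it."""
        self.lines.setdefault(address, set()).add(core)

    def cores_to_snoop(self, address, requester):
        """Return the cores whose private caches must be checked."""
        holders = self.lines.get(address)
        if holders is None:
            # L3 miss: inclusion guarantees no core holds a private copy,
            # so no snoop is needed at all.
            return set()
        return holders - {requester}

    def write(self, address, writer):
        """Record a write; return the cores whose copies are now stale."""
        stale = self.cores_to_snoop(address, writer)
        self.lines[address] = {writer}
        return stale


l3 = InclusiveL3()
l3.fill(0x40, core=0)
l3.fill(0x40, core=2)
print(l3.cores_to_snoop(0x40, requester=1))
print(l3.write(0x40, writer=2))
```

Without the flags, a request would have to be broadcast to every other core. With them, only the cores recorded for that line are contacted, and a line absent from the L3 requires no snoop at all.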
The coherency protocol used with this new cache organization
is known as MESIF (Modified, Exclusive, Shared, Invalid,
Forward), which is a modification of the popular MESI protocol.
Each cache line is in one of the five states:
• Modified - The cache line is only present in the current
cache and does not match main memory (dirty). This line
must be written back to main memory before any other
reads of that address take place.


