Early experience with the Barrelfish OS and the Single-Chip Cloud Computer
Available from www.inf.ethz.ch
Simon Peter, Adrian Schüpbach, Dominik Menzi and Timothy Roscoe
Systems Group, Department of Computer Science, ETH Zurich
Abstract—Traditional OS architectures based on a single,
shared-memory kernel face significant challenges from hardware
trends, in particular the increasing cost of system-wide cache-
coherence as core counts increase, and the emergence of hetero-
geneous architectures – both on a single die, and also between
CPUs, co-processors like GPUs, and programmable peripherals
within a platform.
The multikernel is an alternative OS model that employs
message passing instead of data sharing and enables architecture-
agnostic inter-core communication, including across non-coherent
shared memory and PCIe peripheral buses. This allows a single
OS instance to manage the complete collection of heterogeneous,
non-cache-coherent processors as a single, unified platform.
We report on our experience running the Barrelfish research
multikernel OS on the Intel Single-Chip Cloud Computer (SCC).
We describe the minimal changes required to bring the OS up
on the SCC, and present early performance results from an SCC
system running standalone, and also a single Barrelfish instance
running across a heterogeneous machine consisting of an SCC
and its host PC.
I. INTRODUCTION
The architecture of computer systems is radically chang-
ing: core counts are increasing, systems are becoming more
heterogeneous, and the memory system is becoming less
uniform. As part of this change, it is likely that system-wide
cache-coherent shared memory will no longer exist. This is
happening not only because specialized co-processors, like GPUs,
are becoming more closely integrated with the rest of the system,
but also because, as core counts increase, we expect cache coherence
to no longer be maintained between general-purpose cores.
Shared-memory operating systems do not deal with this
complexity. Among the several alternative OS models,
one interesting design is to eschew data sharing between
cores and to rely on message passing instead. This enforces
disciplined sharing and enables architecture-agnostic commu-
nication across a number of very different interconnects. In
fact, experimental non-cache-coherent architectures, such as
the Intel Single-Chip Cloud Computer (SCC) [1], already
facilitate message passing with special hardware support.
In this paper, we report on our efforts to port Barrelfish to
the SCC. Barrelfish is an open-source research OS developed
by ETH Zurich and Microsoft Research and is structured
as a multikernel [2]: a distributed system of cores which
communicate exclusively via messages.
The multikernel is a natural fit for the SCC: it can fully
leverage the hardware message passing facilities, while requiring
only minimal changes to the Barrelfish implementation for
a port from x86 multiprocessors to the SCC. Furthermore, the
SCC is a good example of the anticipated future system types,
as it is both a non-cache-coherent multicore chip and
a host system peripheral.
Fig. 1. Sending of a message between SCC cores. (Diagram: App 1 on Core A, App 2 on Core B, a message queue in shared memory, the kernels, and the MPB. Steps: 1. Write, 2. System call, 3. Write, 4. IPI, 5. Read, 6. Notify, 7. Read.)
We describe the modifications to Barrelfish’s message-
passing implementation, the single most important subsystem
needing adaptation. We give initial performance results on the
SCC and across a heterogeneous machine consisting of an
SCC peripheral and its host PC.
II. MESSAGE PASSING DESIGN
Message passing in Barrelfish is implemented by a mes-
sage passing stub and lower-level interconnect and notifi-
cation drivers. The message passing stub is responsible for
(un-)marshaling message arguments into a message queue and
provides the API to applications. Messages are subsequently
sent (received) by the interconnect driver, using the notification
driver to inform the receiver of pending messages. Messages
can be batched to reduce the number of notifications required.
Message passing is performance-critical in Barrelfish and
thus heavily tied to the hardware architecture. In this section,
we describe the interconnect and notification driver design
between cores on the SCC, as well as between host PC and
the SCC. We mention changes to the message passing stub
where appropriate.
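The layering described above can be illustrated with a minimal sketch. It is a plain Python model, not Barrelfish code: the class names, the `batch_size` parameter, and the `"<IQ"` wire format (a 32-bit opcode plus a 64-bit argument) are all assumptions made for illustration. It shows a stub that marshals arguments, an interconnect driver that queues them, and a notification driver that is invoked once per batch rather than once per message.

```python
import struct
from collections import deque


class NotificationDriver:
    """Counts notifications; a real driver would signal the receiving core."""

    def __init__(self):
        self.sent = 0

    def notify(self):
        self.sent += 1


class InterconnectDriver:
    """Moves marshaled messages through a queue and batches notifications."""

    def __init__(self, notifier, batch_size=1):
        self.queue = deque()
        self.notifier = notifier
        self.batch_size = batch_size  # hypothetical tuning knob
        self.unannounced = 0  # messages queued since the last notification

    def send(self, raw):
        self.queue.append(raw)
        self.unannounced += 1
        if self.unannounced >= self.batch_size:
            self.flush()

    def flush(self):
        """Announce all queued messages with a single notification."""
        if self.unannounced:
            self.notifier.notify()
            self.unannounced = 0

    def receive_all(self):
        messages = list(self.queue)
        self.queue.clear()
        return messages


class Stub:
    """Marshals (opcode, argument) pairs; the wire format is hypothetical."""

    WIRE_FORMAT = "<IQ"  # 32-bit opcode, 64-bit argument, little-endian

    def __init__(self, driver):
        self.driver = driver

    def send(self, opcode, argument):
        self.driver.send(struct.pack(self.WIRE_FORMAT, opcode, argument))

    def receive(self):
        return [struct.unpack(self.WIRE_FORMAT, raw)
                for raw in self.driver.receive_all()]
```

With `batch_size=4`, sending eight messages through the stub triggers two notifications instead of eight, which is the saving that batching is meant to provide.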
A. SCC Inter-core Message Passing
The SCC interconnect driver reliably transports cache-line-
sized messages (32 bytes) through a message queue in non-
coherent shared memory. Shared memory is accessed entirely
from user-space, shown by steps 1 and 7 in Figure 1, using
the SCC write-combine buffer for performance. The polling
approach to detect incoming messages, used by light-weight
message passing runtimes, such as RCCE [3], is inappropriate
when using shared memory to deliver message payloads, since
each poll of a message-passing channel requires a cache
invalidate followed by a load from DDR3 memory.
Fig. 2. Sending of a message from host to SCC. (Diagram steps: 1. Write, 2. Notify, 3. Copy, 4. Write, 5. Copy, 6. IPI, 7. Notify, 8. Read, 9. Notify, 10. Read.)
Consequently, the notification driver uses inter-core notifica-
tions, implemented within per-core kernels, to signal message
arrival. Notifications are sent by a system call (2) via a ring-
buffer on the receiver’s on-tile message passing buffer (MPB)
and reference shared-memory channels with pending messages
(3). An inter-core interrupt (IPI) is used to inform the peer
kernel of the notification (4), which it forwards to the target
application (6).
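The seven steps of Figure 1 can be traced in a small simulation. This is a sketch under stated assumptions, not the Barrelfish implementation: `Channel`, `Core`, `send`, `receive`, the `mpb_slots` capacity, and the zero-padding of payloads are hypothetical, and the IPI is modeled as a direct function call. Only the 32-byte message size and the ordering of steps are taken from the text.

```python
from collections import deque

CACHE_LINE = 32  # bytes; message size stated in the text


class Channel:
    """A message queue in shared main memory, one per connection."""

    def __init__(self, chan_id):
        self.chan_id = chan_id
        self.queue = deque()


class Core:
    """A core with an on-tile MPB ring buffer holding channel IDs."""

    def __init__(self, core_id, mpb_slots=8):
        self.core_id = core_id
        self.mpb = deque()
        self.mpb_slots = mpb_slots  # hypothetical ring-buffer capacity
        self.pending = deque()      # notifications forwarded to user space

    def kernel_handle_ipi(self):
        while self.mpb:
            chan_id = self.mpb.popleft()  # (5) kernel reads its MPB
            self.pending.append(chan_id)  # (6) kernel notifies the application


def send(channel, receiver, payload):
    if len(payload) > CACHE_LINE:
        raise ValueError("payload exceeds one cache line")
    # (1) user-space write of the payload into the shared-memory queue
    channel.queue.append(payload.ljust(CACHE_LINE, b"\0"))
    # (2) system call into the sender's kernel, which then
    # (3) writes the channel ID into the receiver's MPB ring buffer
    if len(receiver.mpb) >= receiver.mpb_slots:
        raise BufferError("MPB ring buffer full")
    receiver.mpb.append(channel.chan_id)
    # (4) IPI to the receiving core, modeled here as a direct call
    receiver.kernel_handle_ipi()


def receive(core, channels):
    """(7) user-space read of every channel that has a pending notification."""
    messages = []
    while core.pending:
        channel = channels[core.pending.popleft()]
        while channel.queue:
            messages.append((channel.chan_id, channel.queue.popleft()))
    return messages
```

Because the MPB carries only channel identifiers, many channels and several applications can share one small ring buffer per core, while payloads stay in main memory.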
At first sight, it may seem odd to use main memory (rather
than the on-tile MPB) for passing message payloads, and to
require a trap to the kernel to send a message notification.
This design is motivated by the need to support many message
channels in Barrelfish and more than one application running
on a core. The SCC’s message-passing functionality does not
appear to have been designed with this use-case in mind. We
discuss this issue further in Section IV.
B. Host-SCC Message Passing
The SCC is connected to a host PC as a PCI express (PCIe)
device and provides access to memory and internal registers
via a system interface (SIF). The host PC can write via the
SIF directly to SCC memory using programmed I/O or the
built-in direct memory access (DMA) engine.
The interconnect-notification driver combination used between
host PC and SCC, called SIFMP, employs two proxy
drivers: one on the host and one on the SCC. New SIFMP
connections are registered with the local proxy driver. When
the interconnect driver is sending a message by writing to
the local queue (1), the notification driver notifies the proxy
driver (2), which copies the payload to an identical queue
on the other side of the PCIe bus (3). The proxy driver then
forwards the notification to the receiver of the message via
a private message queue (4, 5), sending an IPI to inform the
receiving driver of the notification via its local kernel on the
SCC (6, 7). The peer proxy reads the notification from its
private queue (8) and forwards it to the target application
(9), which receives the message by reading the local copy
via its interconnect driver (10). This implementation, shown
in Figure 2, uses two message queues (one on each side) and
two proxy driver connections (one for each driver) for each
SIFMP connection.
Fig. 3. Average notification latency from core 0 (Overall). Send and Receive show time spent in sending and receiving, respectively. (Plot: x-axis "Message to core", 0 to 50; y-axis "Latency [cycles]", 0 to 6000; series Overall, Send, Receive.)
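The SIFMP path through the two proxy drivers can be sketched in the same way. This is an illustrative model only: `Side`, `sifmp_send`, `proxy_deliver`, and `sifmp_receive` are hypothetical names, the PCIe copy and the IPI are modeled as plain function calls, and the assumption that the proxy removes the entry from the sender's local queue once it has been copied is ours, not the paper's.

```python
from collections import deque


class Side:
    """One end of the PCIe bus (host or SCC) with its proxy driver state."""

    def __init__(self, name):
        self.name = name
        self.queues = {}                  # connection ID -> local message queue
        self.proxy_inbox = deque()        # private queue of the local proxy
        self.app_notifications = deque()  # notifications forwarded to the app

    def register(self, conn_id):
        """Register a new SIFMP connection with the local proxy driver."""
        self.queues[conn_id] = deque()


def proxy_deliver(side):
    while side.proxy_inbox:
        conn_id = side.proxy_inbox.popleft()    # (8) proxy reads its private queue
        side.app_notifications.append(conn_id)  # (9) proxy notifies the application


def sifmp_send(src, dst, conn_id, payload):
    src.queues[conn_id].append(payload)  # (1) write to the local queue
    # (2) the notification driver notifies the local proxy, which
    # (3) copies the payload to the identical queue across the PCIe bus
    #     (assumption: the local entry is consumed once copied)
    dst.queues[conn_id].append(src.queues[conn_id].popleft())
    # (4, 5) the notification is written to the peer proxy's private queue
    dst.proxy_inbox.append(conn_id)
    # (6, 7) IPI and kernel notification, modeled as a direct call
    proxy_deliver(dst)


def sifmp_receive(side):
    """(10) the application reads the local copy of each announced message."""
    messages = []
    while side.app_notifications:
        conn_id = side.app_notifications.popleft()
        queue = side.queues[conn_id]
        while queue:
            messages.append((conn_id, queue.popleft()))
    return messages
```

A message sent from a `Side("host")` to a `Side("scc")` on a connection registered on both ends appears in the SCC-side queue only after the copy in step 3, which is where the PCIe crossing cost is incurred.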
III. EVALUATION
We evaluate message passing efficiency by running a series
of messaging benchmarks. All benchmarks execute on a Rocky
Lake board, configured to 533 MHz core clock speed, 800 MHz
memory mesh speed and 266 MHz SIF clock speed. The host
PC is a Sun Fire X2270, clocked at 2.3 GHz.
A. Notification via MPB
We use a ping-pong notification experiment to evaluate the
cost of OS-level notification delivery between two peer cores.
Notifications are performance-critical, as they inform a user-space
program on another core of message payload arrival. The
experiment covers the overheads of the system call required to
send the notification from user-space, the actual send via the
MPB and corresponding IPI, and forwarding the notification
to user-space on the receiver.
Figure 3 shows the average latency over 100,000 iterations
of this benchmark between core 0 and each other core, as
well as a break-down into send and receive cost. As expected,
differences in messaging cost due to topology are only notice-
able on the sender, where the cost to write to remote memory
occurs. The relatively large cost of receiving the message is
due to the direct cost of the trap incurred by the IPI, which
we approximate at 600 cycles, and the additional, much larger
indirect cost of cache misses associated with the trap.
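The structure of a ping-pong measurement can be sketched as follows. This is only an analogy for the methodology: it uses two Python threads and `queue.Queue` in place of SCC cores and MPB notifications, wall-clock nanoseconds in place of cycle counters, and the function name `ping_pong_round_trip_ns` is hypothetical. It reports the average round-trip time; it makes no attempt to reproduce the numbers in Figure 3.

```python
import queue
import threading
import time


def ping_pong_round_trip_ns(iterations):
    """Return the average round-trip time in nanoseconds between two threads."""
    to_peer = queue.Queue()
    from_peer = queue.Queue()

    def peer():
        # The peer echoes every notification straight back to the sender.
        for _ in range(iterations):
            from_peer.put(to_peer.get())

    worker = threading.Thread(target=peer)
    worker.start()

    start = time.perf_counter_ns()
    for i in range(iterations):
        to_peer.put(i)           # ping
        reply = from_peer.get()  # pong
        if reply != i:
            raise RuntimeError("unexpected reply")
    elapsed = time.perf_counter_ns() - start

    worker.join()
    return elapsed / iterations
```

Averaging over many iterations, as the experiment does with 100,000, smooths out scheduling noise in individual round trips.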
B. Host-SCC Messaging
We determined the one-way latency of SIFMP for a cache-
line size message from host to SCC to be on the order of
5 million host cycles. As expected from a communication
channel that crosses the PCIe bus, SIFMP is several orders of
magnitude slower than messaging on the host (approximately
1000 cycles). To gain more insight into the latency difference,
we assess the performance of the PCIe proxy driver implementation
by evaluating the latency of DMA reads of varying size
from SCC memory to the host PC.
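To put the cycle counts quoted in this section in perspective, the short calculation below converts them to wall-clock time. It assumes that the 5 million and 1000 cycle figures are both host cycles at the stated 2.3 GHz, and that the 600-cycle trap cost is measured in SCC core cycles at 533 MHz; the helper name `cycles_to_us` is ours.

```python
import math

HOST_HZ = 2.3e9      # host clock stated in the text
SCC_CORE_HZ = 533e6  # SCC core clock stated in the text


def cycles_to_us(cycles, hz):
    """Convert a cycle count at clock frequency hz to microseconds."""
    return cycles / hz * 1e6


sifmp_us = cycles_to_us(5_000_000, HOST_HZ)  # one-way SIFMP latency, about 2.2 ms
host_us = cycles_to_us(1_000, HOST_HZ)       # host-local messaging, under 0.5 us
trap_us = cycles_to_us(600, SCC_CORE_HZ)     # direct IPI trap cost, about 1.1 us

slowdown = 5_000_000 / 1_000                 # SIFMP vs. host-local messaging
orders_of_magnitude = math.log10(slowdown)   # between 3 and 4
```

Under these assumptions the gap between SIFMP and host-local messaging is a factor of 5000, that is, between three and four orders of magnitude.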


