Your computer is already a distributed system . Why isn ’ t your OS ?
Computer (2009)
Available from portal.acm.org
or
Abstract
We argue that a new OS for a multicore machine should be designed ground-up as a distributed system, using concepts from that field. Modern hardware resembles a networked system even more than past large multiprocessors:
Available from portal.acm.org
Page 1
Your computer is already a distributed system . Why isn ’ t your OS ?
Your computer is already a distributed system. Why isn’t your OS?
Andrew Baumann Simon Peter Adrian Schüpbach Akhilesh Singhania
Timothy Roscoe Paul Barhamy Rebecca Isaacsy
Systems Group, Department of Computer Science, ETH Zurich
yMicrosoft Research, Cambridge
1 Introduction
We argue that a new OS for a multicore machine should
be designed ground-up as a distributed system, using
concepts from that field. Modern hardware resembles
a networked system even more than past large multipro-
cessors: in addition to familiar latency eects, it exhibits
node heterogeneity and dynamic membership changes.
Cache coherence protocols encourage OS designers to
selectively ignore this, except for limited performance
reasons. Commodity OS designs have mostly assumed
fixed, uniform CPU architectures and memory systems.
It is time for researchers to abandon this approach and
engage fully with the distributed nature of the machine,
carefully applying (but also modifying) ideas from dis-
tributed systems to the design of new operating systems.
Our goal is to make it easier to design and construct ro-
bust OSes that eectively exploit heterogeneous, multi-
core hardware at scale. We approach this through a new
OS architecture resembling a distributed system.
The use of heterogeneous multicore in commodity
computer systems, running dynamic workloads with in-
creased reliance on OS services, will face new challenges
not addressed by monolithic OSes in either general-
purpose or high-performance computing.
It is possible that existing OS architectures can be
evolved and scaled to address these challenges. How-
ever, we claim that stepping back and reconsidering OS
structure is a better way to get insight into the problem,
regardless of whether the goal is to retrofit new ideas to
existing systems, or to replace them over time.
In the next section we elaborate on why modern com-
puters should be thought of as networked systems. Sec-
tion 3 discusses the implications of distributed systems
principles that are pertinent to new OS architecture, and
Section 4 describes a possible architecture for an OS fol-
lowing these ideas. Section 5 lays out open questions at
the intersection of distributed systems and OS research,
and Section 6 concludes.
L3
RAMRAM
L1
L2
CPU
L2
CPU
L1
L2
CPU
L1
L2
CPU
L1
PCIe
PCIe
0
1 3
2 4
5 7
6
HyperTransport links
Figure 1: Node layout of a commodity 32-core machine
2 Observations
A modern computer is undeniably a networked system
of point-to-point links exchanging messages: Figure 1
shows a 32-core commodity PC server in our lab1. But
our argument is more than this: distributed systems
(applications, networks, P2P systems) are historically
distinguished from centralized ones by three additional
challenges: node heterogeneity, dynamic changes due to
partial failures and other reconfigurations, and latency.
Modern computers exhibit all these features.
Heterogeneity: Centralized computer systems tradi-
tionally assume that all the processors which share mem-
ory have the same architecture and performance trade-
os. While a few computers have had heterogeneous
main processors in the past, this is now becoming
the norm in the commodity computer market: vendor
roadmaps show cores on a die with diering instruction
set variants, graphics cards and network interfaces are in-
creasingly programmable, and applications are emerging
for FPGAs plugged into processor sockets.
Managing diverse cores with the same kernel object
code is clearly impossible. At present, some processors
(such as GPUs) are special-cased and abstracted as de-
vices, as a compromise to mainstream operating systems
which cannot easily represent dierent processing archi-
1A Tyan Thunder S4985 board with M4985 Quad CPU card and 8
AMD “Barcelona” quad-core processors. This board is 3 years old.
1
Andrew Baumann Simon Peter Adrian Schüpbach Akhilesh Singhania
Timothy Roscoe Paul Barhamy Rebecca Isaacsy
Systems Group, Department of Computer Science, ETH Zurich
yMicrosoft Research, Cambridge
1 Introduction
We argue that a new OS for a multicore machine should
be designed ground-up as a distributed system, using
concepts from that field. Modern hardware resembles
a networked system even more than past large multipro-
cessors: in addition to familiar latency eects, it exhibits
node heterogeneity and dynamic membership changes.
Cache coherence protocols encourage OS designers to
selectively ignore this, except for limited performance
reasons. Commodity OS designs have mostly assumed
fixed, uniform CPU architectures and memory systems.
It is time for researchers to abandon this approach and
engage fully with the distributed nature of the machine,
carefully applying (but also modifying) ideas from dis-
tributed systems to the design of new operating systems.
Our goal is to make it easier to design and construct ro-
bust OSes that eectively exploit heterogeneous, multi-
core hardware at scale. We approach this through a new
OS architecture resembling a distributed system.
The use of heterogeneous multicore in commodity
computer systems, running dynamic workloads with in-
creased reliance on OS services, will face new challenges
not addressed by monolithic OSes in either general-
purpose or high-performance computing.
It is possible that existing OS architectures can be
evolved and scaled to address these challenges. How-
ever, we claim that stepping back and reconsidering OS
structure is a better way to get insight into the problem,
regardless of whether the goal is to retrofit new ideas to
existing systems, or to replace them over time.
In the next section we elaborate on why modern com-
puters should be thought of as networked systems. Sec-
tion 3 discusses the implications of distributed systems
principles that are pertinent to new OS architecture, and
Section 4 describes a possible architecture for an OS fol-
lowing these ideas. Section 5 lays out open questions at
the intersection of distributed systems and OS research,
and Section 6 concludes.
L3
RAMRAM
L1
L2
CPU
L2
CPU
L1
L2
CPU
L1
L2
CPU
L1
PCIe
PCIe
0
1 3
2 4
5 7
6
HyperTransport links
Figure 1: Node layout of a commodity 32-core machine
2 Observations
A modern computer is undeniably a networked system
of point-to-point links exchanging messages: Figure 1
shows a 32-core commodity PC server in our lab1. But
our argument is more than this: distributed systems
(applications, networks, P2P systems) are historically
distinguished from centralized ones by three additional
challenges: node heterogeneity, dynamic changes due to
partial failures and other reconfigurations, and latency.
Modern computers exhibit all these features.
Heterogeneity: Centralized computer systems tradi-
tionally assume that all the processors which share mem-
ory have the same architecture and performance trade-
os. While a few computers have had heterogeneous
main processors in the past, this is now becoming
the norm in the commodity computer market: vendor
roadmaps show cores on a die with diering instruction
set variants, graphics cards and network interfaces are in-
creasingly programmable, and applications are emerging
for FPGAs plugged into processor sockets.
Managing diverse cores with the same kernel object
code is clearly impossible. At present, some processors
(such as GPUs) are special-cased and abstracted as de-
vices, as a compromise to mainstream operating systems
which cannot easily represent dierent processing archi-
1A Tyan Thunder S4985 board with M4985 Quad CPU card and 8
AMD “Barcelona” quad-core processors. This board is 3 years old.
1
Page 2
Access cycles normalized to L1 per-hop cost
L1 cache 2 1 -
L2 cache 15 7.5 -
L3 cache 75 37.5 -
Other L1/L2 130 65 -
1-hop cache 190 95 60
2-hop cache 260 130 70
Table 1: Latency of cache access for the PC in Figure 1.
tectures within the same kernel. In other cases, a pro-
grammable peripheral itself runs its own (embedded) OS.
So far, no coherent OS architecture has emerged which
accommodates such processors in a single framework.
Given their radical dierences, a distributed systems ap-
proach, with well-defined interfaces between the soft-
ware on these cores, seems the only viable one.
Dynamicity: Nodes in a distributed system come and
go, as a result of changes in provisioning, failures, net-
work anomalies, etc. The hardware of a computer from
the OS perspective is not viewed in this manner.
Partial failure in a single computer is not a mainstream
concern, though recently failure of parts of the OS has
become one [15], a recognition that monolithic kernels
are now too complex, and written by too many people, to
be considered a single unit of failure.
However, other sources of dynamicity within a single
OS are now commonplace. Hot-plugging of devices, and
in some cases memory and processors, is becoming the
norm. Increasingly sophisticated power management al-
lows cores, memory systems, and peripheral devices to
be put into a variety of low-power states, with important
implications for how the OS functions: if a peripheral
bus controller is powered down, all the peripherals on
that bus become inaccessible. If a processor core is sus-
pended, any processes on that core are unreachable until
it resumes, unless they are migrated.
Communication latency: The problem of latency in
cache-coherent NUMA machines is well-known. Table
1 shows the dierent latencies of accessing cache on the
machine in Figure 1, in line with those reported by Boyd-
Wickizer et al. for a 16-core machine [5]2. Accessing a
cache line from a dierent core is up to 130 times slower
than from local L1 cache, and 17 times slower than lo-
cal L2 cache. The trend is towards more cores and an
increasingly complex memory hierarchy.
Because most operating systems coordinate data struc-
tures between cores using shared memory, OS designs
focus on optimizing this in the face of memory latency.
While locks can scale to large machines [12], locality of
2Numbers for main memory are complex due to the broadcast na-
ture of the cache coherence protocol, and do not add to our argument.
data becomes a performance problem [2], since fetching
remote memory eectively causes the hardware to per-
form an RPC call. In Section 3 we suggest insights from
distributed systems which can help mitigate this.
Summary: A single machine today consists of a dy-
namically changing collection of heterogeneous pro-
cessing elements, communicating via channels (whether
messages or shared-memory) with diverse, but often
highly significant, latencies. In other words, it has all the
classic characteristics of a distributed, networked system.
Why are we not programming it as such?
3 Implications
We now present some ideas, concepts, and principles
from distributed systems which we feel are particularly
relevant to OS design for modern machines. We draw
parallels with existing ideas in operating systems, and
where there is no corresponding notion, suggest how the
idea might be applied. We have found that viewing OS
problems as distributed systems problems either suggests
new solutions to current problems, or fruitfully casts new
light on known OS techniques.
Message passing vs. shared memory: Traversing a
shared data structure in a modern cache-coherent system
is equivalent to a series of synchronous RPCs to fetch
remote cache lines. For a complex data structure, this
means lots of round trips, whose performance is limited
by the latency and bandwidth of the interconnect.
In distributed systems, “chatty” RPCs are reduced by
encoding the high-level operation more compactly; in the
limit, this becomes a single RPC or code shipping. When
communication is expensive (in latency or bandwidth), it
is more ecient to send a compact message encoding
a complex operation than to access the data remotely.
While there was much research in the 1990s into dis-
tributed shared virtual memory on clusters (e.g. [1]), it is
rarely used today.
In an OS, a message-passing primitive can make more
ecient use of the interconnect and reduce latency over
sharing data structures between cores. If an operation
on a data structure and its results can each be compactly
encoded in less than a cache line, a carefully written and
ecient user-level RPC [3] implementation which leaves
the data where it is in a remote cache can incur less over-
head in terms of total machine cycles. Moreover, the use
of message passing rather than shared data facilitates in-
teroperation between heterogeneous processors.
Note that this line of reasoning is independent of
the overhead for synchronization (e.g. through scalable
locks). The performance issues arise less from lock con-
tention than from data locality issues, an observation
which has been made before [2].
2
L1 cache 2 1 -
L2 cache 15 7.5 -
L3 cache 75 37.5 -
Other L1/L2 130 65 -
1-hop cache 190 95 60
2-hop cache 260 130 70
Table 1: Latency of cache access for the PC in Figure 1.
tectures within the same kernel. In other cases, a pro-
grammable peripheral itself runs its own (embedded) OS.
So far, no coherent OS architecture has emerged which
accommodates such processors in a single framework.
Given their radical dierences, a distributed systems ap-
proach, with well-defined interfaces between the soft-
ware on these cores, seems the only viable one.
Dynamicity: Nodes in a distributed system come and
go, as a result of changes in provisioning, failures, net-
work anomalies, etc. The hardware of a computer from
the OS perspective is not viewed in this manner.
Partial failure in a single computer is not a mainstream
concern, though recently failure of parts of the OS has
become one [15], a recognition that monolithic kernels
are now too complex, and written by too many people, to
be considered a single unit of failure.
However, other sources of dynamicity within a single
OS are now commonplace. Hot-plugging of devices, and
in some cases memory and processors, is becoming the
norm. Increasingly sophisticated power management al-
lows cores, memory systems, and peripheral devices to
be put into a variety of low-power states, with important
implications for how the OS functions: if a peripheral
bus controller is powered down, all the peripherals on
that bus become inaccessible. If a processor core is sus-
pended, any processes on that core are unreachable until
it resumes, unless they are migrated.
Communication latency: The problem of latency in
cache-coherent NUMA machines is well-known. Table
1 shows the dierent latencies of accessing cache on the
machine in Figure 1, in line with those reported by Boyd-
Wickizer et al. for a 16-core machine [5]2. Accessing a
cache line from a dierent core is up to 130 times slower
than from local L1 cache, and 17 times slower than lo-
cal L2 cache. The trend is towards more cores and an
increasingly complex memory hierarchy.
Because most operating systems coordinate data struc-
tures between cores using shared memory, OS designs
focus on optimizing this in the face of memory latency.
While locks can scale to large machines [12], locality of
2Numbers for main memory are complex due to the broadcast na-
ture of the cache coherence protocol, and do not add to our argument.
data becomes a performance problem [2], since fetching
remote memory eectively causes the hardware to per-
form an RPC call. In Section 3 we suggest insights from
distributed systems which can help mitigate this.
Summary: A single machine today consists of a dy-
namically changing collection of heterogeneous pro-
cessing elements, communicating via channels (whether
messages or shared-memory) with diverse, but often
highly significant, latencies. In other words, it has all the
classic characteristics of a distributed, networked system.
Why are we not programming it as such?
3 Implications
We now present some ideas, concepts, and principles
from distributed systems which we feel are particularly
relevant to OS design for modern machines. We draw
parallels with existing ideas in operating systems, and
where there is no corresponding notion, suggest how the
idea might be applied. We have found that viewing OS
problems as distributed systems problems either suggests
new solutions to current problems, or fruitfully casts new
light on known OS techniques.
Message passing vs. shared memory: Traversing a
shared data structure in a modern cache-coherent system
is equivalent to a series of synchronous RPCs to fetch
remote cache lines. For a complex data structure, this
means lots of round trips, whose performance is limited
by the latency and bandwidth of the interconnect.
In distributed systems, “chatty” RPCs are reduced by
encoding the high-level operation more compactly; in the
limit, this becomes a single RPC or code shipping. When
communication is expensive (in latency or bandwidth), it
is more ecient to send a compact message encoding
a complex operation than to access the data remotely.
While there was much research in the 1990s into dis-
tributed shared virtual memory on clusters (e.g. [1]), it is
rarely used today.
In an OS, a message-passing primitive can make more
ecient use of the interconnect and reduce latency over
sharing data structures between cores. If an operation
on a data structure and its results can each be compactly
encoded in less than a cache line, a carefully written and
ecient user-level RPC [3] implementation which leaves
the data where it is in a remote cache can incur less over-
head in terms of total machine cycles. Moreover, the use
of message passing rather than shared data facilitates in-
teroperation between heterogeneous processors.
Note that this line of reasoning is independent of
the overhead for synchronization (e.g. through scalable
locks). The performance issues arise less from lock con-
tention than from data locality issues, an observation
which has been made before [2].
2
Page 3
Explicit message-based communication is amenable
to both informal and formal analysis, using techniques
such as process calculi and queueing theory. In contrast,
although they have been seen by some as easier to pro-
gram, it is notoriously dicult to prove correctness re-
sults about, or predict the performance of, systems based
on implicit shared-memory communication.
Long ago, Lauer and Needham [11] pointed out the
equivalence of shared-memory and message passing in
OS structure, arguing that the choice should be guided by
what is best supported by the underlying substrate. More
recently, Chaves et al. [7] considered the same choice
in the context of an operating system for an early mul-
tiprocessor, finding the performance tradeo biased to-
wards message passing for many kernel operations. We
claim that hardware today strongly motivates a message-
passing model, and refine this broad claim below.
Replication of data is used in distributed systems to in-
crease throughput for read-mostly workloads and to in-
crease availability, and there are clear parallels in operat-
ing systems. Processor caches and TLBs replicate data in
hardware for performance, with the OS sometimes han-
dling consistency as with TLB shootdown.
The performance benefits of replicating in software
have not been lost on OS designers. Tornado [10] repli-
cated (as well as partitioned) OS state across cores to
improve scalability and reduce memory contention, and
fos [17] argues for “fleets” of replicated OS servers.
Commodity OSes such as Linux now replicate cached
pages and read-only data such as program text [4].
However, replication in current OSes is treated as an
optimization of the shared data model. We suggest that it
is useful to instead see replication as the default model,
and that programming interfaces should reflect this – in
other words, OS code should be written as if it accessed
a local replica of data rather than a single shared copy.
The principal impact on clients is that they now in-
voke an agreement protocol (propose a change to system
state, and later receive agreement or failure notification)
rather than modifying data under a lock or transaction.
The change of model is important because it provides a
uniform way to synchronize state across heterogeneous
processors that may not coherently share memory.
At scale we expect the overhead of a relatively long-
running, split-phase (asynchronous) agreement protocol
over a lightweight transport such as URPC to be less than
the heavyweight global barrier model of inter-processor
interrupts (IPIs), particularly as the usual batching and
pipelining optimizations also apply with agreement. In
eect, we are trading increased latency o against lower
overhead for operations.
Of course, sometimes sharing is cheap. Replication of
data across closely coupled cores, such as those sharing
L2 or L3 cache, is likely to perform substantially slower
than spinlocks. However, we can reintroduce sharing and
locks (or transactions) for groups of cores as an optimiza-
tion behind the replication interface.
This is a reversal of the scalability trend in kernels
whereby replication and partitioning is used to optimize
locks and sharing; in contrast, we advocate using locks
and limited sharing to optimize replica maintenance.
Consistency: Maintaining the consistency of replicas
in current operating systems is a fairly simple aair. Typ-
ically an initiator synchronously contacts all cores, often
via a global IPI, and waits for a confirmation. In the case
of TLB shootdown, where this occurs frequently, it is a
well-known scalability bottleneck. Some existing opti-
mizations for global TLB shootdown are familiar from
distributed systems, such as deferring and coalescing up-
dates. Uhlig’s TLB shootdown algorithm [16] appears to
be a form of asynchronous single-phase commit.
However, the design space for agreement and consen-
sus protocols is large and mostly unexploited in OS de-
sign. Operations such as page mapping, file I/O, network
connection setup and teardown, etc. have varying consis-
tency and ordering requirements, and distributed systems
have much to oer in insights to these problems, partic-
ularly as systems become more concurrent and diverse.
Furthermore, the ability of consensus protocols to
agree on ordering of operations even when some par-
ticipants are (temporarily) unavailable seems relevant to
the problem of ensuring that processors which have been
powered down subsequently resume with their OS state
consistent before restarting processes. This is tricky code
to write, and the OS world currently lacks a good frame-
work within which to think about such operations.
Just as importantly, reasoning about OS state as a set
of consistent replicas with explicit constraints on the or-
dering of updates seems to hold out more hope of as-
suring correctness of a heterogeneous multiprocessor OS
than low-level analysis of locking and critical sections.
Network eects: Perhaps surprisingly, routing, con-
gestion, and queueing eects within a computer are al-
ready an issue. Conway and Hughes [8] document the
challenges for platform firmware in setting up routing ta-
bles and sizing forwarding queues in a HyperTransport-
based multiprocessor, and point out that link congestion
(as distinct from memory contention) is a performance
problem, an eect we have replicated on AMD hardware
in our lab. These are classical networking problems.
Closer to the level of system software, routing prob-
lems emerge when considering where to place buers
in memory as data flows through processors, DMA con-
trollers, memory, and peripherals. For example, data that
arrives at a machine and is immediately forwarded back
over the network should be placed in buers close to the
3
to both informal and formal analysis, using techniques
such as process calculi and queueing theory. In contrast,
although they have been seen by some as easier to pro-
gram, it is notoriously dicult to prove correctness re-
sults about, or predict the performance of, systems based
on implicit shared-memory communication.
Long ago, Lauer and Needham [11] pointed out the
equivalence of shared-memory and message passing in
OS structure, arguing that the choice should be guided by
what is best supported by the underlying substrate. More
recently, Chaves et al. [7] considered the same choice
in the context of an operating system for an early mul-
tiprocessor, finding the performance tradeo biased to-
wards message passing for many kernel operations. We
claim that hardware today strongly motivates a message-
passing model, and refine this broad claim below.
Replication of data is used in distributed systems to in-
crease throughput for read-mostly workloads and to in-
crease availability, and there are clear parallels in operat-
ing systems. Processor caches and TLBs replicate data in
hardware for performance, with the OS sometimes han-
dling consistency as with TLB shootdown.
The performance benefits of replicating in software
have not been lost on OS designers. Tornado [10] repli-
cated (as well as partitioned) OS state across cores to
improve scalability and reduce memory contention, and
fos [17] argues for “fleets” of replicated OS servers.
Commodity OSes such as Linux now replicate cached
pages and read-only data such as program text [4].
However, replication in current OSes is treated as an
optimization of the shared data model. We suggest that it
is useful to instead see replication as the default model,
and that programming interfaces should reflect this – in
other words, OS code should be written as if it accessed
a local replica of data rather than a single shared copy.
The principal impact on clients is that they now in-
voke an agreement protocol (propose a change to system
state, and later receive agreement or failure notification)
rather than modifying data under a lock or transaction.
The change of model is important because it provides a
uniform way to synchronize state across heterogeneous
processors that may not coherently share memory.
At scale we expect the overhead of a relatively long-
running, split-phase (asynchronous) agreement protocol
over a lightweight transport such as URPC to be less than
the heavyweight global barrier model of inter-processor
interrupts (IPIs), particularly as the usual batching and
pipelining optimizations also apply with agreement. In
eect, we are trading increased latency o against lower
overhead for operations.
Of course, sometimes sharing is cheap. Replication of
data across closely coupled cores, such as those sharing
L2 or L3 cache, is likely to perform substantially slower
than spinlocks. However, we can reintroduce sharing and
locks (or transactions) for groups of cores as an optimiza-
tion behind the replication interface.
This is a reversal of the scalability trend in kernels
whereby replication and partitioning is used to optimize
locks and sharing; in contrast, we advocate using locks
and limited sharing to optimize replica maintenance.
Consistency: Maintaining the consistency of replicas
in current operating systems is a fairly simple aair. Typ-
ically an initiator synchronously contacts all cores, often
via a global IPI, and waits for a confirmation. In the case
of TLB shootdown, where this occurs frequently, it is a
well-known scalability bottleneck. Some existing opti-
mizations for global TLB shootdown are familiar from
distributed systems, such as deferring and coalescing up-
dates. Uhlig’s TLB shootdown algorithm [16] appears to
be a form of asynchronous single-phase commit.
However, the design space for agreement and consen-
sus protocols is large and mostly unexploited in OS de-
sign. Operations such as page mapping, file I/O, network
connection setup and teardown, etc. have varying consis-
tency and ordering requirements, and distributed systems
have much to oer in insights to these problems, partic-
ularly as systems become more concurrent and diverse.
Furthermore, the ability of consensus protocols to
agree on ordering of operations even when some par-
ticipants are (temporarily) unavailable seems relevant to
the problem of ensuring that processors which have been
powered down subsequently resume with their OS state
consistent before restarting processes. This is tricky code
to write, and the OS world currently lacks a good frame-
work within which to think about such operations.
Just as importantly, reasoning about OS state as a set
of consistent replicas with explicit constraints on the or-
dering of updates seems to hold out more hope of as-
suring correctness of a heterogeneous multiprocessor OS
than low-level analysis of locking and critical sections.
Network eects: Perhaps surprisingly, routing, con-
gestion, and queueing eects within a computer are al-
ready an issue. Conway and Hughes [8] document the
challenges for platform firmware in setting up routing ta-
bles and sizing forwarding queues in a HyperTransport-
based multiprocessor, and point out that link congestion
(as distinct from memory contention) is a performance
problem, an eect we have replicated on AMD hardware
in our lab. These are classical networking problems.
Closer to the level of system software, routing prob-
lems emerge when considering where to place buers
in memory as data flows through processors, DMA con-
trollers, memory, and peripherals. For example, data that
arrives at a machine and is immediately forwarded back
over the network should be placed in buers close to the
3
Page 4
NIC, whereas data that will be read in its entirety should
be DMAed to memory local to the computing core [14].
Heterogeneity (and interoperability) have long been
key challenges in distributed systems, and have been
tackled at the data level using standardized messaging
protocols, and at the interface level using logical de-
scriptions of distributed services that software can reason
about (e.g. [9]), the most ambitious being the ontology
languages used for the semantic web.
We have proposed an analogous, though simpler, ap-
proach based on constraint logic programming to allow
an OS (and applications) to make sense of the diverse and
complex hardware on which it finds itself running [14].
4 The multikernel architecture
In this section, we sketch out the multikernel architecture
for an operating system built from the ground up as a dis-
tributed system, targeting modern multicore processors,
intelligent peripherals, and heterogeneous multiproces-
sors, and incorporating the ideas above.
We do not believe that this is the only, or even neces-
sarily the best, structure for such a system. However, it
is useful for two related reasons. Firstly, it represents an
extreme point in the design space, and hence serves as
a useful vehicle to investigate the full consequences of
viewing the machine as a networked system. Secondly,
it is designed with total disregard for compatibility with
either Windows or POSIX. In practice, we can achieve
compatibility with sub-optimal performance by running
a VMM over the OS [13], and this gives us the freedom
to investigate OS APIs better suited to both modern hard-
ware and the scheduling and I/O requirements of modern
concurrent language runtimes.
Message-based communication: Cross-core sharing
in a multikernel is avoided by default; instead, each core
runs its own, local OS node, as shown in Figure 2.
In keeping with the message-passing model, all com-
munication between cores is asynchronous (split-phase).
On current commodity hardware, we use URPC [3] as
our message transport, with recourse to IPIs only when
necessary for synchronization. Other transports may be
used over non-coherent links (such as PCIe), or for fu-
ture hardware. Other than URPC buers, no data struc-
tures whatsoever are shared between OS nodes. One of
the consequences of this is that the OS code for dierent
cores can be implemented entirely dierently (and, for
example, specialized for a given architecture).
Replication and consistency: Global consistency of
replicated state in the OS is handled by message passing
between OS nodes. All OS operations from user-space
processes that aect global state are split-phase, and are
x86
Async messages
App
x64 ARM GPU
App App
OS node OS node OS node OS node
State
replica
State
replica
State
replica
State
replica
App
Agreement
algor it hms
Heterogeneous
cores
Arch-specif ic
code
Figure 2: The multikernel architecture
mediated by the local OS node, which obtains agreement
where required from the other OS nodes in the system
and performs privileged operations locally.
For example, an OS node would change a user-space
memory mapping by a distributed two-phase commit,
first tentatively changing the local mapping, then initi-
ating agreement among other cores possessing the map-
ping, and finally committing the change (or aborting if a
conflicting mapping on another core preempted it).
Heterogeneity: We address heterogeneity in two ways.
First, all or part of the OS node can be specialized to the
core on which it runs. Second, we maintain a rich and
detailed representation of machine hardware and perfor-
mance tradeos in a service [14] that facilitates online
reasoning about code placement, routing, buer alloca-
tion, etc.
5 Open questions
We do not advocate blindly importing distributed sys-
tems ideas into OS design, and applying such ideas use-
fully in an OS is rarely straightforward. However, many
of the challenges lead to their own interesting research
directions.
What are the appropriate algorithms, and how do
they scale? Intuitively, distributed algorithms based on
messages should scale just as well as cache-coherent
shared memory, since at some level the latter is based on
the former. Passing compact encodings of complex op-
erations in our messages should show clear wins, but this
is still to be demonstrated empirically. It is ironic that,
years after microkernels failed due to the high cost of
messages versus memory access, OS design may adopt
message passing for the opposite reason.
There are many more relevant areas in distributed
computing than we have mentioned here. For exam-
ple, leadership and group membership algorithms may
be useful in handling hot-plugging of devices and CPUs.
Second-guessing the cache coherence protocol. On
current commodity hardware, the cache coherence pro-
4
be DMAed to memory local to the computing core [14].
Heterogeneity (and interoperability) have long been
key challenges in distributed systems, and have been
tackled at the data level using standardized messaging
protocols, and at the interface level using logical de-
scriptions of distributed services that software can reason
about (e.g. [9]), the most ambitious being the ontology
languages used for the semantic web.
We have proposed an analogous, though simpler, ap-
proach based on constraint logic programming to allow
an OS (and applications) to make sense of the diverse and
complex hardware on which it finds itself running [14].
4 The multikernel architecture
In this section, we sketch out the multikernel architecture
for an operating system built from the ground up as a dis-
tributed system, targeting modern multicore processors,
intelligent peripherals, and heterogeneous multiproces-
sors, and incorporating the ideas above.
We do not believe that this is the only, or even neces-
sarily the best, structure for such a system. However, it
is useful for two related reasons. Firstly, it represents an
extreme point in the design space, and hence serves as
a useful vehicle to investigate the full consequences of
viewing the machine as a networked system. Secondly,
it is designed with total disregard for compatibility with
either Windows or POSIX. In practice, we can achieve
compatibility with sub-optimal performance by running
a VMM over the OS [13], and this gives us the freedom
to investigate OS APIs better suited to both modern hard-
ware and the scheduling and I/O requirements of modern
concurrent language runtimes.
Message-based communication: Cross-core sharing
in a multikernel is avoided by default; instead, each core
runs its own, local OS node, as shown in Figure 2.
In keeping with the message-passing model, all com-
munication between cores is asynchronous (split-phase).
On current commodity hardware, we use URPC [3] as
our message transport, with recourse to IPIs only when
necessary for synchronization. Other transports may be
used over non-coherent links (such as PCIe), or for fu-
ture hardware. Other than URPC buers, no data struc-
tures whatsoever are shared between OS nodes. One of
the consequences of this is that the OS code for dierent
cores can be implemented entirely dierently (and, for
example, specialized for a given architecture).
Replication and consistency: Global consistency of
replicated state in the OS is handled by message passing
between OS nodes. All OS operations from user-space
processes that aect global state are split-phase, and are
x86
Async messages
App
x64 ARM GPU
App App
OS node OS node OS node OS node
State
replica
State
replica
State
replica
State
replica
App
Agreement
algor it hms
Heterogeneous
cores
Arch-specif ic
code
Figure 2: The multikernel architecture
mediated by the local OS node, which obtains agreement
where required from the other OS nodes in the system
and performs privileged operations locally.
For example, an OS node would change a user-space
memory mapping by a distributed two-phase commit,
first tentatively changing the local mapping, then initi-
ating agreement among other cores possessing the map-
ping, and finally committing the change (or aborting if a
conflicting mapping on another core preempted it).
Heterogeneity: We address heterogeneity in two ways.
First, all or part of the OS node can be specialized to the
core on which it runs. Second, we maintain a rich and
detailed representation of machine hardware and perfor-
mance tradeos in a service [14] that facilitates online
reasoning about code placement, routing, buer alloca-
tion, etc.
5 Open questions
We do not advocate blindly importing distributed sys-
tems ideas into OS design, and applying such ideas use-
fully in an OS is rarely straightforward. However, many
of the challenges lead to their own interesting research
directions.
What are the appropriate algorithms, and how do
they scale? Intuitively, distributed algorithms based on
messages should scale just as well as cache-coherent
shared memory, since at some level the latter is based on
the former. Passing compact encodings of complex op-
erations in our messages should show clear wins, but this
is still to be demonstrated empirically. It is ironic that,
years after microkernels failed due to the high cost of
messages versus memory access, OS design may adopt
message passing for the opposite reason.
There are many more relevant areas in distributed
computing than we have mentioned here. For exam-
ple, leadership and group membership algorithms may
be useful in handling hot-plugging of devices and CPUs.
Second-guessing the cache coherence protocol. On
current commodity hardware, the cache coherence pro-
4
Page 5
tocol is ultimately our message transport. Good perfor-
mance relies on a tight mapping between high-level mes-
sages and the movement of cache lines, using techniques
like URPC. At the same time, cache coherence provides
facilities (such as reliable broadcast) which may permit
novel optimizations in distributed algorithms for agree-
ment and the like. To further complicate matters, some
parts of the machine (such as main memory) may be
cache-coherent but others (such as programmable NICs
and GPUs) might not be, and our algorithms should per-
form well over this heterogeneous network.
The illusion of shared memory. Our focus is scal-
ing the OS, and thereby improving performance of OS-
intensive workloads, but the same arguments apply to ap-
plication code, an issue we intend to investigate. Nat-
urally, a multikernel should provide applications with
a shared-memory model if desired. However, while it
could be argued that shared memory is a simpler pro-
gramming model for applications, systems like Disco [6]
have been motivated by the opposite claim: that it is eas-
ier to build a scalable application from nodes (perhaps
small multiprocessors) communicating using messages.
A separate question concerns whether future multicore
designs will remain cache-coherent, or opt instead for a
dierent communication model (such as that used in the
Cell processor). A multikernel seems to oer the best
options here. As in some HPC designs, we may come to
view scalable cache-coherency hardware as an unneces-
sary luxury with better alternatives in software.
Where does the analogy break? There are important
dierences limiting the degree to which distributed algo-
rithms can be applied to OS design. Many arise from the
hardware-based message transport, such as fixed transfer
sizes, no ability to do in-network aggregation, static rout-
ing, and the need to poll for incoming messages. Others
(reliable messaging, broadcast, simpler failure models)
may allow novel optimizations.
Why stop at the edge of the box? Viewing a machine
as a distributed system makes the boundary between ma-
chines (traditionally the network interface) less clear-
cut, and more a question of degree (overhead, latency,
bandwidth, reliability). Some colleagues have therefore
suggested extending a multikernel-like OS across physi-
cal machines, or incorporating further networking ideas
(such as Byzantine fault tolerance) within a machine. We
are cautious (even skeptical) about these ideas, even in
the long term, but they remain intriguing.
Perhaps less radical is to look at how structuring a
single-node OS as a distributed system might make it
more suitable as part of a larger physically distributed
system, in an environment such as a data center.
6 Conclusion
Modern computers are inherently distributed systems,
and we miss opportunities to tackle the OS challenges
of new hardware if we ignore insights from distributed
systems research. We have tried to come out of denial by
applying the resulting ideas to a new OS architecture, the
multikernel.
An implementation, Barrelfish, is in progress.
References
[1] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Raja-
mony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared memory
computing on networks of workstations. IEEE Computer, 29(2),
1996.
[2] J. Appavoo, D. Da Silva, O. Krieger, M. Auslander, M. Os-
trowski, B. Rosenburg, A. Waterland, R. W. Wisniewski,
J. Xenidis, M. Stumm, and L. Soares. Experience distributing
objects in an SMMP OS. ACM TOCS, 25(3), 2007.
[3] B. N. Bershad, T. E. Anderson, E. D. Lazowska, and H. M. Levy.
User-level interprocess communication for shared memory mul-
tiprocessors. ACM TOCS, 9(2):175–198, 1991.
[4] M. J. Bligh, M. Dobson, D. Hart, and G. Huizenga. Linux on
NUMA systems. In Ottawa Linux Symp., Jul 2004.
[5] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek,
R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and
Z. Zhang. Corey: An operating system for many cores. In Proc.
8th OSDI, Dec 2008.
[6] E. Bugnion, S. Devine, K. Govil, and M. Rosenblum. Disco:
running commodity operating systems on scalable multiproces-
sors. ACM TOCS, 15(4):412–447, 1997.
[7] E. M. Chaves, Jr., P. C. Das, T. J. LeBlanc, B. D. Marsh, and
M. L. Scott. Kernel–Kernel communication in a shared-memory
multiprocessor. Concurrency: Pract. & Exp., 5(3), 1993.
[8] P. Conway and B. Hughes. The AMD Opteron northbridge ar-
chitecture. IEEE Micro, 27(2):10–21, 2007.
[9] J.-P. Deschrevel. The ANSA model for trading and federation.
Architecture Report APM.1005.1, APM Ltd., Jul 1993. http:
//www.ansa.co.uk/ANSATech/93/Primary/100501.pdf.
[10] B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado:
maximizing locality and concurrency in a shared memory mul-
tiprocessor operating system. In Proc. 3rd OSDI, 1999.
[11] H. C. Lauer and R. M. Needham. On the duality of operating
systems structures. In Proc. 2nd Int. Symp. on Operat. Syst.,
IRIA, 1978. reprinted in ACM Operat. Syst. Rev., 13(2), 1979.
[12] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scal-
able synchronization on shared-memory multiprocessors. ACM
TOCS, 9:21–65, 1991.
[13] T. Roscoe, K. Elphinstone, and G. Heiser. Hype and virtue. In
Proc. 11th HotOS, San Diego, CA, USA, May 2007.
[14] A. Schüpbach, S. Peter, A. Baumann, T. Roscoe, P. Barham,
T. Harris, and R. Isaacs. Embracing diversity in the Barrelfish
manycore operating system. In Workshop on Managed Many-
Core Systems, Boston, MA, USA, Jun 2008.
[15] M. M. Swift, B. N. Bershad, and H. M. Levy. Improving the
reliability of commodity operating systems. In Proc. 19th SOSP,
pages 207–222, 2003.
[16] V. Uhlig. Scalability of Microkernel-Based Systems. PhD thesis,
University of Karlsruhe, Germany, Jun 2005.
[17] D. Wentzla and A. Agarwal. Factored operating systems (fos):
The case for a scalable operating system for multicores. Operat.
Syst. Rev., 43(2), Apr 2009.
5
mance relies on a tight mapping between high-level mes-
sages and the movement of cache lines, using techniques
like URPC. At the same time, cache coherence provides
facilities (such as reliable broadcast) which may permit
novel optimizations in distributed algorithms for agree-
ment and the like. To further complicate matters, some
parts of the machine (such as main memory) may be
cache-coherent but others (such as programmable NICs
and GPUs) might not be, and our algorithms should per-
form well over this heterogeneous network.
The illusion of shared memory. Our focus is scal-
ing the OS, and thereby improving performance of OS-
intensive workloads, but the same arguments apply to ap-
plication code, an issue we intend to investigate. Nat-
urally, a multikernel should provide applications with
a shared-memory model if desired. However, while it
could be argued that shared memory is a simpler pro-
gramming model for applications, systems like Disco [6]
have been motivated by the opposite claim: that it is eas-
ier to build a scalable application from nodes (perhaps
small multiprocessors) communicating using messages.
A separate question concerns whether future multicore
designs will remain cache-coherent, or opt instead for a
dierent communication model (such as that used in the
Cell processor). A multikernel seems to oer the best
options here. As in some HPC designs, we may come to
view scalable cache-coherency hardware as an unneces-
sary luxury with better alternatives in software.
Where does the analogy break? There are important
dierences limiting the degree to which distributed algo-
rithms can be applied to OS design. Many arise from the
hardware-based message transport, such as fixed transfer
sizes, no ability to do in-network aggregation, static rout-
ing, and the need to poll for incoming messages. Others
(reliable messaging, broadcast, simpler failure models)
may allow novel optimizations.
Why stop at the edge of the box? Viewing a machine
as a distributed system makes the boundary between ma-
chines (traditionally the network interface) less clear-
cut, and more a question of degree (overhead, latency,
bandwidth, reliability). Some colleagues have therefore
suggested extending a multikernel-like OS across physi-
cal machines, or incorporating further networking ideas
(such as Byzantine fault tolerance) within a machine. We
are cautious (even skeptical) about these ideas, even in
the long term, but they remain intriguing.
Perhaps less radical is to look at how structuring a
single-node OS as a distributed system might make it
more suitable as part of a larger physically distributed
system, in an environment such as a data center.
6 Conclusion
Modern computers are inherently distributed systems,
and we miss opportunities to tackle the OS challenges
of new hardware if we ignore insights from distributed
systems research. We have tried to come out of denial by
applying the resulting ideas to a new OS architecture, the
multikernel.
An implementation, Barrelfish, is in progress.
References
[1] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Raja-
mony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared memory
computing on networks of workstations. IEEE Computer, 29(2),
1996.
[2] J. Appavoo, D. Da Silva, O. Krieger, M. Auslander, M. Os-
trowski, B. Rosenburg, A. Waterland, R. W. Wisniewski,
J. Xenidis, M. Stumm, and L. Soares. Experience distributing
objects in an SMMP OS. ACM TOCS, 25(3), 2007.
[3] B. N. Bershad, T. E. Anderson, E. D. Lazowska, and H. M. Levy.
User-level interprocess communication for shared memory mul-
tiprocessors. ACM TOCS, 9(2):175–198, 1991.
[4] M. J. Bligh, M. Dobson, D. Hart, and G. Huizenga. Linux on
NUMA systems. In Ottawa Linux Symp., Jul 2004.
[5] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek,
R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and
Z. Zhang. Corey: An operating system for many cores. In Proc.
8th OSDI, Dec 2008.
[6] E. Bugnion, S. Devine, K. Govil, and M. Rosenblum. Disco:
running commodity operating systems on scalable multiproces-
sors. ACM TOCS, 15(4):412–447, 1997.
[7] E. M. Chaves, Jr., P. C. Das, T. J. LeBlanc, B. D. Marsh, and
M. L. Scott. Kernel–Kernel communication in a shared-memory
multiprocessor. Concurrency: Pract. & Exp., 5(3), 1993.
[8] P. Conway and B. Hughes. The AMD Opteron northbridge ar-
chitecture. IEEE Micro, 27(2):10–21, 2007.
[9] J.-P. Deschrevel. The ANSA model for trading and federation.
Architecture Report APM.1005.1, APM Ltd., Jul 1993. http:
//www.ansa.co.uk/ANSATech/93/Primary/100501.pdf.
[10] B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado:
maximizing locality and concurrency in a shared memory mul-
tiprocessor operating system. In Proc. 3rd OSDI, 1999.
[11] H. C. Lauer and R. M. Needham. On the duality of operating
systems structures. In Proc. 2nd Int. Symp. on Operat. Syst.,
IRIA, 1978. reprinted in ACM Operat. Syst. Rev., 13(2), 1979.
[12] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scal-
able synchronization on shared-memory multiprocessors. ACM
TOCS, 9:21–65, 1991.
[13] T. Roscoe, K. Elphinstone, and G. Heiser. Hype and virtue. In
Proc. 11th HotOS, San Diego, CA, USA, May 2007.
[14] A. Schüpbach, S. Peter, A. Baumann, T. Roscoe, P. Barham,
T. Harris, and R. Isaacs. Embracing diversity in the Barrelfish
manycore operating system. In Workshop on Managed Many-
Core Systems, Boston, MA, USA, Jun 2008.
[15] M. M. Swift, B. N. Bershad, and H. M. Levy. Improving the
reliability of commodity operating systems. In Proc. 19th SOSP,
pages 207–222, 2003.
[16] V. Uhlig. Scalability of Microkernel-Based Systems. PhD thesis,
University of Karlsruhe, Germany, Jun 2005.
[17] D. Wentzla and A. Agarwal. Factored operating systems (fos):
The case for a scalable operating system for multicores. Operat.
Syst. Rev., 43(2), Apr 2009.
5
Sign up today - FREE
Mendeley saves you time finding and organizing research. Learn more
- All your research in one place
- Add and import papers easily
- Access it anywhere, anytime
Start using Mendeley in seconds!
Readership Statistics
38 Readers on Mendeley
by Discipline
3% Engineering
by Academic Status
55% Ph.D. Student
21% Student (Master)
8% Other Professional
by Country
37% United States
13% China
11% France


