x86-TSO : A Rigorous and Usable Programmer ’ s Model for x86 Multiprocessors
- ISSN: 00010782
- DOI: 10.1145/1785414.1785443
Abstract
Exploiting the multiprocessors that have recently become ubiquitous requires high-performance and reliable concurrent systems code, for concurrent data structures, operating system kernels, synchronization libraries, compilers, and so on. However, concurrent programming, which is always challenging, is made much more so by two problems. First, real multiprocessors typically do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have relaxed memory models, varying in subtle ways between processor families, in which different hardware threads may have only loosely consistent views of a shared memory. Second, the public vendor architectures, supposedly specifying what programmers can rely on, are often in ambiguous informal prose (a particularly poor medium for loose specifications), leading to widespread confusion.
x86-TSO : A Rigorous and Usable Programmer ’ s Model for x86 Multiprocessors
x86-TSO: A Rigorous and Usable Programmer’s Model for
x86 Multiprocessors
Peter Sewell
University of Cambridge
Susmit Sarkar
University of Cambridge
Scott Owens
University of Cambridge
Francesco Zappa Nardelli
INRIA
Magnus O. Myreen
University of Cambridge
http://www.cl.cam.ac.uk/users/pes20/weakmemory
ABSTRACT
Exploiting the multiprocessors that have recently become
ubiquitous requires high-performance and reliable concur-
rent systems code, for concurrent data structures, operat-
ing system kernels, synchronisation libraries, compilers, and
so on. However, concurrent programming, which is always
challenging, is made much more so by two problems. First,
real multiprocessors typically do not provide the sequentially
consistent memory that is assumed by most work on seman-
tics and verification. Instead, they have relaxed memory
models, varying in subtle ways between processor families,
in which different hardware threads may have only loosely
consistent views of a shared memory. Second, the public
vendor architectures, supposedly specifying what program-
mers can rely on, are often in ambiguous informal prose (a
particularly poor medium for loose specifications), leading
to widespread confusion.
In this paper we focus on x86 processors. We review sev-
eral recent Intel and AMD specifications, showing that all
contain serious ambiguities, some are arguably too weak to
program above, and some are simply unsound with respect
to actual hardware. We present a new x86-TSO program-
mer’s model that, to the best of our knowledge, suffers from
none of these problems. It is mathematically precise (rig-
orously defined in HOL4) but can be presented as an intu-
itive abstract machine which should be widely accessible to
working programmers. We illustrate how this can be used to
reason about the correctness of a Linux spinlock implemen-
tation and describe a general theory of data-race-freedom for
x86-TSO. This should put x86 multiprocessor system build-
ing on a more solid foundation; it should also provide a basis
for future work on verification of such systems.
1. INTRODUCTION
Multiprocessor machines, with many processors acting on
a shared memory, have been developed since the 1960s; they
are now ubiquitous. Meanwhile, the difficulty of program-
ming concurrent systems has motivated extensive research
on programming language design, semantics, and verifica-
tion, from semaphores and monitors to program logics, soft-
ware model checking, and so forth. This work has almost al-
ways assumed that concurrent threads share a single sequen-
tially consistent memory [21], with their reads and writes
interleaved in some order. In fact, however, real multipro-
cessors use sophisticated techniques to achieve high perfor-
mance: store buffers, hierarchies of local cache, speculative
execution, etc. These optimisations are not observable by
sequential code, but in multithreaded programs different
threads may see subtly different views of memory; such ma-
chines exhibit relaxed, or weak, memory models [6, 17, 19,
7].
For a simple example, consider the following assembly lan-
guage program (SB) for modern Intel or AMD x86 multipro-
cessors: given two distinct memory locations x and y (ini-
tially holding 0), if two processors respectively write 1 to
x and y and then read from y and x (into register EAX on
processor 0 and EBX on processor 1), it is possible for both
to read 0 in the same execution. It is easy to check that this
result cannot arise from any interleaving of the reads and
writes of the two processors; modern x86 multiprocessors do
not have a sequentially consistent semantics.
SB
Proc 0 Proc 1
MOV [x]←1 MOV [y]←1
MOV EAX←[y] MOV EBX←[x]
Allowed Final State: Proc 0:EAX=0 ∧ Proc 1:EBX=0
Microarchitecturally, one can view this particular example
as a visible consequence of store buffering: if each proces-
sor effectively has a FIFO buffer of pending memory writes
(to avoid the need to block while a write completes), then
the reads from y and x could occur before the writes have
propagated from the buffers to main memory.
Other families of multiprocessors, dating back at least to
the IBM 370, and including ARM, Itanium, POWER, and
SPARC, also exhibit relaxed-memory behaviour. Moreover,
there are major and subtle differences between different pro-
cessor families (arising from their different internal design
choices): in the details of exactly what non-sequentially-
consistent executions they permit, and of what memory bar-
rier and synchronisation instructions they provide to let the
programmer regain control.
For any of these processors, relaxed-memory behaviour ex-
acerbates the difficulties of writing concurrent software, as
systems programmers cannot reason, at the level of abstrac-
tion of memory reads and writes, in terms of an intuitive
concept of global time.
Still worse, while some vendors’ architectural specifica-
tions clearly define what they guarantee, others do not,
despite the extensive previous research on relaxed memory
models. We focus in this paper on x86 processors. In Sec-
tion 2 we introduce the key examples and discuss several
vendor specifications, showing that they all leave key ques-
tions ambiguous, some give unusably weak guarantees, and
1
processors do exhibit.
For there to be any hope of building reliable multipro-
cessor software, systems programmers need to understand
what relaxed-memory behaviour they can rely on, but at
present that understanding exists only in folklore, not in
clear public specifications. To remedy this, we aim to pro-
duce mathematically precise (but still appropriately loose)
programmer’s models for real-world multiprocessors, to in-
form the intuition of systems programmers, to provide a
sound foundation for rigorous reasoning about multiproces-
sor programs, and to give a clear correctness criterion for
hardware. In Section 3 we describe a simple x86 memory
model, x86-TSO [27]. In contrast to those vendor specifi-
cations it is unambiguous, defined in rigorous mathematics,
but it is also accessible, presented in an operational abstract-
machine style. To the best of our knowledge it is consistent
with the behaviour of actual processors. We consider the rel-
evant vendor litmus tests in Section 3.2 and describe some
empirical test results in Section 3.3.
Relaxed memory behaviour is particularly critical for low-
level systems code: synchronisation libraries, concurrent
data structure libraries, language runtime systems, compil-
ers for concurrent languages, and so on. To reason (even
informally) about such code, such as the implementation of
an OS mutual exclusion lock, one would necessarily depend
on the details of a specific model. Higher-level application
code, on the other hand, should normally be oblivious to the
underlying processor memory model. The usual expectation
is that such code is in some sense race free, with all access to
shared memory (except for accesses within the library code)
protected by locks or clearly identified as synchronisation ac-
cesses. Most memory models are designed with the intention
that such race-free code behaves as if it were executing on a
sequentially consistent machine. In Section 4 we describe an
implementation of spin locks for x86, from one version of the
Linux kernel, and discuss informally why it is correct with
respect to x86-TSO. In Section 5 we define a precise notion of
data race for x86 and discuss results showing that programs
that use spin locks but are otherwise race-free (except for
the races within the lock implementation) do indeed behave
as if executing on a sequentially consistent machine [26].
To support formal reasoning about programs, a memory
model must be integrated with a semantics for machine in-
structions (a problem which has usually been neglected in
the relaxed memory literature). In previous work [31, §3]
we describe a semantics for core x86 instructions, with sev-
eral innovations. We take care not to over-sequentialise the
memory accesses within each instruction, parameterising the
instruction semantics over parallel and sequential combina-
tors. A single definition, with all the intricacies of flag-
setting, addressing modes, etc., can then be used to generate
both an event-based semantics that can be integrated with
memory models, and a state-based semantics for sequen-
tial programs; the latter enables us to test the semantics
against implementations. We also build an instruction de-
coding function, directly from the vendor documentation, to
support reasoning about concrete machine code.
The intended scope of x86-TSO is typical user code and
most kernel code: we cover programs using coherent write-
back memory, without exceptions, misaligned or mixed-
size accesses, ‘non-temporal’ operations (e.g. MOVNTI), self-
modifying code, or page-table changes. Within this domain,
and together with our earlier instruction semantics, x86-
TSO thus defines a complete semantics of programs.
Relaxed memory models play an important role also in
the design of high-level concurrent languages such as Java
or C++0x, where programs are subject not just to the mem-
ory model of the underlying processor but also to reorder-
ings introduced by compiler optimisations. The Java Mem-
ory Model [24] attempts to ensure that data-race free pro-
grams are sequentially consistent; all programs satisfy mem-
ory safety/security properties; and common compiler opti-
misations are sound. Unfortunately, as shown by Sˇevcˇ´ık and
Aspinall [33], the last goal is not met. In the future, we hope
that it will be possible to prove correctness of implementa-
tions of language-level memory models above the models
provided by real-world processors; ensuring that both are
precisely and clearly specified is a first step towards that
goal.
2. ARCHITECTURE SPECIFICATIONS
To describe what programmers can rely on, processor
vendors document architectures. These are loose specifica-
tions, claimed to cover a range of past and future processor
implementations, which should specify processor behaviour
tightly enough to enable effective programming, but with-
out unduly constraining future processor designs. For some
architectures the memory-model aspects of these specifica-
tions are expressed in reasonably precise mathematics, as
in the normative Appendix K of the SPARC v.8 specifica-
tion [2]. For x86, however, the vendor architecture speci-
fications are informal prose documents. Informal prose is
a poor medium for loose specification of subtle properties,
and, as we shall see, such documents are almost inevitably
ambiguous and sometimes wrong. Moreover, one cannot test
programs above such a vague specification (one can only run
programs on particular actual processors), and one cannot
use them as criteria for testing processor implementations.
In this section we review the informal-prose Intel and AMD
x86 specifications: the Intel 64 and IA-32 Architectures Soft-
ware Developer’s Manual (SDM) [5] and the AMD64 Archi-
tecture Programmer’s Manual (APM) [3]. There have been
several versions of these, some differing radically; we con-
trast them with each other, and with what we have discov-
ered of the behaviour of actual processors. In the process
we introduce the key discriminating examples.
2.1 pre-IWP (before Aug. 2007)
Early revisions of the Intel SDM (e.g. rev. 22, Nov. 2006)
gave an informal-prose model called ‘processor ordering’,
unsupported by any examples. It is hard to see precisely
what this prose means, especially without additional knowl-
edge or assumptions about the microarchitecture of particu-
lar implementations. The uncertainty about x86 behaviour
that at least some systems programmers had about earlier
IA-32 processors can be gauged from an extensive discus-
sion about the correctness of a proposed optimisation to a
Linux spinlock implementation [1]. The discussion is largely
in microarchitectural terms, not just in terms of the speci-
fied architecture, and seems to have been resolved only with
input from Intel staff. We return to this optimisation in Sec-
tion 4, where we can explain why it is sound with respect to
x86-TSO.
2.2 IWP/AMD3.14/x86-CC
2
Sign up today - FREE
Mendeley saves you time finding and organizing research. Learn more
- All your research in one place
- Add and import papers easily
- Access it anywhere, anytime


