QuakeTM : Parallelizing a Complex Sequential Application Using Transactional Memory
- ISBN: 9781605584980
- DOI: 10.1145/1542275.1542298
Vladimir Gajinov†∗ Ferad Zyulkyarov†∗ Osman S. Unsal† Adrian Cristal†
Eduard Ayguade† Tim Harris‡ Mateo Valero†∗
†Barcelona Supercomputing Center ∗Universitat Politecnica de Catalunya ‡Microsoft Research Cambridge
vladimir.gajinov@bsc.es ferad.zyulkyarov@bsc.es osman.unsal@bsc.es adrian.cristal@bsc.es
eduard.ayguade@bsc.es tharris@microsoft.com mateo.valero@bsc.es
ABSTRACT
“Is transactional memory useful?” is the question that cannot be
answered until we provide substantial applications that can
evaluate its capabilities. While existing TM applications can
partially answer the above question, and are useful in the sense
that they provide a first-order TM experimentation framework,
they serve only as a proof of concept and fail to make a
conclusive case for wide adoption by the general computing
community.
This paper presents QuakeTM, a multiplayer game server: a
complex real-life TM application that was parallelized from the
sequential version with TM-specific considerations in mind.
QuakeTM consists of 27,600 lines of code spread across 49 files
and exhibits irregular parallelism for which a task parallel model
fits well. We provide a coarse-grained TM implementation
characterized by eight large transactional blocks as well as a
fine-grained implementation which consists of 58 different critical
sections and compare these two approaches. Although QuakeTM
scales, we show that more effort is needed to
decrease the overhead and the abort rate of current software
transactional memory systems to achieve good performance. We
give insights into development challenges, suggest techniques to
solve them and provide extensive analysis of the transactional
behavior of QuakeTM, with an emphasis and discussion of the
TM promise of making parallel programming easier.
Categories and Subject Descriptors: D.1.3
[Programming Techniques]: Concurrent Programming – Parallel
Programming.
General Terms: Design, Experimentation, Performance.
Keywords: Game Server, Transactional Memory
1. INTRODUCTION
Recently, processor manufacturers have turned sharply
away from increasing single-core frequency and complexity. Low
returns from instruction level parallelism (ILP) and problems with
power/heat density have led to the appearance of multi-core
processors that leverage thread level parallelism (TLP). In this
new era of multi-core architectures, the coordination of the work
done by the multiple threads that cooperate in the parallel
execution is one of the challenging issues both in terms of
programming productivity and execution performance.
Transactional memory (TM) is a technology that may help here,
by aiming to provide the performance of fine-grained locking
with the ease-of-programming of coarse-grained critical sections.
In this paper we assess the extent to which this is true of current
TM implementations, based on code descriptions and examples as
well as through performance evaluation.
As a case study we started from a sequential version of Quake, a
complex multi-player game. Using OpenMP and software
transactional memory (STM) we built QuakeTM, a parallel
version which consists of 27,600 lines of code spread across 49
files. Developed in 10 man-months, QuakeTM exhibits irregular
parallelism and long transactions contained within eight different
atomic blocks with large read and write sets.
Our intention was not to pursue performance per se, but to
examine whether or not it is possible to achieve good results with
a coarse-grained parallelization approach. This decision was
driven by one of the hopes for TM, to make parallel programming
easier by abstracting away the complexities of using fine-grained
locking, while still achieving good scalability. When parallelizing
an application from scratch using TM, this kind of coarse-grained
approach is likely to be popular with programmers. Consequently,
this approach needs to be tested on a highly complex application
in order to see how well it works in practice.
This paper makes the following contributions:
• We describe how we developed QuakeTM and discuss the
challenges we encountered.
• We show that our implementation scales reasonably well,
despite the use of coarse-grained transactions. However, we
show that this scalability is unable to compensate for the
high overhead and abort rate of the software transactional
memory system.
• Further on, we have adapted the fine-grained TM
implementation described in our previous work on Atomic
Quake [20] in order to compare these two different
parallelization approaches.
• As a pleasant side effect of our performance optimization
effort, we developed a simple mechanism, which we call
ReachPoints, that could be useful to discover and isolate
TM-related performance problems.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
ICS’09, June 8–12, 2009, Yorktown Heights, New York, USA.
Copyright 2009 ACM 978-1-60558-498-0/09/06...$5.00.
The remainder of this paper is organized as follows: In Section 2,
we comment on related work. In Section 3, we describe the
structure of the sequential Quake application. In Section 4, we
describe the parallelization process and the development of
QuakeTM. Section 5 details the evaluation environment and
Section 6 follows with the results. In Section 7 we discuss future
work and we conclude in Section 8.
2. RELATED WORK
Early TM research used micro-benchmarks to demonstrate the
potential of the new programming paradigm. Subsequently, sets
of kernel applications have been developed, such as Lee-TM [5],
Delaunay mesh refinement and/or agglomerative clustering
[9][11][17] and STMBench7 [8]. The total code size of these
kernels ranges from 800 lines for Lee-TM to 5000 lines for
STMBench7, but the common fact is that the total size of critical
sections doesn’t exceed a couple of hundred lines of code.
There are several benchmark suites of programs using TM. The
Haskell STM benchmark suite [15] consists of nine Haskell
applications of different sizes, which target different aspects of an
underlying TM system. While it is good for its domain, the
Haskell STM benchmark suite is not directly applicable to other
languages. The STAMP benchmark suite [13] consists of eight
applications that cover a variety of domains and exhibit different
characteristics in terms of transaction lengths, read and write set
sizes and amounts of contention. The downside of these
applications, when used with STM, is that they were
manually optimized using prior application-level knowledge,
which enabled the authors to manually place the
optimal number of read/write barriers in the code. This does not
help demonstrate the primary goal of TM, which is to make
multithreaded programming easier. If programmers are required to
manually instrument the code in order to achieve basic
performance then TM is not the solution. As Dalessandro et al. [6]
pointed out, library interfaces can remain a useful tool
for systems researchers, but application programmers are going to
need language and compiler support.
In general, most previous TM applications and benchmarks were
either derived automatically from lock-based parallel versions
(that is, replacing lock-based critical sections with transactions),
or if they were developed from sequential versions then the
resulting code was somewhat limited in size and complexity
(which limits the benefits of detailing the development challenges
and the programmer effort). Therefore, there is a clear need to
develop a substantial TM application from the ground up while
extensively detailing the parallelization process and the
challenges involved. This is one of the main contributions of this
paper.
In recent work [20], we developed a transactional version of
Quake from an existing lock-based version [2][3]. We
encountered a different set of challenges when parallelizing the
sequential version of Quake with TM. Comparing that approach
with the one in this paper gives us a new perspective on how the
use of “atomic” blocks in new parallel code might compare with
their use as a replacement for lock-based critical sections. We
discuss these two issues further in Section 4.
3. QUAKE DESCRIPTION
Quakeworld is the multi-player mode of Quake I, the first-person
shooter game released under the GNU General Public License by
id Software. It is a sequential application, built as a client-server
architecture, where the server maintains the game world and
handles coordination between clients, while the clients perform
graphics updates and implement user-interface operations.
The server executes an infinite loop, where each iteration
performs the calculation of a single frame. The frame execution
algorithm is presented in Figure 1(a). The server blocks on the
select system call waiting for client requests. If requests are
present on the receiving port, it starts the execution of the new
frame. It is possible to distinguish three stages of frame execution:
world physics update (P), request receiving and processing (R)
and reply stage (S). Upon the end of execution of all three stages
the server frame ends and the process is repeated. Generally, the
server will send replies only to clients which were active in the
current frame, namely those who have sent a request. All replies
are sent after all requests have been processed. This clear
separation of the frame stages simplifies the parallelization, as we
present later in the paper.
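As an illustration only, the three-stage frame structure described above can be sketched in Python. This is not the Quake server code: the names World, update_physics, process_request, build_reply and run_frame are hypothetical, the game state is reduced to a tick counter and one number per client, and the blocking select call is replaced by a list of requests that are assumed to be already pending when the frame starts.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Toy game world (assumption): a tick counter and one position per client."""
    tick: int = 0
    positions: dict = field(default_factory=dict)


def update_physics(world):
    # Stage P: advance the world state once per frame.
    world.tick += 1


def process_request(world, client_id, move):
    # Stage R: apply a single client's move command to the world.
    world.positions[client_id] = world.positions.get(client_id, 0) + move


def build_reply(world, client_id):
    # Stage S: report the state that this client needs.
    return (client_id, world.tick, world.positions[client_id])


def run_frame(world, requests):
    """Run one server frame over the requests pending when the frame started."""
    update_physics(world)                       # P
    active = []
    for client_id, move in requests:            # R
        process_request(world, client_id, move)
        if client_id not in active:
            active.append(client_id)
    # S: replies are built only after every request has been processed,
    # and only for clients that sent a request in this frame.
    return [build_reply(world, client_id) for client_id in active]


world = World(positions={"a": 0, "b": 0, "c": 0})
replies = run_frame(world, [("a", 2), ("b", -1), ("a", 3)])
print(replies)
```

In this sketch client "c" sends no request and therefore receives no reply, and no reply is built until the last request has been applied, which mirrors the separation between the request and reply stages.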
The Quake game world is a polygonal representation of the 3D
virtual space in which all objects, including players, are referred
to as entities. Each entity has its own specific characteristics and
actions it can perform. During the update, the server will send
information only for those entities which are of interest to the
client. Nevertheless, the server has to simulate and model not
only the players’ actions but also the effects induced by these
actions. Thus, server processing is a complex, compute-intensive
task whose cost increases superlinearly with the number of players [3].
3.1 Map Description
A map of the Quake world is stored in a file which holds
the binary space partition (BSP) representation of the 3D world
with all the details needed to draw and position the objects in the
world [7][12]. The level of detail contained within the BSP tree is
large; therefore BSP trees are hard to maintain for dynamic
scenes. If the server wants to generate a quick list of the objects
that an entity may interact with, traversing the BSP tree is
inefficient, and since this is a common operation involved in each
move command, the server constructs and maintains a secondary
binary-tree structure, called an areanode tree. This is a 2D
representation of the BSP tree, constructed during server
initialization by dividing the 3D volume in the x-y plane. Figure
1(b) demonstrates the building process. Each areanode has an
associated list of objects contained within the space defined by
that areanode. When an object is moved, it is necessary to update
the areanode tree to reflect the new position of the object. This is
done by removing the object from the original list, and inserting it
into the list of the areanode that corresponds to the destination of
the object.
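A minimal Python sketch of such an areanode tree follows, under stated assumptions. The names AreaNode, build, find_node and AreaWorld are hypothetical. Two details are not given in the text and are assumed here: each node is split at the midpoint of its longer x-y side, and an object whose bounding box straddles a split plane is kept in the list of that internal node. The tree depth is a free parameter.

```python
from dataclasses import dataclass, field


@dataclass
class AreaNode:
    mins: tuple                 # (x, y) lower corner of the covered area
    maxs: tuple                 # (x, y) upper corner of the covered area
    axis: int = -1              # -1 for a leaf, 0 to split on x, 1 to split on y
    dist: float = 0.0           # position of the split plane along `axis`
    children: tuple = ()        # (lower half, upper half)
    objects: set = field(default_factory=set)


def build(mins, maxs, depth, max_depth):
    """Recursively divide the x-y plane (assumed: midpoint of the longer side)."""
    node = AreaNode(mins, maxs)
    if depth == max_depth:
        return node
    axis = 0 if (maxs[0] - mins[0]) > (maxs[1] - mins[1]) else 1
    dist = 0.5 * (mins[axis] + maxs[axis])
    lower_maxs = list(maxs)
    lower_maxs[axis] = dist
    upper_mins = list(mins)
    upper_mins[axis] = dist
    node.axis, node.dist = axis, dist
    node.children = (build(mins, tuple(lower_maxs), depth + 1, max_depth),
                     build(tuple(upper_mins), maxs, depth + 1, max_depth))
    return node


def find_node(root, omins, omaxs):
    """Descend to the smallest areanode that fully contains the bounding box."""
    node = root
    while node.axis != -1:
        if omins[node.axis] > node.dist:
            node = node.children[1]
        elif omaxs[node.axis] < node.dist:
            node = node.children[0]
        else:
            break               # the box straddles the split plane
    return node


class AreaWorld:
    def __init__(self, mins, maxs, max_depth=2):
        self.root = build(mins, maxs, 0, max_depth)
        self.where = {}         # object name -> areanode currently holding it

    def move(self, name, omins, omaxs):
        # Remove the object from its original list, then insert it into the
        # list of the areanode that corresponds to its destination.
        old = self.where.get(name)
        if old is not None:
            old.objects.discard(name)
        new = find_node(self.root, omins, omaxs)
        new.objects.add(name)
        self.where[name] = new
        return new


area = AreaWorld((0, 0), (100, 100))
a = area.move("player", (10, 10), (20, 20))
b = area.move("player", (60, 70), (70, 80))
c = area.move("door", (40, 10), (60, 20))
```

Here the second move unlinks "player" from leaf a and links it into leaf b, while "door" straddles the x split of the lower half and so stays in that internal node. Each move writes to shared per-node lists, which is the kind of update that has to be protected once requests are processed in parallel.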