Parallel Monte Carlo Simulation Of Systems With Local Interactions
Available from
Andrew Jackson's profile on Mendeley.
Page 1
Parallel Monte Carlo Simulation Of Systems With Local Interactions
This line only printed with preprint option
Parallel Monte Carlo Simulation For Local Interactions
A.N. Jackson∗
SUPA, School of Physics, University of Edinburgh,
Edinburgh EH9 3JZ, Scotland, United Kingdom
Abstract
We present a new parallelization scheme for Monte Carlo simulation. The algorithm employs
an elegant orchestration and message passing scheme in order to optimally utilize the inherent
parallelism in the Markov chain of systems with short-range interactions. The scheme has the
advantage that absolutely no changes are made to the overall Markov chain in the parallel form of
the algorithm, and the multi-processor simulation can be mapped directly onto the serial version.
We apply the algorithm to the hard sphere fluid, for which a new algorithm for preparing dense
random packings is presented. We benchmark the parallel code, and present perliminary results
examining the structure of the hard-sphere fluids.
∗Electronic address: A.N.Jackson@ed.ac.uk; URL: http://code.google.com/p/mcsimpi/
1
Parallel Monte Carlo Simulation For Local Interactions
A.N. Jackson∗
SUPA, School of Physics, University of Edinburgh,
Edinburgh EH9 3JZ, Scotland, United Kingdom
Abstract
We present a new parallelization scheme for Monte Carlo simulation. The algorithm employs
an elegant orchestration and message passing scheme in order to optimally utilize the inherent
parallelism in the Markov chain of systems with short-range interactions. The scheme has the
advantage that absolutely no changes are made to the overall Markov chain in the parallel form of
the algorithm, and the multi-processor simulation can be mapped directly onto the serial version.
We apply the algorithm to the hard sphere fluid, for which a new algorithm for preparing dense
random packings is presented. We benchmark the parallel code, and present perliminary results
examining the structure of the hard-sphere fluids.
∗Electronic address: A.N.Jackson@ed.ac.uk; URL: http://code.google.com/p/mcsimpi/
1
Page 2
I. INTRODUCTION
As access to multi-core processors and massively parallel hardware becomes more com-
mon, while raw CPU clock-speed continues to plateau, the need to exploit concurrent re-
sources for scientific modelling becomes more pressing. In condensed matter (and related
fields) atomistic simulation is widely used to model the emergent behaviour of large systems
of particles. The two main techniques employed for this type of modelling are Molecular
Dynamics (MD) and Monte Carlo simulation (MC) [???]. The MD approach has already
been successfully extended to concurrent hardware, as the time-consuming force calculation
is straightforward to parallelize, and this has inspired the development of a wide range of
concurrent algorithms for different types of MD calculations [1]. Unfortunately, there has
been less success in finding efficient ways to exploit parallel hardware for MC simulation [1].
In some cases, it is sufficient to employ trivial parallelism by running multiple instances
of the same MC simulation concurrently. Each is initialised with different random number
generator (RNG) seeds, allowing the parallel hardware to be exploited in order to signif-
icantly increase the statistical accuracy of the estimated observables [???]. A number of
related variance-reduction techniques are linked to this kind of “embarrassingly parallel” ap-
proach. For example, multiple future states and their acceptance probabilities can calculated
in parallel, in order to explore the space of possible Markov chains more effectively [2].
Despite their simplicity, these concurrency schemes are often preferable to full paral-
lelization – if the behaviour of interest can be modelled by a system small enough to run
on one Processing Element (PE), then running N independent instances of the same system
is always going to give estimates for observables of higher statistical accuracy than for one
steadily-evolving system that is N times larger. Nevertheless, these smaller simulations are
not always sufficient, and we must also be able to model very large systems, composed of so
many particles that the entire simulation cannot fit into the memory resources available to
a single processing unit. For long-range interactions, like Columbic potentials, the compu-
tational cost of determining the force on each particle is so large that it is best to follow the
example of MD and parallelise the force calculation itself, while the actual time evolution
proceeds in a serial fashion [???]. However, many interesting model systems employ local
Hamiltonians, from simple lattice systems like the Ising model, through to short-range pair-
potentials like the hard sphere, Yukawa and Lennard-Jones interactions. If we are concerned
2
As access to multi-core processors and massively parallel hardware becomes more com-
mon, while raw CPU clock-speed continues to plateau, the need to exploit concurrent re-
sources for scientific modelling becomes more pressing. In condensed matter (and related
fields) atomistic simulation is widely used to model the emergent behaviour of large systems
of particles. The two main techniques employed for this type of modelling are Molecular
Dynamics (MD) and Monte Carlo simulation (MC) [???]. The MD approach has already
been successfully extended to concurrent hardware, as the time-consuming force calculation
is straightforward to parallelize, and this has inspired the development of a wide range of
concurrent algorithms for different types of MD calculations [1]. Unfortunately, there has
been less success in finding efficient ways to exploit parallel hardware for MC simulation [1].
In some cases, it is sufficient to employ trivial parallelism by running multiple instances
of the same MC simulation concurrently. Each is initialised with different random number
generator (RNG) seeds, allowing the parallel hardware to be exploited in order to signif-
icantly increase the statistical accuracy of the estimated observables [???]. A number of
related variance-reduction techniques are linked to this kind of “embarrassingly parallel” ap-
proach. For example, multiple future states and their acceptance probabilities can calculated
in parallel, in order to explore the space of possible Markov chains more effectively [2].
Despite their simplicity, these concurrency schemes are often preferable to full paral-
lelization – if the behaviour of interest can be modelled by a system small enough to run
on one Processing Element (PE), then running N independent instances of the same system
is always going to give estimates for observables of higher statistical accuracy than for one
steadily-evolving system that is N times larger. Nevertheless, these smaller simulations are
not always sufficient, and we must also be able to model very large systems, composed of so
many particles that the entire simulation cannot fit into the memory resources available to
a single processing unit. For long-range interactions, like Columbic potentials, the compu-
tational cost of determining the force on each particle is so large that it is best to follow the
example of MD and parallelise the force calculation itself, while the actual time evolution
proceeds in a serial fashion [???]. However, many interesting model systems employ local
Hamiltonians, from simple lattice systems like the Ising model, through to short-range pair-
potentials like the hard sphere, Yukawa and Lennard-Jones interactions. If we are concerned
2
Page 3
with physical dynamics, then these last two potentials are best treated by MD [???]. But,
if we are interested only in the long-time behaviour, then MC can offer a more rapid route
to equilibrium [???]. Furthermore, hard spheres and other hard-core systems are difficult to
simulate via MD, and the ballistic algorithm does not scale well to large systems unless the
hard-core constraint is relaxed [???].
One approach to parallelizing a Monte Carlo simulation applies when the order of at-
tempted operations can be simplified, adopting a more organized scheme than the usual
random move selection. For example, when simulating the Ising model, it is possible to use
typewriter [???] or red/black update schemes [???]. Similar schemes have been introduced
for off-lattice systems, like the parallel typewriter update scheme of Heffelfinger [Heff1995].
This type of scheme is relatively simple to parallelize, and indeed can gain almost perfect
speed-up [???], but the ordered nature of the updates has the potential to introduce spu-
rious correlations into the phase-space dynamics and to impede equilibration. Although
the chances of any persistent circulation of probability though the phase space should even
out to zero over time, it is difficult to prove this, and even more difficult to estimate the
time-scale over which the system will fully equilibrate. In some cases, the choice of update
scheme can actually alter the results of the simulation, as in the case the Ising model which
can become demonstrably non-ergodic under ordered updates, circulating through a small
selection of microstates deterministically instead of sampling the phase space fully [???].
The most advanced parallel Monte Carlo scheme in the literature is the random ac-
tive/passive domains formulation presented by Uhlherr et al [3, 4]. By randomly moving
and flipping the fixed and active domains, as opposed to using the regular active-domain
exchange of the parallel typewriter scheme [Heff1995], their algorithm ensures that no cir-
culation of state-space probability flux can persist for long times. The correctness proof
provided for their algorithm shows that any and all introduced fluxes will cancel out even-
tually, via the integration over all domain positions [3, 4]. Nevertheless, this algorithm still
breaks strict detailed balance, as every time the domain walls are changed and the particle
that was last moved is switched from an active to a passive domain, the reverse move is
suddenly rendered impossible. Indeed, as the authors note, if the run is short compared to
the domain change time, then is is possible for the motion of the domain wall to introduce
some transient correlations into the system, and it is difficult to rule out the potential for
ergodic problems, if only for pathological cases.
3
if we are interested only in the long-time behaviour, then MC can offer a more rapid route
to equilibrium [???]. Furthermore, hard spheres and other hard-core systems are difficult to
simulate via MD, and the ballistic algorithm does not scale well to large systems unless the
hard-core constraint is relaxed [???].
One approach to parallelizing a Monte Carlo simulation applies when the order of at-
tempted operations can be simplified, adopting a more organized scheme than the usual
random move selection. For example, when simulating the Ising model, it is possible to use
typewriter [???] or red/black update schemes [???]. Similar schemes have been introduced
for off-lattice systems, like the parallel typewriter update scheme of Heffelfinger [Heff1995].
This type of scheme is relatively simple to parallelize, and indeed can gain almost perfect
speed-up [???], but the ordered nature of the updates has the potential to introduce spu-
rious correlations into the phase-space dynamics and to impede equilibration. Although
the chances of any persistent circulation of probability though the phase space should even
out to zero over time, it is difficult to prove this, and even more difficult to estimate the
time-scale over which the system will fully equilibrate. In some cases, the choice of update
scheme can actually alter the results of the simulation, as in the case the Ising model which
can become demonstrably non-ergodic under ordered updates, circulating through a small
selection of microstates deterministically instead of sampling the phase space fully [???].
The most advanced parallel Monte Carlo scheme in the literature is the random ac-
tive/passive domains formulation presented by Uhlherr et al [3, 4]. By randomly moving
and flipping the fixed and active domains, as opposed to using the regular active-domain
exchange of the parallel typewriter scheme [Heff1995], their algorithm ensures that no cir-
culation of state-space probability flux can persist for long times. The correctness proof
provided for their algorithm shows that any and all introduced fluxes will cancel out even-
tually, via the integration over all domain positions [3, 4]. Nevertheless, this algorithm still
breaks strict detailed balance, as every time the domain walls are changed and the particle
that was last moved is switched from an active to a passive domain, the reverse move is
suddenly rendered impossible. Indeed, as the authors note, if the run is short compared to
the domain change time, then is is possible for the motion of the domain wall to introduce
some transient correlations into the system, and it is difficult to rule out the potential for
ergodic problems, if only for pathological cases.
3
Page 4
All of these domain schemes introduce a new time-scale – the time over which the changes
and exchange of the active and passive domains can be ignored. While such schemes are
clearly correct given sufficient run-time, and appear to give good results in practice, it is
still desirable to attempt a parallelization of a Monte Carlo Markov chain algorithm that
does not introduce any extra correlations to the dynamics. In this paper, we present a
new algorithm that optimally exploits the inherent parallelism present in a Monte Carlo
simulation, without altering the overall Markov chain in any way. This algorithm contains
no further approximations or time-scales over and above those in the serial version of the
code, and the Markov sequence is executed just as it would be on a serial machine.
We use the Ising model as a simple abstract example, allowing the overall approach to be
elucidated in a simple one-dimensional system. A concrete implementation of this algorithm
is presented for the hard sphere fluid, along with a new algorithm for generating the initial
particle distribution. This simple model system is interesting as a need has been identified
for new algorithms capable of preparing dense hard-sphere fluids [Torquato2000] and because
the ability to rapidly equilibrate hard sphere fluids is useful as an input to more complex
multi-component simulations [???]. Finally, using the hard-sphere fluid as an initial model
allows easy future generalisation of the code to any other short-ranged pair potential.
II. PARALLEL MARKOV CHAIN MONTE CARLO
We wish to parallelize the Metropolis algorithm, as applied to problems of equilibrium
statistical physics in the canonical ensemble [5]. Generalizations to other ensembles and
systems will be discussed later. As shown in Figure 1, at each step, we choose a particle at
random, generate a new position for that particle, and accept the move with probability:
Pacc = min
{
1.0, exp
[
−
∆H
kBT
]}
. (1)
This is repeated N times per simulation time-step, where N is the number of particles in
the system, so that each particle is moved once per Monte Carlo sweep (MCS) on average.
Given that the Hamiltonian is local, and as the acceptance probability in eq. 1 depends
only on the change in energy associated with the trial move, it is clear that each individual
move can be performed using only local information. To see exactly how this can be done,
we must first decompose the Markov chain to separate the global and the local operations.
4
and exchange of the active and passive domains can be ignored. While such schemes are
clearly correct given sufficient run-time, and appear to give good results in practice, it is
still desirable to attempt a parallelization of a Monte Carlo Markov chain algorithm that
does not introduce any extra correlations to the dynamics. In this paper, we present a
new algorithm that optimally exploits the inherent parallelism present in a Monte Carlo
simulation, without altering the overall Markov chain in any way. This algorithm contains
no further approximations or time-scales over and above those in the serial version of the
code, and the Markov sequence is executed just as it would be on a serial machine.
We use the Ising model as a simple abstract example, allowing the overall approach to be
elucidated in a simple one-dimensional system. A concrete implementation of this algorithm
is presented for the hard sphere fluid, along with a new algorithm for generating the initial
particle distribution. This simple model system is interesting as a need has been identified
for new algorithms capable of preparing dense hard-sphere fluids [Torquato2000] and because
the ability to rapidly equilibrate hard sphere fluids is useful as an input to more complex
multi-component simulations [???]. Finally, using the hard-sphere fluid as an initial model
allows easy future generalisation of the code to any other short-ranged pair potential.
II. PARALLEL MARKOV CHAIN MONTE CARLO
We wish to parallelize the Metropolis algorithm, as applied to problems of equilibrium
statistical physics in the canonical ensemble [5]. Generalizations to other ensembles and
systems will be discussed later. As shown in Figure 1, at each step, we choose a particle at
random, generate a new position for that particle, and accept the move with probability:
Pacc = min
{
1.0, exp
[
−
∆H
kBT
]}
. (1)
This is repeated N times per simulation time-step, where N is the number of particles in
the system, so that each particle is moved once per Monte Carlo sweep (MCS) on average.
Given that the Hamiltonian is local, and as the acceptance probability in eq. 1 depends
only on the change in energy associated with the trial move, it is clear that each individual
move can be performed using only local information. To see exactly how this can be done,
we must first decompose the Markov chain to separate the global and the local operations.
4
Page 5
Pick a particle at random.
Generate a new position.
Calculate ∆Ε.
Accept/reject.
Figure 1: Core of the serial Monte Carlo Markov chain algorithm. The large grey box indicates
the stages that depend on the state of the particles of the system, e.g. the values of the spins in
an Ising model.
A. Markov Chain & Domain Decomposition
As suggested by Figure 1, the Markov chain in a Monte Carlo simulation can be decom-
posed into two parts. The global part, which does not depend on the local state, and the
local part, which depends only on the local state. The global part of the Markov chain is
in the initial move generation, where the particle to be moved is selected at random. The
rest of the algorithm, (generating the actual move, calculating the energy change, and ac-
cepting or rejecting the move), depends on the actual values of the local state variables.
If we consider the random particle selection as the core of the global Markov chain, and
the rest of the algorithm as the local consequences of that chain, then the opportunity for
parallelization becomes clearer. Figure 2 illustrates the natural opportunity for parallelism
in a one-dimensional Ising magnet evolving under the Metropolis algorithm, by showing that
because only nearby particles will be affected by each move, there is a high degree of natural
potential for parallelization.
Clearly, there is much to be gained here, as the code only needs to execute in serial when
neighbouring processors are required to update neighbouring sites upon sequential moves
5
Generate a new position.
Calculate ∆Ε.
Accept/reject.
Figure 1: Core of the serial Monte Carlo Markov chain algorithm. The large grey box indicates
the stages that depend on the state of the particles of the system, e.g. the values of the spins in
an Ising model.
A. Markov Chain & Domain Decomposition
As suggested by Figure 1, the Markov chain in a Monte Carlo simulation can be decom-
posed into two parts. The global part, which does not depend on the local state, and the
local part, which depends only on the local state. The global part of the Markov chain is
in the initial move generation, where the particle to be moved is selected at random. The
rest of the algorithm, (generating the actual move, calculating the energy change, and ac-
cepting or rejecting the move), depends on the actual values of the local state variables.
If we consider the random particle selection as the core of the global Markov chain, and
the rest of the algorithm as the local consequences of that chain, then the opportunity for
parallelization becomes clearer. Figure 2 illustrates the natural opportunity for parallelism
in a one-dimensional Ising magnet evolving under the Metropolis algorithm, by showing that
because only nearby particles will be affected by each move, there is a high degree of natural
potential for parallelization.
Clearly, there is much to be gained here, as the code only needs to execute in serial when
neighbouring processors are required to update neighbouring sites upon sequential moves
5
Page 6
time
sp
ac
e
time
PE
0
PE
1
PE
2
PE
3
PE
0
PE
1
PE
2
PE
3
time
time
PE
0
PE
1
PE
2
PE
3
a)
b)
c)
d)
Figure 2: Inherent parallelism in the one-dimensional Ising model. Each space-time plot shows a
one-dimensional lattice where a spin flip is attempted at one site at each moment in time. The
black squares indicate the site that has been selected for a trial move, and the grey squares indicate
the neighbouring sites that are needed to calculate the energy change. a) The initial Markov chain
of sites selected for trial moves. b) The same sequence, decomposed over four processing elements.
c) As in (b), but the ‘idle’ time in-between operations on each PE has been shaded grey. d) The
‘idle’ time will be automatically exploited, and the sequence of moves is compressed into a shorter
overall run time.
6
sp
ac
e
time
PE
0
PE
1
PE
2
PE
3
PE
0
PE
1
PE
2
PE
3
time
time
PE
0
PE
1
PE
2
PE
3
a)
b)
c)
d)
Figure 2: Inherent parallelism in the one-dimensional Ising model. Each space-time plot shows a
one-dimensional lattice where a spin flip is attempted at one site at each moment in time. The
black squares indicate the site that has been selected for a trial move, and the grey squares indicate
the neighbouring sites that are needed to calculate the energy change. a) The initial Markov chain
of sites selected for trial moves. b) The same sequence, decomposed over four processing elements.
c) As in (b), but the ‘idle’ time in-between operations on each PE has been shaded grey. d) The
‘idle’ time will be automatically exploited, and the sequence of moves is compressed into a shorter
overall run time.
6
Page 7
in the Markov Chain. This becomes increasingly unlikely as the number of dimensions,
and thus the processor domain surface area, is increased, and so allows the communication
work to be neatly overlapped with the local computations. For the sake of simplicity, we
consider a one-dimensional domain decomposition here (as shown in Figure 2), although this
approach is suitable for any spatial domain decomposition. Each processor will manage an
equal volume of the overall system, sliced up along one axis. This local and global simulation
structure will be generalised to off-lattice systems in section III.
B. Implicit Orchestration
At first sight, exploiting this inherent parallelism seems impossible to do efficiently. Each
process must be informed of the sequence of moves, and if a neighbouring processor is mov-
ing a particle that is in the halo, the processor must await its turn. This seems to imply a
requirement of at least one message sent and one received for each processor, for each and
every particle move. As communications are usually an order of magnitude slower than sim-
ple local computations [???], this would lead to extremely poor performance. This problem
of orchestrating the simulation is the core issue that must be faced upon parallelization.
One option is to implement a single Markov chain on a master processor, and to distribute
the moves and orchestrate the communications via a task farm model. This has been im-
plemented for a non-equilibrium Markov chain system by Shu [6], in an overall approach
that is closely related to our own. Before adopting the task-farm approach, it is important
to consider that in a computer simulation like this, the Markov chain will be generated
using a pseudo random number generator. The deterministic nature of the RNG means
that this type of orchestration can be achieved without a conductor. Each processor can
run individual copies of the same RNG, initialised with the same seed, and so generate the
entire Markov chain of candidate moves. Therefore, each processor can determine its own
role and that of the neighbouring processors, without any actual communication between
the processors. This means that the only communication required for orchestration is just
the initial distribution of the random number seed for the global Markov chain and, for
off-lattice systems, infrequent updates to some look-up tables (these tables will be described
in the next section).
7
and thus the processor domain surface area, is increased, and so allows the communication
work to be neatly overlapped with the local computations. For the sake of simplicity, we
consider a one-dimensional domain decomposition here (as shown in Figure 2), although this
approach is suitable for any spatial domain decomposition. Each processor will manage an
equal volume of the overall system, sliced up along one axis. This local and global simulation
structure will be generalised to off-lattice systems in section III.
B. Implicit Orchestration
At first sight, exploiting this inherent parallelism seems impossible to do efficiently. Each
process must be informed of the sequence of moves, and if a neighbouring processor is mov-
ing a particle that is in the halo, the processor must await its turn. This seems to imply a
requirement of at least one message sent and one received for each processor, for each and
every particle move. As communications are usually an order of magnitude slower than sim-
ple local computations [???], this would lead to extremely poor performance. This problem
of orchestrating the simulation is the core issue that must be faced upon parallelization.
One option is to implement a single Markov chain on a master processor, and to distribute
the moves and orchestrate the communications via a task farm model. This has been im-
plemented for a non-equilibrium Markov chain system by Shu [6], in an overall approach
that is closely related to our own. Before adopting the task-farm approach, it is important
to consider that in a computer simulation like this, the Markov chain will be generated
using a pseudo random number generator. The deterministic nature of the RNG means
that this type of orchestration can be achieved without a conductor. Each processor can
run individual copies of the same RNG, initialised with the same seed, and so generate the
entire Markov chain of candidate moves. Therefore, each processor can determine its own
role and that of the neighbouring processors, without any actual communication between
the processors. This means that the only communication required for orchestration is just
the initial distribution of the random number seed for the global Markov chain and, for
off-lattice systems, infrequent updates to some look-up tables (these tables will be described
in the next section).
7
Page 8
Generate a new position.
Calculate ∆Ε.
Accept/reject.
Pick a particle at random.
Look upowner PE
If this PE owns the current particle. If this is a halo particle owned by a neighbour PE. Otherwise.
Receive updates for any neighbouring halo cells.
If any neighbouring cells are halos,send position update to neighbour PE.
Store ID of update message forthe affected halo cells.
Figure 3: Parallel implementation of the Monte Carlo algorithm. Each processor executes the same
workflow, and generates the whole global Markov chain, but only acts upon a sub-set of it. The
arrows indicate the necessary messages required for the simulation to keep the halo state up to
date. The overall orchestration requires minimal messaging, as duplicated RNG streams can be
used to maintain coherence implicitly.
C. Halo State Message Passing Scheme
While the sequence of trial moves can be distributed without communication, the rest
of the computation cannot, as this depends on the state of the system and the halo state
must be kept up to date. However, as the implicit orchestration allows each processor to
know when its neighbours may have altered a particle that is in the processors halo region,
the necessary sequence of update-messages can be generated from the global Markov chain.
This aspect of the simulation is quite complex, and the overall logic is illustrated in Figure
3.
8
Calculate ∆Ε.
Accept/reject.
Pick a particle at random.
Look upowner PE
If this PE owns the current particle. If this is a halo particle owned by a neighbour PE. Otherwise.
Receive updates for any neighbouring halo cells.
If any neighbouring cells are halos,send position update to neighbour PE.
Store ID of update message forthe affected halo cells.
Figure 3: Parallel implementation of the Monte Carlo algorithm. Each processor executes the same
workflow, and generates the whole global Markov chain, but only acts upon a sub-set of it. The
arrows indicate the necessary messages required for the simulation to keep the halo state up to
date. The overall orchestration requires minimal messaging, as duplicated RNG streams can be
used to maintain coherence implicitly.
C. Halo State Message Passing Scheme
While the sequence of trial moves can be distributed without communication, the rest
of the computation cannot, as this depends on the state of the system and the halo state
must be kept up to date. However, as the implicit orchestration allows each processor to
know when its neighbours may have altered a particle that is in the processors halo region,
the necessary sequence of update-messages can be generated from the global Markov chain.
This aspect of the simulation is quite complex, and the overall logic is illustrated in Figure
3.
8
Page 9
When a PE performs a move that requires an up-to-date halo, i.e. when the energy
calculation involves the halo cells, it first collects all the relevant halo-update messages (see
below) that have been issued up to that point. Once all the particles in the halo have been
brought up to date, the move continues as normal. After a processor attempts to update a
particle that belongs to a cell which forms part of the halo of another process, a halo-update
message is send to the dependant PE. This happens irrespective of whether or not the the
trial move was actually accepted. The message contains the identity of the particle that
has been ‘touched’ and its position after the update (which may be unchanged), and the
message is marked with a unique flag based on the current sweep time and process rank.
Every time a neighbouring PE moves a particle that lies in the halo, the corresponding halo-
state message flag generated and pushed onto a stack linked to that halo cell. Non-blocking
sends and receives are used throughout, to allow the communications to overlap with the
computations.
As the algorithm relies only on the local change in energy, there is no running total for
the energy of the entire system, only for the total energy within each processor domain. The
simulation periodically determines the current total system energy via a global operation
that is carried out once every few MCS, at which point the processors will be synchronized
to the same point in the Markov chain.
This scheme optimally utilizes the natural parallelism present in the Markov chain, and
overlaps communications with computations as much as is possible while still preserving the
global Markov sequence. Only the fraction of time when sequential moves involve nearest
neighbour cells that lie on a decomposition boundary will be spent processing moves serially.
In all other cases, the communications will be overlapped with the computation. As indicated
above, this is similar to the approach that Shu [6] developed for a non-equilibrium deposition
process, but we employ a single global Markov chain, instead of having a separate Markov
chain for each PE, and the implicit orchestration avoids the need to employ a task farm.
To facilitate this parallelization scheme, the algorithm does acquire some significant base-
line costs over the serial form. Running a copy of the global Markov chain takes time, but
this cost is minimal as it only generates the particle identities for the trial moves. The cost
does become more severe as the number of processors is increased, and the fraction of moves
that concern each individual processor decreases, but this is unlikely to become a significant
performance problem for all the the simplest models. The rest of the computations, like the
9
calculation involves the halo cells, it first collects all the relevant halo-update messages (see
below) that have been issued up to that point. Once all the particles in the halo have been
brought up to date, the move continues as normal. After a processor attempts to update a
particle that belongs to a cell which forms part of the halo of another process, a halo-update
message is send to the dependant PE. This happens irrespective of whether or not the the
trial move was actually accepted. The message contains the identity of the particle that
has been ‘touched’ and its position after the update (which may be unchanged), and the
message is marked with a unique flag based on the current sweep time and process rank.
Every time a neighbouring PE moves a particle that lies in the halo, the corresponding halo-
state message flag generated and pushed onto a stack linked to that halo cell. Non-blocking
sends and receives are used throughout, to allow the communications to overlap with the
computations.
As the algorithm relies only on the local change in energy, there is no running total for
the energy of the entire system, only for the total energy within each processor domain. The
simulation periodically determines the current total system energy via a global operation
that is carried out once every few MCS, at which point the processors will be synchronized
to the same point in the Markov chain.
This scheme optimally utilizes the natural parallelism present in the Markov chain, and
overlaps communications with computations as much as is possible while still preserving the
global Markov sequence. Only the fraction of time when sequential moves involve nearest
neighbour cells that lie on a decomposition boundary will be spent processing moves serially.
In all other cases, the communications will be overlapped with the computation. As indicated
above, this is similar to the approach that Shu [6] developed for a non-equilibrium deposition
process, but we employ a single global Markov chain, instead of having a separate Markov
chain for each PE, and the implicit orchestration avoids the need to employ a task farm.
To facilitate this parallelization scheme, the algorithm does acquire some significant base-
line costs over the serial form. Running a copy of the global Markov chain takes time, but
this cost is minimal as it only generates the particle identities for the trial moves. The cost
does become more severe as the number of processors is increased, and the fraction of moves
that concern each individual processor decreases, but this is unlikely to become a significant
performance problem for all the the simplest models. The rest of the computations, like the
9
Page 10
generation of the actual trial move, are carried out locally using a separate RNG stream.
Also, each PE must hold some look-up tables on each processor so that the owning PE and
the affected halo cells can be inferred from the identity of the particles. If the particles
of the system are free to move between cells, as they will be for off-lattice systems, then
these tables must be updated periodically. The precise frequency of updates depends on
the system in question, but will generally be infrequent in comparison to the movement of
individual particles, and thus an acceptable overhead. If the system has fixed degrees of
freedom, like the Ising model, then these tables never need to be updated and this overhead
is not present.
The code also runs periodic self-tests, to ensure that the global RNG streams are still
synchronised, that the neighbour lists and look-up tables are consistent. Measurements of
system properties, including the total energy, are also most easily implemented after these
checks.
III. MODELLING THE HARD SPHERE FLUID
We wish to model an array of N particles, each at a location ~ri in a three-dimensional
space, under periodic boundary conditions, interacting via a short-range configurational
energy such that
H({~r}) =
0 if rij ≥ σ ∀ i, j,
∞ otherwise,
(2)
where rij = |~ri − ~rj| and σ is the diameter of the spheres. The size of the simulation cell
remains fixed throughout the simulation, and this volume is determined by specifying the
required volume fraction, via
φ =
Npiσ3
6V
. (3)
Trial moves are generated by choosing a sphere at random, and then altering the position
of the chosen sphere by a small random vector where each component is drawn from a
uniform random deviate scaled to a size ∆r.
As mentioned above, due to the complexities involved in implementing this kind of parallel
algorithm for the first time, a one-dimensional domain decomposition was used. To allow
10
Also, each PE must hold some look-up tables on each processor so that the owning PE and
the affected halo cells can be inferred from the identity of the particles. If the particles
of the system are free to move between cells, as they will be for off-lattice systems, then
these tables must be updated periodically. The precise frequency of updates depends on
the system in question, but will generally be infrequent in comparison to the movement of
individual particles, and thus an acceptable overhead. If the system has fixed degrees of
freedom, like the Ising model, then these tables never need to be updated and this overhead
is not present.
The code also runs periodic self-tests, to ensure that the global RNG streams are still
synchronised, that the neighbour lists and look-up tables are consistent. Measurements of
system properties, including the total energy, are also most easily implemented after these
checks.
III. MODELLING THE HARD SPHERE FLUID
We wish to model an array of N particles, each at a location ~ri in a three-dimensional
space, under periodic boundary conditions, interacting via a short-range configurational
energy such that
H({~r}) =
0 if rij ≥ σ ∀ i, j,
∞ otherwise,
(2)
where rij = |~ri − ~rj| and σ is the diameter of the spheres. The size of the simulation cell
remains fixed throughout the simulation, and this volume is determined by specifying the
required volume fraction, via
φ =
Npiσ3
6V
. (3)
Trial moves are generated by choosing a sphere at random, and then altering the position
of the chosen sphere by a small random vector where each component is drawn from a
uniform random deviate scaled to a size ∆r.
As mentioned above, due to the complexities involved in implementing this kind of parallel
algorithm for the first time, a one-dimensional domain decomposition was used. To allow
10
Page 11
the construction of halos and sensible neighbour lists (thus avoiding the O(N2) calculation
implied by eq. 2), we use a standard particle-in-cell neighbour list system, where the cell
is at least as large as the range of interaction. The overall system is broken down into a
regular grid of Cartesian cells of side length Lm. The interaction energy of each particle is
only calculated for neighbours within that particles cell, and within the neighbouring cells.
To ensure that there are an integer number of cells in each direction, the size of the cell is
allowed to grow slightly in order to satisfy the the specified volume.
As the particles are free to move between cells, the list of particle belonging to each cell
must be updated from time-to-time, and this is done as infrequently as possible, while still
ensuring that no particle interactions are missed. We employ a conservative re-meshing time,
based on the time (in MCS) it would take a single sphere to diffuse across the cell length, if
it is selected for a trial move once per sweep, and always moved in the same direction:
Tmesh =
Lm − σ
∆r
. (4)
In all the work presented here, a mesh of size Lm = 3σ was employed, and most simu-
lations used a step-size of ∆r = 0.1σ, implying a re-meshing time of Tmesh = 20 MCS. The
running-totals for the energy on each domain were occasionally tested by computing the
energy from scratch after the mesh has been updated. Any problems with the meshes not
being up to date would be caught at this point, and this was never found to occur.
To simplify the management of the processor halos, each processor is allocated a domain
that is no fewer than three cells thick along the decomposition access. This ensure that no
halo particles are in the halo of more than one other processor, and thus allows the algorithm
to be simplified. Of course, this also places an overly restrictive upper limit on the number
of processes that can be used on a simulation of a given size.
A. Initialising the Hard Sphere Fluid
In order to initialization a system of hard spheres, we have developed a new Monte
Carlo scheme that allows the desired volume fraction to be easily specified, and that does
not involve any expensive global moves, e.g. increasing the diameter of the particles [???].
The particle centres are initially placed at random throughout the system, ignoring all
interactions. This will usually create large numbers of overlapping particles, and these must
11
implied by eq. 2), we use a standard particle-in-cell neighbour list system, where the cell
is at least as large as the range of interaction. The overall system is broken down into a
regular grid of Cartesian cells of side length Lm. The interaction energy of each particle is
only calculated for neighbours within that particles cell, and within the neighbouring cells.
To ensure that there are an integer number of cells in each direction, the size of the cell is
allowed to grow slightly in order to satisfy the the specified volume.
As the particles are free to move between cells, the list of particle belonging to each cell
must be updated from time-to-time, and this is done as infrequently as possible, while still
ensuring that no particle interactions are missed. We employ a conservative re-meshing time,
based on the time (in MCS) it would take a single sphere to diffuse across the cell length, if
it is selected for a trial move once per sweep, and always moved in the same direction:
Tmesh =
Lm − σ
∆r
. (4)
In all the work presented here, a mesh of size Lm = 3σ was employed, and most simu-
lations used a step-size of ∆r = 0.1σ, implying a re-meshing time of Tmesh = 20 MCS. The
running-totals for the energy on each domain were occasionally tested by computing the
energy from scratch after the mesh has been updated. Any problems with the meshes not
being up to date would be caught at this point, and this was never found to occur.
To simplify the management of the processor halos, each processor is allocated a domain
that is no fewer than three cells thick along the decomposition access. This ensure that no
halo particles are in the halo of more than one other processor, and thus allows the algorithm
to be simplified. Of course, this also places an overly restrictive upper limit on the number
of processes that can be used on a simulation of a given size.
A. Initialising the Hard Sphere Fluid
In order to initialization a system of hard spheres, we have developed a new Monte
Carlo scheme that allows the desired volume fraction to be easily specified, and that does
not involve any expensive global moves, e.g. increasing the diameter of the particles [???].
The particle centres are initially placed at random throughout the system, ignoring all
interactions. This will usually create large numbers of overlapping particles, and these must
11
Page 12
0 0.5 1 1.5 2 2.5rij/σ
0
0.5
1
1.5
2
H [kT
]
HSOP
Figure 4: The hard-sphere potential (HS) and overlap potential (OP).
be removed in order to reach an equilibrated fluid. To do this we use a secondary overlap
potential, the calculation of which can be piggy-backed on the true energy calculation. This
artificial potential has the form,
HOP ({~r}) =
2−
r2ij
σ2 if rij < σ ∀ i, j,
0 otherwise,
(5)
and is shown in Figure 4, along with the hard-sphere potential. This provides a very
useful order parameter that measures not only the number of overlapping particles, but
reflects the degree to which they overlap.
For this artificial potential, we chose to use a simple steepest-descent algorithm, so that
any local move that will maintain or lower the value of the overlap Hamiltonian is accepted,
while all other moves are rejected. If the local overlap Hamiltonian is zero, then the true
Hamiltonian is used instead, and the local simulation proceeds under the usual Metropolis
algorithm. Although we did not attempt it here, wary that it may encourage crystalliza-
tion, we note that an annealing algorithm could be employed instead of steepest-descent if
required. Also, the form of the overlap potential was chosen somewhat arbitrarily, simply
because it is easily computed from the square of the inter-particle separation, which is al-
ready determined in the course of computing the true Hamiltonian. We did compared our
overlap potential with a simple, flat top-hat overlap potential ( HOP = 1.0 between rij = 0.0
and rij = σ, otherwise HOP = 0.0), which simply counts the number of overlaps in the
12
0
0.5
1
1.5
2
H [kT
]
HSOP
Figure 4: The hard-sphere potential (HS) and overlap potential (OP).
be removed in order to reach an equilibrated fluid. To do this we use a secondary overlap
potential, the calculation of which can be piggy-backed on the true energy calculation. This
artificial potential has the form,
HOP ({~r}) =
2−
r2ij
σ2 if rij < σ ∀ i, j,
0 otherwise,
(5)
and is shown in Figure 4, along with the hard-sphere potential. This provides a very
useful order parameter that measures not only the number of overlapping particles, but
reflects the degree to which they overlap.
For this artificial potential, we chose to use a simple steepest-descent algorithm, so that
any local move that will maintain or lower the value of the overlap Hamiltonian is accepted,
while all other moves are rejected. If the local overlap Hamiltonian is zero, then the true
Hamiltonian is used instead, and the local simulation proceeds under the usual Metropolis
algorithm. Although we did not attempt it here, wary that it may encourage crystalliza-
tion, we note that an annealing algorithm could be employed instead of steepest-descent if
required. Also, the form of the overlap potential was chosen somewhat arbitrarily, simply
because it is easily computed from the square of the inter-particle separation, which is al-
ready determined in the course of computing the true Hamiltonian. We did compared our
overlap potential with a simple, flat top-hat overlap potential ( HOP = 1.0 between rij = 0.0
and rij = σ, otherwise HOP = 0.0), which simply counts the number of overlaps in the
12
Page 13
system, but found the run-times required to push out all of the overlaps were very large at
higher densities. Other forms may also work well, but are unlikely to provide huge gains
over the simple quadratic form given here.
The total overlap order parameter can be determined along with the total energy, and once
all the overlaps have been equilibrated out of the system (after a time referred to herein as
tOP ) the simulation proceeds under the normal Hamiltonian and meaningful measurements
can be performed upon the system. For example, to examine the type of structures the
algorithm is generating, we measure the particle pair distribution function, g(r) [???].
IV. RESULTS
The algorithm was implemented in MPI, version two, using the C++ API [???]. As the
algorithm requires large numbers of small messages, some implementations of MPI requried
large message buffers to be specified manually in order to ensure that the simulation did not
hang while waiting for messages to be sent, due to the large queues of messages awaiting
reciept. While a range of platforms were employed during development, all production runs
were performed on the University of Edinburgh’s Blue Sky BlueGene/L super-computing
service [???].
A. Benchmarking
The parallel speed-up is defined as,
S =
T (1)
T (NP )
, (6)
where T (NP ) is the run time required to execute the program on NP processors, with
the simulation size and all other parameters fixed. The results are shown in Figure 5 for a
system of 100,000 hard spheres at a volume fraction of φ = 0.1, for both a cubic and a cuboid
system. Overall, the scaling is good, and somewhat better than in the related simulations by
Shu [6], probably because the need for the task farm has been circumvented by the implict
orchestration scheme. The scaling is not as good as the near-perfect scaling of Heffelfinger
[Heff1995], but this is the price that must be paid for maintaining the integrity of the global
Markov chain.
13
higher densities. Other forms may also work well, but are unlikely to provide huge gains
over the simple quadratic form given here.
The total overlap order parameter can be determined along with the total energy, and once
all the overlaps have been equilibrated out of the system (after a time referred to herein as
tOP ) the simulation proceeds under the normal Hamiltonian and meaningful measurements
can be performed upon the system. For example, to examine the type of structures the
algorithm is generating, we measure the particle pair distribution function, g(r) [???].
IV. RESULTS
The algorithm was implemented in MPI, version two, using the C++ API [???]. As the
algorithm requires large numbers of small messages, some implementations of MPI requried
large message buffers to be specified manually in order to ensure that the simulation did not
hang while waiting for messages to be sent, due to the large queues of messages awaiting
reciept. While a range of platforms were employed during development, all production runs
were performed on the University of Edinburgh’s Blue Sky BlueGene/L super-computing
service [???].
A. Benchmarking
The parallel speed-up is defined as,
S =
T (1)
T (NP )
, (6)
where T (NP ) is the run time required to execute the program on NP processors, with
the simulation size and all other parameters fixed. The results are shown in Figure 5 for a
system of 100,000 hard spheres at a volume fraction of φ = 0.1, for both a cubic and a cuboid
system. Overall, the scaling is good, and somewhat better than in the related simulations by
Shu [6], probably because the need for the task farm has been circumvented by the implict
orchestration scheme. The scaling is not as good as the near-perfect scaling of Heffelfinger
[Heff1995], but this is the price that must be paid for maintaining the integrity of the global
Markov chain.
13
Page 14
1 5 10 15 20 25NP
1
5
10
15
20
25
S
IdealCube N=100,000Long N=100,000
Figure 5: Parallel speed-up for a cubic and cuboid system of hard spheres at a volume fraction of
0.1. The cuboid system was constructed at the same parameters as the cubic one, but was twice
as long along the domain decomposition axis, and correspondingly narrower across that axis. Both
curves display saw-tooth fluctuation due to the domain size changing while the number of cells on
each domain remain fixed.
Although there is a small drop when comparing the serial code against the parallel code
on three processors, due to the overall overheads involved in the parallel algorithm, the
initial scaling for small numbers of processors is good. In particular, the simulation runs
very nearly twice as fast on six processors as it does on three. However, as the number of
processors goes up, the domain on each processor gets thinner, and the ratio of the domain
surface to the volume goes up. Therefore the ratio of communication work to computation
work goes up, and the simulation slows down. Figure 5 also shows that for a longer, narrower
system, the scaling improves, confirming that the surface to volume ratio of the domains is
the main reason that the simulation to slows down. As the limits of the scaling speed-up are
largely due to the choice of one-dimensional domain decomposition, a cubic or face-centered
cubic decomposition will lead to even better scaling and should allow systems of around
107particles to be explored relatively easily.
14
1
5
10
15
20
25
S
IdealCube N=100,000Long N=100,000
Figure 5: Parallel speed-up for a cubic and cuboid system of hard spheres at a volume fraction of
0.1. The cuboid system was constructed at the same parameters as the cubic one, but was twice
as long along the domain decomposition axis, and correspondingly narrower across that axis. Both
curves display saw-tooth fluctuation due to the domain size changing while the number of cells on
each domain remain fixed.
Although there is a small drop when comparing the serial code against the parallel code
on three processors, due to the overall overheads involved in the parallel algorithm, the
initial scaling for small numbers of processors is good. In particular, the simulation runs
very nearly twice as fast on six processors as it does on three. However, as the number of
processors goes up, the domain on each processor gets thinner, and the ratio of the domain
surface to the volume goes up. Therefore the ratio of communication work to computation
work goes up, and the simulation slows down. Figure 5 also shows that for a longer, narrower
system, the scaling improves, confirming that the surface to volume ratio of the domains is
the main reason that the simulation to slows down. As the limits of the scaling speed-up are
largely due to the choice of one-dimensional domain decomposition, a cubic or face-centered
cubic decomposition will lead to even better scaling and should allow systems of around
107particles to be explored relatively easily.
14
Page 15
0.1 0.2 0.3 0.4 0.5 0.6φ
102
103
104
105
t OP
[MC
S]
N = 2,000N = 5,000,000
Figure 6: Time required to equilibrate the system so that the hard-core constraint is satisfied, as
a function of the target volume fraction. Open circles are data from N = 2, 000, and filled circles
are data from N = 5, 000, 000. Solid curve is the power-law fit – see main text for details.
B. The Hard Sphere Fluid
Figure 6 shows how the overlap equilibration time, tOP , depends on the volume fraction,
indicating that this time-scale only depends weakly upon N . The time required to remove
all overlaps also displays a clear divergence as the density approaches the random close-
packing (RCP) limit. We found the form of this divergence could be well approximated by
a power-law of the form:
tOP ≈ 25.14× |φ− 0.61|
−1.64 . (7)
This predicts an approximate RCP limit of φ = 0.61 for this algorithm (at N = 2, 000)
which is broadly consistent with the expected limit of ≈ 0.64 [???]. A more rigorous ex-
ploration of the RCP limit would required a more careful exploration of this divergence,
taking finite-size effects and other simulation parameters properly into account. Further-
more, we have done nothing to prohibit crystallization and so cannot ensure that the fluid
has not partially crystallised, although we have found no evidence of crystallization so far
15
102
103
104
105
t OP
[MC
S]
N = 2,000N = 5,000,000
Figure 6: Time required to equilibrate the system so that the hard-core constraint is satisfied, as
a function of the target volume fraction. Open circles are data from N = 2, 000, and filled circles
are data from N = 5, 000, 000. Solid curve is the power-law fit – see main text for details.
B. The Hard Sphere Fluid
Figure 6 shows how the overlap equilibration time, tOP , depends on the volume fraction,
indicating that this time-scale only depends weakly upon N . The time required to remove
all overlaps also displays a clear divergence as the density approaches the random close-
packing (RCP) limit. We found the form of this divergence could be well approximated by
a power-law of the form:
tOP ≈ 25.14× |φ− 0.61|
−1.64 . (7)
This predicts an approximate RCP limit of φ = 0.61 for this algorithm (at N = 2, 000)
which is broadly consistent with the expected limit of ≈ 0.64 [???]. A more rigorous ex-
ploration of the RCP limit would required a more careful exploration of this divergence,
taking finite-size effects and other simulation parameters properly into account. Further-
more, we have done nothing to prohibit crystallization and so cannot ensure that the fluid
has not partially crystallised, although we have found no evidence of crystallization so far
15
Page 16
(see below).
Although most runs used a random walk step size of ∆r = 0.1σ, some runs were also
carried out using ∆r = 0.01σ. Although this does improve the acceptance rate at higher
densities, the overall effect on tOP was found to be quite minor, as the overall divergence is
determined mainly by packing effects. When optimizing simulation performance, ∆r should
be modified to ensure the most efficient exploration of phase space is carried out, scaling the
step size down as the mean particle spacing falls to zero.
Figure 7 shows the radial distribution for the hard sphere fluid both below and above
the fluid-crystal co-existance region (φ ≈ 0.45 − 0.55). At a moderate volume fraction of
φ = 0.4, the plot shows g(r) at t = 0 MCS (initial ideal-gas distribution), at t = tOP = 860
MCS (immediately after the overlaps have been removed), and at t = 2000 MCS (i.e. a
significant amount of time beyond tOP ). Clearly, the fluid is already close to the equilibrium
particle distribution once the overlaps have been removed, and no further evolution of the
pair distribution could be detected. However, g(r) is not very sensitive to local order [???],
and more detailed measurements are required in order to fully gauge equilibration.
As we are hoping to explore the RCP limit, we also looked for crystallization at higher
densities. Figure 7 also shows the measured g(r) for a hard sphere fluid at a much higher
density of φ = 0.55. The splitting of the second peak is consistent with a dense amorphous
state [???], and there are no hints of close-packing features. However, as for the fluid, a
closer examination is required, perhaps using a more sensitive local order-parameter that
would allow any crystalline domains to be identified explicitly [???, GJA].
V. DISCUSSION
We have presented a new domain-decomposition algorithm for Monte Carlo simulations,
applicable to any short-ranged interaction. The approach does not alter the global Markov
chain in any way, meaning that the random sequence of particles selected for trial moves can
be reproduced on any number of processors. Duplicates of the global Markov chain, gener-
ated from synchronised RNG streams, are used to orchestrate the simualtion while requiring
only minimal explicit communication. The algorithm was implemented via a one-dimensional
domain decomposition, and applied to a system of hard spheres. The benchmarking results
confirm that the simulation is capable of scaling well, and indicates that a more traditional
16
Although most runs used a random walk step size of ∆r = 0.1σ, some runs were also
carried out using ∆r = 0.01σ. Although this does improve the acceptance rate at higher
densities, the overall effect on tOP was found to be quite minor, as the overall divergence is
determined mainly by packing effects. When optimizing simulation performance, ∆r should
be modified to ensure the most efficient exploration of phase space is carried out, scaling the
step size down as the mean particle spacing falls to zero.
Figure 7 shows the radial distribution for the hard sphere fluid both below and above
the fluid-crystal co-existance region (φ ≈ 0.45 − 0.55). At a moderate volume fraction of
φ = 0.4, the plot shows g(r) at t = 0 MCS (initial ideal-gas distribution), at t = tOP = 860
MCS (immediately after the overlaps have been removed), and at t = 2000 MCS (i.e. a
significant amount of time beyond tOP ). Clearly, the fluid is already close to the equilibrium
particle distribution once the overlaps have been removed, and no further evolution of the
pair distribution could be detected. However, g(r) is not very sensitive to local order [???],
and more detailed measurements are required in order to fully gauge equilibration.
As we are hoping to explore the RCP limit, we also looked for crystallization at higher
densities. Figure 7 also shows the measured g(r) for a hard sphere fluid at a much higher
density of φ = 0.55. The splitting of the second peak is consistent with a dense amorphous
state [???], and there are no hints of close-packing features. However, as for the fluid, a
closer examination is required, perhaps using a more sensitive local order-parameter that
would allow any crystalline domains to be identified explicitly [???, GJA].
V. DISCUSSION
We have presented a new domain-decomposition algorithm for Monte Carlo simulations,
applicable to any short-ranged interaction. The approach does not alter the global Markov
chain in any way, meaning that the random sequence of particles selected for trial moves can
be reproduced on any number of processors. Duplicates of the global Markov chain, gener-
ated from synchronised RNG streams, are used to orchestrate the simualtion while requiring
only minimal explicit communication. The algorithm was implemented via a one-dimensional
domain decomposition, and applied to a system of hard spheres. The benchmarking results
confirm that the simulation is capable of scaling well, and indicates that a more traditional
16
Page 17
0 1 2 3
r/σ
0
2
4
6
8
g(r)
φ = 0.4, t = 0φ = 0.4, t = 860φ = 0.4, t = 2000φ = 0.55
Figure 7: Radial distribution function, g(r). Results for φ = 0.4 are instantaneous spatially-
averaged results from a simulation of N = 5, 000, 000 hard spheres. The results at φ = 0.55 are
taken from a smaller, N = 500, 000 simulation. In this case, as well as the spatial averaging, the
results for g(r) were also time-averaged, from samples taken every 100 MCS from t = tOP = 4800
MCS to t = 10, 000 MCS. The values at contact, g(σ), are consistent with those in the literature
for the hard-sphere fluid [???,???].
cubic domain decomposition will lead to excellent scaling for very large systems of parti-
cles. The parallel implementation does acquire some overheads in comparison to the serial
version, but these are not severe enough to cause poor scaling performance overall.
We have also presented a new algorithm for generating dense hard-sphere fluids, and that
is well suited to Monte Carlo simulation in general, and to this parallelisation scheme in
particular. The algorithm has been shown to efficiently generate dense random packings
along the metastable branch. No evidence of crystallization has been found, but a more
detailed analysis is required to ensure that this is the case.
Future work involves extending the code to allow more physical insight to be gleaned from
it. As mentioned above, a better local order parameter could be used to identify regions
of close-packed or icosahedral order [???,GJA]. The effective diffusion of the particles could
also be measured easily, and could be used to probe caging phenomena as the density is
17
r/σ
0
2
4
6
8
g(r)
φ = 0.4, t = 0φ = 0.4, t = 860φ = 0.4, t = 2000φ = 0.55
Figure 7: Radial distribution function, g(r). Results for φ = 0.4 are instantaneous spatially-
averaged results from a simulation of N = 5, 000, 000 hard spheres. The results at φ = 0.55 are
taken from a smaller, N = 500, 000 simulation. In this case, as well as the spatial averaging, the
results for g(r) were also time-averaged, from samples taken every 100 MCS from t = tOP = 4800
MCS to t = 10, 000 MCS. The values at contact, g(σ), are consistent with those in the literature
for the hard-sphere fluid [???,???].
cubic domain decomposition will lead to excellent scaling for very large systems of parti-
cles. The parallel implementation does acquire some overheads in comparison to the serial
version, but these are not severe enough to cause poor scaling performance overall.
We have also presented a new algorithm for generating dense hard-sphere fluids, and that
is well suited to Monte Carlo simulation in general, and to this parallelisation scheme in
particular. The algorithm has been shown to efficiently generate dense random packings
along the metastable branch. No evidence of crystallization has been found, but a more
detailed analysis is required to ensure that this is the case.
Future work involves extending the code to allow more physical insight to be gleaned from
it. As mentioned above, a better local order parameter could be used to identify regions
of close-packed or icosahedral order [???,GJA]. The effective diffusion of the particles could
also be measured easily, and could be used to probe caging phenomena as the density is
17
Page 18
increased. The code can also be generalised to other interesting short-range potentials, and
well suited to studies of phenomena that require large simulation cells, such as the dynamics
of the interfaces between phases.
We have only considered the canonical ensemble, but extension to other thermodynamic
ensembles should be possible. For example, when modelling constant-pressure conditions,
a regular global update is required in order to attempt volume dilations [???]. This would
be easy to fit into the parallelization scheme presented here, implemented via a distributed
energy calculation that would synchronise the simulation in the same way as the current
global energy calculation.
We also note that this approach is not restricted to equilibrium simulation, as any Markov
process with only local dependencies on the acceptance of moves will yield to the same decom-
position. For example, by simply re-interpreting this hard-sphere Monte Carlo simulation
as a model of Brownian dynamics in the non-inertial (Stokesian) limit, we can consider the
current implementation to be a model of colloidal dynamics [???]. This could be applied, for
example, to direct simulation of gelation and gel aging, modelled as hard-spheres interacting
via a depletion potential. Furthermore, assuming that the effect of shear can be modelled as
bulk force that biases generation of trial moves, the framework could even be used to study
shear-induced phenomena [???].
Acknowledgments
This work was supported by [Grants? Ask Mike.] A.N.J. would like to thank M.E. Cates,
K. Stratford, D. Frenkel and E.F. MacDougall for useful discussions and feedback.
[1] Grant Heffelfinger. Parallel atomistic simulations. Computer Physics Communications, 128(1-
2):219–237, Jun 2000.
[2] A.E Brockwell. Parallel markov chain monte carlo simulation by pre-fetching. Journal of
Computational & Graphical Statistics, 15(1):246–261, Mar 2006.
[3] Uhlherr A. Parallel monte carlo simulations by asynchronous domain decomposition. Computer
Physics Communications, 155:31–41, 0.
18
well suited to studies of phenomena that require large simulation cells, such as the dynamics
of the interfaces between phases.
We have only considered the canonical ensemble, but extension to other thermodynamic
ensembles should be possible. For example, when modelling constant-pressure conditions,
a regular global update is required in order to attempt volume dilations [???]. This would
be easy to fit into the parallelization scheme presented here, implemented via a distributed
energy calculation that would synchronise the simulation in the same way as the current
global energy calculation.
We also note that this approach is not restricted to equilibrium simulation, as any Markov
process with only local dependencies on the acceptance of moves will yield to the same decom-
position. For example, by simply re-interpreting this hard-sphere Monte Carlo simulation
as a model of Brownian dynamics in the non-inertial (Stokesian) limit, we can consider the
current implementation to be a model of colloidal dynamics [???]. This could be applied, for
example, to direct simulation of gelation and gel aging, modelled as hard-spheres interacting
via a depletion potential. Furthermore, assuming that the effect of shear can be modelled as
bulk force that biases generation of trial moves, the framework could even be used to study
shear-induced phenomena [???].
Acknowledgments
This work was supported by [Grants? Ask Mike.] A.N.J. would like to thank M.E. Cates,
K. Stratford, D. Frenkel and E.F. MacDougall for useful discussions and feedback.
[1] Grant Heffelfinger. Parallel atomistic simulations. Computer Physics Communications, 128(1-
2):219–237, Jun 2000.
[2] A.E Brockwell. Parallel markov chain monte carlo simulation by pre-fetching. Journal of
Computational & Graphical Statistics, 15(1):246–261, Mar 2006.
[3] Uhlherr A. Parallel monte carlo simulations by asynchronous domain decomposition. Computer
Physics Communications, 155:31–41, 0.
18
Page 19
[4] A Uhlherr, SJ Leak, NE Adam, PE Nyberg, M Doxastakis, VG Mavrantzas, and
DN Theodorou. Large scale atomistic polymer simulations using monte carlo methods for
parallel vector processors. Computer Physics Communications, 144(1):1–22, 2002.
[5] Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, and Edward
Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., 21(6):1087–
1092, Jun 1953.
[6] JW Shu, Q Lu, WO Wong, and HC Huang. Parallelization strategies for monte carlo simulations
of thin film deposition. Computer Physics Communications, 144(1):34–45, 2002.
19
DN Theodorou. Large scale atomistic polymer simulations using monte carlo methods for
parallel vector processors. Computer Physics Communications, 144(1):1–22, 2002.
[5] Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, and Edward
Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., 21(6):1087–
1092, Jun 1953.
[6] JW Shu, Q Lu, WO Wong, and HC Huang. Parallelization strategies for monte carlo simulations
of thin film deposition. Computer Physics Communications, 144(1):34–45, 2002.
19
Sign up today - FREE
Mendeley saves you time finding and organizing research. Learn more
- All your research in one place
- Add and import papers easily
- Access it anywhere, anytime
Start using Mendeley in seconds!
Readership Statistics
1 Reader on Mendeley
by Discipline
by Academic Status
100% Researcher (at a non-Academic Institution)
by Country
100% United Kingdom



