Sign up & Download
Sign in

I Do Declare : Consensus in a Logic Language

by Peter Alvaro, Tyson Condie, Neil Conway, Joseph M Hellerstein, Russell Sears
Russell The Journal Of The Bertrand Russell Archives (2010)

Abstract

The Paxos consensus protocol can be specified concisely, but is notoriously difficult to implement in practice. We recount our experience building Paxos in Overlog, a distributed declarative programming language. We found that the Paxos algorithm is easily translated to declarative logic, in large part because the primitives used in consensus protocol specifications map directly to simple Overlog constructs such as aggregation and selection. We discuss the programming idioms that appear frequently in our implementation, and the applicability of declarative programming to related application domains.

Cite this document (BETA)

Available from portal.acm.org
Page 1
hidden

I Do Declare : Consensus in a Logic Language

I Do Declare: Consensus in a Logic Language
Peter Alvaro Tyson Condie Neil Conway
Joseph M. Hellerstein Russell Sears
{palvaro,tcondie,nrc,hellerstein,sears}@cs.berkeley.edu
UC Berkeley
ABSTRACT
The Paxos consensus protocol can be speci ed concisely,
but is notoriously dicult to implement in practice. We
recount our experience building Paxos in Overlog, a dis-
tributed declarative programming language. We found that
the Paxos algorithm is easily translated to declarative logic,
in large part because the primitives used in consensus proto-
col speci cations map directly to simple Overlog constructs
such as aggregation and selection. We discuss the program-
ming idioms that appear frequently in our implementation,
and the applicability of declarative programming to related
application domains.
1. INTRODUCTION
Consensus protocols are a common building block for fault-
tolerant distributed systems [2]. Paxos is a widely-used con-
sensus protocol, rst described by Lamport [6, 7]. While
Paxos is conceptually simple, practical implementations are
dicult to achieve, and typically require thousands of lines
of carefully written code [1, 4, 9].
Much of this implementation diculty arises because high-
level protocol speci cations must be translated into low-level
imperative code, yielding a signi cant increase in program
size and complexity. In practical implementations of Paxos,
the simplicity of the consensus algorithm is obscured by com-
mon but often tricky details such as event loops, timer in-
terrupts, explicit concurrency, and the serialization and per-
sistence of data structures.
By contrast, consensus protocols such as two-phase com-
mit and Paxos are speci ed in the literature at a high level,
in terms of messages, invariants, and state machine transi-
tions. Overlog supports each of these concepts directly. By
using a declarative language to implement consensus pro-
Jim Gray observed about the two-phase commit protocol:
\It is very similar to the wedding ceremony in which the
minister asks `Do you?' and the participants say `I do' (or
`No way!') and then the minister says `I now pronounce
you,' or `The deal is o ."' [3]
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
NetDB ’09 Big Sky, Montana USA
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$10.00.
tocols, we hoped to achieve a more concise implementation
that is conceptually closer to the original protocol speci -
cation. We discuss our Paxos implementation below, and
describe how we mapped concepts from the Paxos literature
into executable Overlog code.1 We re
ect on the design
patterns that we discovered while building this classical dis-
tributed service in a declarative language. The process of
identifying these patterns helped us better understand why
a declarative networking language is well-suited to program-
ming distributed systems. It has also clari ed our thinking
about the more general challenge of designing a language for
distributed computing.
Earlier literature described the modular decomposition of
the Paxos protocol in terms of Timed I/O Automata [12].
Our approach di ers in its grounding in database and logic
languages rather than explicit representations of state ma-
chines, and in its aim to produce executable code rather
than to model abstract systems. Szekely and Torres imple-
mented the Synod protocol (the kernel of Paxos) in Over-
log [14]. However, that work did not address important de-
tails such as Multipaxos, log replication, reconciliation, and
leader election. Here, we describe a complete Paxos imple-
mentation that addresses these issues.
1.1 Overlog
Overlog is a logic language based on Datalog. Datalog
programs consist of rules that take the form:
head(A, C) :- clause1(A, B), clause2(B, C);
where head, clause1, and clause2 are relations, \:-" de-
notes implication (() and \," denotes conjunction. A rule
may have any number of clauses, but only a single head.
Variables are denoted by identi ers that begin with an up-
percase letter, or by the symbol \_", which indicates that
the value of the variable will not be used in the rule. The
example rule ensures that the head relation contains a tu-
ple fA;Cg for each tuple fA;Bg in clause1 and fB;Cg in
clause2 where the tuples have the same value for B. It does
so by computing the join of clause1 and clause2 on B, and
projecting A and C. A Datalog program begins with some
base tuples, and derives new tuples by evaluating rules in
a bottom-up fashion (substituting tuples in the clause re-
lations to derive new tuples in the head relations) until no
more derivations can be made. Such a computation is called
a xpoint. A set of rules essentially expresses the constraint
1The Overlog source code for the Paxos implementation
we describe in this paper can be found at http://db.cs.
berkeley.edu/netdb-09/.
Page 2
hidden
/* Count number of peers */
peer_cnt(Coordinator, count<Peer>) :-
peers(Coordinator, Peer);
/* Count number of "yes" votes */
yes_cnt(Coordinator, TxnId, count<Peer>) :-
vote(Coordinator, TxnId, Peer, Vote),
Vote == "yes";
/* Prepare => Commit if unanimous */
transaction(Coordinator, TxnId, "commit") :-
peer_cnt(Coordinator, NumPeers),
yes_cnt(Coordinator, TxnId, NumYes),
transaction(Coordinator, TxnId, State),
NumPeers == NumYes, State == "prepare";
/* Prepare => Abort if any "no" votes */
transaction(Coordinator, TxnId, "abort") :-
vote(Coordinator, TxnId, _, Vote),
transaction(Coordinator, TxnId, State),
Vote == "no", State == "prepare";
/* All peers know transaction state */
transaction(@Peer, TxnId, State) :-
peers(@Coordinator, Peer),
transaction(@Coordinator, TxnId, State);
Figure 1: 2PC coordinator protocol in Overlog. The
DDL for transaction (not shown) speci es that the
rst two columns are a primary key.
that base facts and their transitive consequences will always
be consistent at xpoint.
Overlog computes a new xpoint whenever new tuples ar-
rive at a node. Overlog programs accept input from network
events, timers, and native methods, each of which may pro-
duce new tuples. Because evaluation of an Overlog program
proceeds in discrete time steps, rules may be interpreted as
invariants over state: the consistency of the rule speci ca-
tions will be true at every xpoint.
Network communication is expressed using a simple ex-
tension to the Datalog syntax:
recv_msg(@A, Payload) :-
send_msg(@B, Payload), peers(@B, A);
@ denotes the location speci er eld of a relation, which indi-
cates that the associated variables A and B contain network
addresses. A tuple moves between nodes if the address in
its location speci er is distinct from the address of the node
that deduced the tuple.
It is often useful to compute an aggregate over a set of
tuples, typically to choose an element of the set with a par-
ticular property (e.g. min, max) or to compute a summary
statistic over the set (e.g. count, sum). For example:
min_msg(min<SeqNum>) :-
queued_msgs(SeqNum, _);
de nes an aggregate relation that contains the smallest se-
quence number among the queued messages, and
next_msg(Payload) :-
queued_msgs(SeqNum, Payload),
min_msg(SeqNum);
states that the content of next_msg is the payload of the
queued message with the smallest sequence number. This
pair of rules is equivalent to the SQL statement:
/* Declare a timer that fires once per second */
timer(ticker, 1000ms);
/* Start counter when TxnId is in "prepare" state */
tick(Coordinator, TxnId, Count) :-
transaction(Coordinator, TxnId, State),
State == "prepare",
Count := 0;
/* Increment counter every second */
tick(Coordinator, TxnId, NewCount) :-
ticker(),
tick(Coordinator, TxnId, Count),
NewCount := Count + 1;
/* If not committed after 10 sec, abort TxnId */
transaction(Coordinator, TxnId, "abort") :-
tick(Coordinator, TxnId, Count),
transaction(Coordinator, TxnId, State),
Count > 10, State == "prepare";
Figure 2: Timeout-based abort. The rst two
columns of tick are a primary key.
SELECT payload FROM queued_msgs
WHERE seqnum =
(SELECT min(seqnum) FROM queued_msgs);
We encountered this pattern of selection over aggregation
frequently when implementing consensus protocols.
Finally, Overlog allows special timer relations to be de-
ned. The Overlog runtime inserts a tuple into each timer
relation at a user-de ned period, and the predicate holds
only at these intervals. Thus, joining against a timer rela-
tion allows for periodic evaluation of a rule.
2. TWO-PHASE COMMIT
Before tackling Paxos, we used Overlog to build two-phase
commit (2PC), a simple consensus protocol that decides on
a series of Boolean values (\commit" or \abort"). Unlike
Paxos, 2PC does not attempt to make progress in the face
of node failures.
Both Paxos and 2PC are based on rounds of messaging
and counting. In 2PC, the coordinator node communicates
the state of a transaction to the peer nodes. When the
transaction state transitions to \prepare" at a peer node, the
peer responds with a \yes" or \no" vote. The coordinator
counts these responses; if all peers respond \yes" then the
transaction commits. Otherwise it aborts. In terms of the
Overlog primitives described above, this is just messaging,
followed by a count aggregate, and a selection for the string
\no" in the peers' responses.
The mechanism that implements this protocol follows di-
rectly from the speci cation (Figure 1). The peer_cnt table
contains the coordinator address and the number of peers.
When vote messages arrive, the second and fourth rules are
considered. If the fourth rule is satis ed (with a single \no"
vote), the transaction state is updated to \abort"; otherwise,
yes_cnt is incremented to re
ect another positive vote for
this transaction. If yes_cnt equals peer_cnt, the vote is
unanimous and the transaction moves to the\commit"state.
The fth rule communicates changes to transaction state to
every peer node.
A practical 2PC implementation must address two addi-
tional details: timeouts and persistence. Timeouts allow the
coordinator to return an error if the peers take too long to

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

10 Readers on Mendeley
by Discipline
 
by Academic Status
 
30% Other Professional
 
30% Ph.D. Student
 
20% Researcher (at a non-Academic Institution)
by Country
 
60% United States
 
10% China
 
10% Mexico