Scalable Consistency in Scatter
Design (2011)
- ISBN: 9781450309776
- DOI: 10.1145/2043556.2043559
Available from dl.acm.org
or
Abstract
Distributed storage systems often trade off strong semantics for improved scalability. This paper describes the design, implementation, and evaluation of Scatter, a scalable and consistent distributed key-value storage system. Scatter adopts the highly decentralized and self-organizing structure of scalable peer-to-peer systems, while preserving linearizable consistency even under adverse circumstances. Our prototype implementation demonstrates that even with very short node lifetimes, it is possible to build a scalable and consistent system with practical performance.
Author-supplied keywords
Available from dl.acm.org
Page 1
Scalable Consistency in Scatter -
Scalable Consistency in Scatter Lisa Glendenning Ivan Beschastnikh Arvind Krishnamurthy Thomas Anderson Department of Computer Science & Engineering University of Washington ABSTRACT Distributed storage systems often trade off strong seman- tics for improved scalability. This paper describes the de- sign, implementation, and evaluation of Scatter, a scalable and consistent distributed key-value storage system. Scatter adopts the highly decentralized and self-organizing structure of scalable peer-to-peer systems, while preserving lineariz- able consistency even under adverse circumstances. Our prototype implementation demonstrates that even with very short node lifetimes, it is possible to build a scalable and consistent system with practical performance. Categories and Subject Descriptors H.3.4 [Information Storage and Retrieval]: Systems and Software—Distributed systems General Terms Design, Reliability Keywords Distributed systems, consistency, scalability, fault tolerance, storage, distributed transactions, Paxos 1. INTRODUCTION A long-standing and recurrent theme in distributed sys- tems research is the design and implementation of efficient and fault tolerant storage systems with predictable and well- understood consistency properties. Recent efforts in peer-to- peer (P2P) storage services include Chord [36], CAN [26], Pastry [30], OpenDHT [29], OceanStore [16], and Kadem- lia [22]. Recent industrial efforts to provide a distributed storage abstraction across data centers include Amazon’s Dynamo [10], Yahoo!’s PNUTS [8], and Google’s Megas- tore [1] and Spanner [9] projects. Particularly with geo- . graphic distribution, whether due to using multiple data centers or a P2P resource model, the tradeoffs between ef- ficiency and consistency are non-trivial, leading to systems that are complex to implement, complex to use, and some- times both. Our interest is in building a storage layer for a very large scale P2P system we are designing for hosting planetary scale social networking applications. Purchasing, installing, powering up, and maintaining a very large scale set of nodes across many geographically distributed data centers is an expensive proposition it is only feasible on an ongoing basis for those applications that can generate revenue. In much the same way that Linux offers a free alternative to commer- cial operating systems for researchers and developers inter- ested in tinkering, we ask: what is the Linux analogue with respect to cloud computing? P2P systems provide an attractive alternative, but first generation storage layers were based on unrealistic assump- tions about P2P client behavior in the wild. In practice, par- ticipating nodes have widely varying capacity and network bandwidth, connections are flaky and asymmetric rather than well-provisioned, workload hotspots are common, and churn rates are very high [27, 12]. This led to a choice for application developers: weakly consistent but scalable P2P systems like Kademlia and OpenDHT, or strongly consistent data center storage. Our P2P storage layer, called Scatter, attempts to bridge this gap – to provide an open-source, free, yet robust alter- native to data center computing, using only P2P resources. Scatter provides scalable and consistent distributed hash ta- ble key-value storage. Scatter is robust to P2P churn, het- erogeneous node capacities, and flaky and irregular network behavior. (We have left robustness to malicious behavior, such as against DDoS attacks and Byzantine faults, to fu- ture work.) In keeping with our goal of building an open system, an essential requirement for Scatter is that there be no central point of control for commercial interests to exploit. The base component of Scatter is a small, self-organizing group of nodes, each managing a range of keys, akin to a BigTable [6] tablet. A set of groups together partition the table space to provide the distributed hash table abstraction. Each group is responsible for providing consistent read/write access to its key range, and for reconfiguring as necessary to meet performance and availability goals. As nodes are added, as nodes fail, or as the workload changes for a re- gion of keys, individual groups must merge with neighbor- ing groups, split into multiple groups, or shift responsibil- 15
Page 2
ity over parts of the key space to neighboring groups, all while maintaining consistency. A lookup overlay topology connects the Scatter groups in a ring, and groups execute distributed transactions in a decentralized fashion to modify the topology consistently and atomically. A key insight in the design of Scatter is that the consistent group abstraction provides a stable base on which to layer the optimizations needed to maintain overall system perfor- mance and availability goals. While existing popular DHTs have difficulty maintaining consistent routing state and con- sistent name space partitioning in the presence of high churn, these properties are a direct consequence of Scatter’s design. Further, Scatter can locally adjust the amount of replication, or mask a low capacity node, or merge/split groups if a par- ticular Scatter group has an unusual number of weak/strong nodes, all without compromising the structural integrity of the distributed table. Of course, some applications may tolerate weaker consis- tency models for application data storage [10], while other applications have stronger consistency requirements [1]. Scat- ter is designed to support a variety of consistency models for application key storage. Our current implementation provides linearizable storage within a given key we support cross-group transactions for consistent updates to meta-data during group reconfiguration, but we do not attempt to lin- earize multi-key application transactions. These steps are left for future work however, we believe that the Scatter group abstraction will make them straightforward to imple- ment. We evaluate our system in a variety of configurations, for both micro-benchmarks and for a Twitter-style application. Compared to OpenDHT, a publicly accessible open-source DHT providing distributed storage, Scatter provides equiva- lent performance with much better availability, consistency, and adaptability. We show that we can provide practical distributed storage even in very challenging environments. For example, if average node lifetimes are as short as three minutes, therefore triggering very frequent reconfigurations to maintain data durability, Scatter is able to maintain over- all consistency and data availability, serving its reads in an average of 1.3 seconds in a typical wide area setting. 2. BACKGROUND Scatter’s design synthesizes techniques from both highly scalable systems with weak guarantees and strictly consis- tent systems with limited scalability, to provide the best of both worlds. This section overviews the two families of dis- tributed systems whose techniques we leverage in building Scatter. Distributed Hash Tables (DHTs): DHTs are a class of highly distributed storage systems providing scalable, key based lookup of objects in dynamic network environments. As a distributed systems building primitive, DHTs have proven remarkably versatile, with application developers hav- ing leveraged scalable lookup to support a variety of dis- tributed applications. They are actively used in the wild as the infrastructure for peer-to-peer systems on the order of millions of users. In a traditional DHT, both application data and node IDs are hashed to a key, and data is stored at the node whose hash value immediately precedes (or follows) the key. In many DHTs, the node storing the key’s value replicates A D join leave a1. a2. C D A B C A B C D join leave A C D b1. b2. (a) Key Assignment Violation (b) Routing Violation Figure 1: Two examples demonstrating how (a) key assignment consistency and (b) routing integrity may be violated in a traditional DHT. Bold lines in- dicate key assignment and are associated with nodes. Dotted lines indicate successor pointers. Both sce- narios arise when nodes join and leave concurrently, as pictured in (a1) and (b1). The violation in (a2) may result in clients observing inconsistent key val- ues, while (b2) jeopardizes overlay connectivity. the data to its neighbors for better reliability and availabil- ity [30]. Even so, many DHTs suffer inconsistencies in cer- tain failure cases, both in how keys are assigned to nodes, and in how requests are routed to keys, yielding inconsistent results or reduced levels of availability. These issues are not new [12, 4] we recite them to provide context for our work. Assignment Violation: A fundamental DHT correctness prop- erty is for each key to be managed by at most one node. We refer to this property as assignment consistency. This prop- erty is violated when multiple nodes claim ownership over the same key. In the figure, a section of a DHT ring is man- aged by three nodes, identified by their key values A, B, and C. A new node D joins at a key between A and B and takes over the key-range (A,D]. However, before B can let C know of this change in the key-range assignment, B fails. Node C detects the failure and takes over the key-range (A,B] maintained by B. This key-range, however, includes keys maintained by D. As a result, clients accessing keys in (A,D] may observe inconsistent key values depending on whether they are routed to node C or D. Routing Violation: Another basic correctness property stip- ulates that the system maintains consistent routing entries at nodes so that the system can route lookup requests to the appropriate node. In fact, the correctness of certain links is essential for the overlay to remain connected. For example, the Chord DHT relies on the consistency of node successor pointers (routing table entries that reference the next node in the key-space) to maintain DHT connectiv- ity [35]. Figure 1b illustrates how a routing violation may occur when node joins and leaves are not handled atomi- cally. In the figure, node D joins at a key between B and C, and B fails immediately after. Node D has a successor pointer correctly set to C, however, A is not aware of D and incorrectly believes that C is its successor (When a succes- sor fails, a node uses its locally available information to set its successor pointer to the failed node’s successor). In this scenario, messages routed through A to keys maintained by D will skip over node D and will be incorrectly forwarded to node C. A more complex routing algorithm that allows 16
Readership Statistics
69 Readers on Mendeley
by Discipline
3% Engineering
by Academic Status
58% Ph.D. Student
16% Student (Master)
6% Other Professional
by Country
38% United States
12% China
7% United Kingdom
Sign up today - FREE
Mendeley saves you time finding and organizing research. Learn more
- All your research in one place
- Add and import papers easily
- Access it anywhere, anytime




