Distributed in-memory caching is a key component of modern Internet services. Such caches are often accessed via remote procedure call (RPC), as RPC frameworks provide rich support for productionization, including protocol versioning, memory efficiency, auto-scaling, and hitless upgrades. However, full-featured RPC limits performance and scalability because it incurs high latency and CPU overhead. Remote Memory Access (RMA) offers a promising alternative, but meeting productionization requirements can be a significant challenge for RMA-based systems due to limited programmability and narrow RMA primitives. This paper describes the design and implementation of CliqueMap, a hybrid RMA/RPC caching system, and the experience derived from operating it. CliqueMap has been in production use in Google's datacenters for over three years, currently serves more than 1 PB of DRAM, and underlies several end-user-visible services. CliqueMap uses performant and efficient RMAs on the critical serving path and judiciously applies RPCs to other functionality. The design embraces lightweight replication, client-based quoruming, self-validating server responses, per-operation client-side retries, and co-design with the network layers. These foci lead to a system resilient to the rigors of production and to frequent post-deployment evolution.
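The abstract names client-based quoruming, self-validating responses, and per-operation client-side retries without describing their mechanics. The following is only a minimal illustrative sketch of how such ideas could fit together, not CliqueMap's actual implementation. Every name here (`Replica`, `Entry`, `quorum_get`, `checksum`) is hypothetical, the use of SHA-256 and a simple majority rule are assumptions, and an in-process dictionary lookup stands in for a one-sided remote memory read.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass
from typing import Optional


def checksum(key: bytes, value: bytes, version: int) -> str:
    """Digest over key, version, and value (SHA-256 is an assumption)."""
    h = hashlib.sha256()
    h.update(len(key).to_bytes(4, "big") + key)
    h.update(version.to_bytes(8, "big"))
    h.update(value)
    return h.hexdigest()


@dataclass(frozen=True)
class Entry:
    key: bytes
    value: bytes
    version: int
    digest: str


class Replica:
    """Hypothetical stand-in for a server whose memory a client reads directly."""

    def __init__(self) -> None:
        self.slots: dict[bytes, Entry] = {}

    def write(self, key: bytes, value: bytes, version: int) -> None:
        self.slots[key] = Entry(key, value, version, checksum(key, value, version))

    def raw_read(self, key: bytes) -> Optional[Entry]:
        # Models a read with no server-side logic: the client gets whatever
        # bytes are in the slot, possibly stale or partially updated.
        return self.slots.get(key)


def is_valid(entry: Optional[Entry], key: bytes) -> bool:
    """Self-validation: the client checks the response without trusting the server."""
    return (
        entry is not None
        and entry.key == key
        and entry.digest == checksum(entry.key, entry.value, entry.version)
    )


def quorum_get(replicas: list[Replica], key: bytes, max_attempts: int = 3) -> Optional[bytes]:
    """Return a value only if a majority of replicas agree on a valid version."""
    needed = len(replicas) // 2 + 1
    for _ in range(max_attempts):  # per-operation retry, driven by the client
        valid = [e for e in (r.raw_read(key) for r in replicas) if is_valid(e, key)]
        votes = Counter(e.version for e in valid)
        if votes:
            version, count = votes.most_common(1)[0]
            if count >= needed:
                return next(e.value for e in valid if e.version == version)
    return None  # treated as a cache miss by the caller
```

In this sketch the client, not the server, decides whether a response is usable: an entry whose digest does not match its contents is simply discarded, and the read succeeds only when enough replicas agree on the same version.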
Singhvi, A., Akella, A., Anderson, M., Cauble, R., Deshmukh, H., Gibson, D., … Vahdat, A. (2021). CliqueMap: Productionizing an RMA-based distributed caching system. In SIGCOMM 2021 - Proceedings of the ACM SIGCOMM 2021 Conference (pp. 93–105). Association for Computing Machinery, Inc. https://doi.org/10.1145/3452296.3472934