Abstract
We describe our experience developing what we believe to be the world's first large-scale production deployments of lightwave fabrics used for both datacenter networking and machine-learning (ML) applications. Using optical circuit switches (OCSes) and optical transceivers developed in-house, we employ hardware and software codesign to integrate the fabrics into our network and computing infrastructure. Key to our design is a high degree of multiplexing enabled by new kinds of wavelength-division-multiplexing (WDM) and optical circulators that support high-bandwidth bidirectional traffic on a single strand of optical fiber. The development of the requisite OCS and optical transceiver technologies leads to a synchronous lightwave fabric that is reconfigurable, low latency, rate agnostic, and highly available. These fabrics have provided substantial benefits for long-lived traffic patterns in our datacenter networks and predictable traffic patterns in tightly-coupled machine learning clusters. We report results for a large-scale ML superpod with 4096 tensor processing unit (TPU) V4 chips that has more than one ExaFLOP of computing power. For this use case, the deployment of a lightwave fabric provides up to 3× better system availability and model-dependent performance improvements of up to 3.3× compared to a static fabric, despite constituting less than 6% of the total system cost.
Citation
Liu, H., Urata, R., Yasumura, K., Zhou, X., Bannon, R., Berger, J., … Vahdat, A. (2023). Lightwave Fabrics: At-Scale Optical Circuit Switching for Datacenter and Machine Learning Systems. In SIGCOMM 2023 - Proceedings of the ACM SIGCOMM 2023 Conference (pp. 499–515). Association for Computing Machinery, Inc. https://doi.org/10.1145/3603269.3604836