Gödel: Unified Large-Scale Resource Management and Scheduling at ByteDance

Abstract

Over the last few years, ByteDance’s compute infrastructure has expanded significantly, driven by rapid business growth. To meet this hyper-scale growth, some business groups resorted to managing their own compute infrastructure stacks running different scheduling systems, such as Kubernetes and YARN. This created two major pain points: increasing resource fragmentation across business groups and inadequate resource elasticity between workloads of different business priorities. Isolation across business groups (and their compute infrastructure management) leads to inefficient compute resource utilization and prevents us from serving business growth needs in the long run. To address these challenges, we propose a resource management and scheduling system named Gödel, which provides a unified compute infrastructure for all business groups to run their diverse workloads in a single resource pool. It co-locates various workloads on every machine to achieve better resource utilization and elasticity. Gödel is built upon Kubernetes, the de facto open-source container orchestration system, but with significant components replaced or enhanced to accommodate various workloads at large scale. In production, it manages clusters with tens of thousands of machines, achieves high overall resource utilization of over 60%, and reaches a scheduling throughput of up to 5000 pods per second. This paper reports on the design and implementation of Gödel. It also discusses the lessons and best practices we learned in developing and operating it in production at ByteDance’s scale.

Cite

Xiang, W., Li, Y., Ren, Y., Jiang, F., Xin, C., Gupta, V., … Liang, Y. (2023). Gödel: Unified Large-Scale Resource Management and Scheduling at ByteDance. In SoCC 2023 - Proceedings of the 2023 ACM Symposium on Cloud Computing (pp. 308–323). Association for Computing Machinery, Inc. https://doi.org/10.1145/3620678.3624663
