Member post by ByteDance

Background

Since its open-source release in 2014, Kubernetes has rapidly become the de facto standard for container orchestration. The infrastructure team at ByteDance adopted Kubernetes early on to build the company's private cloud platform. Over the years, ByteDance's rapid growth across various business lines, including microservices, recommendation/advertising/search services, machine learning & big data, and storage, has led to a substantial increase in demand for computing resources.

[Table: overview of ByteDance's business lines: microservices, search, ads, recommendation, AI & big data, and storage]

Initially, ByteDance managed its online and offline workloads in separate resource pools, each dedicated to distinct business segments. To accommodate the surge in online demand during significant holidays and major events, the infrastructure team usually had to plan ahead, reallocating resources from offline pools to online pools to bolster capacity for the increased online activity. While this temporary fix satisfied immediate requirements, the inter-pool resource borrowing process proved time-consuming, operationally heavy, and inefficient. Furthermore, keeping online and offline workloads in separate pools forfeited the cost savings that colocation could offer and left little room to improve resource utilization. The infrastructure team therefore sought to implement a unified system for scheduling and managing both online and offline workloads, aiming to facilitate resource pooling, enhance resource utilization and elasticity, optimize costs and user experience, and alleviate operational burdens.

Practice of Unified Scheduling

Enhancements beyond the Kubernetes Default Scheduler

Since adopting Kubernetes at scale in 2018, ByteDance has continuously optimized various Kubernetes components for functionality and performance. However, with the containerization of recommendation/advertising/search services in 2019, the native Kubernetes scheduler fell increasingly short of ByteDance's business requirements in both functionality and performance. Functionally, more granular resource scheduling capabilities and more flexible preemption strategies were required. Performance-wise, the native Kubernetes default scheduler could only achieve a scheduling throughput of around 10 pods per second in a cluster of 5,000 nodes, which often bottlenecked business upgrades and fell far short of requirements. The team therefore introduced a number of key optimizations to the Kubernetes default scheduler, including:

Functionality:

Performance:

By implementing the aforementioned optimizations, we successfully enhanced our containerization capabilities to meet ByteDance’s rapidly expanding needs. This led to a remarkable 30-fold increase in scheduling throughput. That is, in a cluster comprising 10,000 nodes, we consistently achieved a scheduling rate of 300 pods per second.

Gödel Scheduler

In 2020, ByteDance initiated a unified scheduling and resource management project for both online and offline business operations. The objective was to enhance overall resource utilization, improve operational efficiency, and reduce maintenance overhead. Initially, the plan was to manage both online and offline tasks through a single scheduling system. However, this approach presented challenges, primarily due to the intricate nature of offline scheduling, which differs markedly from online scheduling, especially in its demand for high throughput.

The native Kubernetes scheduler, primarily designed for Pod-level scheduling, was somewhat limited in its support for more complex “Job” scheduling semantics and encountered performance limitations when dealing with these higher-level demands. To effectively address these unique requirements and to better align with the diverse operational needs of ByteDance, the decision was made to develop a bespoke, in-house distributed scheduler. This led to the creation of the Gödel Scheduler, specifically tailored to integrate with the Kubernetes system and to handle the demanding and varied scheduling needs of ByteDance’s expansive and evolving business landscape.

The Gödel Scheduler is a distributed system crafted to consolidate the scheduling of both online and offline workloads. This scheduler is an enhancement of the Kubernetes (K8s) Scheduler, designed to augment scalability and improve scheduling quality. It is adept at fulfilling the diverse functional and performance demands of ByteDance's online and offline operations.

Below is the architecture diagram of Gödel Scheduler.

[Figure: the Gödel Scheduler architecture]

As outlined, the Gödel Scheduler consists of three primary components: the Dispatcher, the Scheduler, and the Binder. Key to its architecture is the Scheduler component, which is typically deployed in multiple shards to facilitate optimistic concurrency scheduling. This multi-shard deployment enhances its efficiency and scalability. On the other hand, the Dispatcher and the Binder are each deployed as single instances, a configuration that suits their specific roles and responsibilities within the Gödel Scheduler system.
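To make this division of labor concrete, here is a minimal Go sketch of the optimistic-concurrency idea. All types and names below are our own illustrative stand-ins, not the project's actual API: each Scheduler shard proposes a placement independently, and the single Binder commits a proposal only if no concurrent commit has already consumed the node's capacity.

```go
package main

import "fmt"

// Illustrative types only; the real Gödel Scheduler API differs.
type Proposal struct {
	Pod      string
	Node     string
	MilliCPU int64
}

// Binder-side view of the remaining free capacity on each node.
type Binder struct {
	freeMilliCPU map[string]int64
}

// Commit accepts a shard's proposal only if the pod still fits on the
// node, i.e. no concurrent commit from another shard took the capacity.
func (b *Binder) Commit(p Proposal) bool {
	if b.freeMilliCPU[p.Node] < p.MilliCPU {
		return false // conflict: reject; the pod goes back for rescheduling
	}
	b.freeMilliCPU[p.Node] -= p.MilliCPU
	return true
}

func main() {
	b := &Binder{freeMilliCPU: map[string]int64{"node-1": 1000}}
	// Two shards optimistically chose the same node for different pods.
	fmt.Println(b.Commit(Proposal{"pod-a", "node-1", 800})) // true
	fmt.Println(b.Commit(Proposal{"pod-b", "node-1", 800})) // false
}
```

The real Binder performs much richer conflict detection, but this accept-or-reject pattern is the essence of optimistic concurrency: shards never lock nodes while scheduling.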

Dispatcher

The Dispatcher plays a pivotal role in managing application queuing, distribution, and node partitioning. It comprises several key components:

  1. Sort Policy Manager: This module handles the queuing of applications. Currently, it implements FIFO and DRF/FairShare queuing strategies, the latter still pending production use. Future enhancements will introduce more sophisticated queuing strategies, including those based on priority values.
  2. Dispatching Policy Manager: Its primary function is to allocate applications across various Scheduler instances. At present, the LoadBalance strategy is employed as the default. Future updates aim to make this feature more versatile and plugin-based.
  3. Node Shuffler: This component is tasked with organizing cluster nodes relative to the number of Scheduler instances. It assigns each node to a specific node partition, with each Scheduler instance overseeing one partition. During the scheduling process, a Scheduler first considers nodes within its partition before exploring nodes in other partitions. This arrangement is dynamically adjusted in response to changes in node availability or Scheduler count.
  4. Partition Rules: Currently, the system strives for an even distribution of nodes among Scheduler instances. Plans are underway to expand these partition strategies, enhancing their configurability.
  5. Scheduler Maintainer: This element is responsible for monitoring the status of Scheduler instances. It tracks aspects like health status, workload, and the node count within each partition.
  6. Reconciler: Operating periodically, the Reconciler oversees the status of various elements like Pods, Nodes, Schedulers, and SchedulingUnits. It addresses any errors, discrepancies, or deficiencies, ensuring system integrity and performance.
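As a rough illustration of how the Node Shuffler and the default LoadBalance dispatching policy described above might behave, the following Go sketch (the function names and logic are simplified assumptions, not the Dispatcher's actual implementation) partitions nodes evenly across scheduler instances and routes each new application to the least-loaded shard.

```go
package main

import "fmt"

// shuffleNodes assigns nodes to partitions round-robin so that each of
// the numSchedulers instances owns roughly the same number of nodes.
func shuffleNodes(nodes []string, numSchedulers int) map[int][]string {
	partitions := make(map[int][]string, numSchedulers)
	for i, n := range nodes {
		partitions[i%numSchedulers] = append(partitions[i%numSchedulers], n)
	}
	return partitions
}

// dispatch mimics a LoadBalance policy: route the next application to
// the scheduler shard that currently has the fewest pending pods.
func dispatch(pendingPerShard []int) int {
	best := 0
	for shard, load := range pendingPerShard {
		if load < pendingPerShard[best] {
			best = shard
		}
	}
	return best
}

func main() {
	fmt.Println(shuffleNodes([]string{"n1", "n2", "n3", "n4", "n5"}, 2))
	fmt.Println(dispatch([]int{12, 4, 9})) // shard 1 gets the next app
}
```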

Scheduler

The Scheduler plays a critical role in the decision-making process for scheduling and preempting applications, although it does not execute these decisions itself (that task is handled by the Binder). It operates on a two-tier framework: the Unit Scheduling Framework and the Pod Scheduling Framework. The entire scheduling procedure is segmented into three principal phases: Node Organizing, Unit Scheduling, and Unit Preempting.

  1. Node Organizing: This phase involves filtering and sorting nodes to streamline the scheduling process and enhance its quality, and it relies on two types of plugins.
  2. Unit Scheduling: In this stage, nodes that pass the Node Organizing plugins are matched against application requests and scored, a process analogous to the Kubernetes (K8s) Scheduler framework.
  3. Unit Preempting: If no suitable nodes are found during the Unit Scheduling phase, the Scheduler proceeds to the preemption phase, where it tries to free up resources by preempting running application instances to make room for new ones.
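The Go sketch below walks through the three phases for a single request, using heavily simplified stand-in types of our own (the real framework's plugins and scoring are far richer): prune nodes that can never fit, score the survivors, and fall back to preemption only when no feasible node remains.

```go
package main

import "fmt"

// Simplified stand-ins for the framework's real types.
type Node struct {
	Name     string
	FreeCPU  int64 // free CPU in millicores
	Priority int   // priority of the lowest-priority pod running on it
}

// scheduleUnit walks the three phases for a single request asking for
// `cpu` millicores at priority `prio`.
func scheduleUnit(nodes []Node, cpu int64, prio int) (string, bool) {
	// Phase 1: Node Organizing - prune nodes that can never fit the request.
	var candidates []Node
	for _, n := range nodes {
		if n.FreeCPU >= cpu {
			candidates = append(candidates, n)
		}
	}
	// Phase 2: Unit Scheduling - score the candidates; here the node
	// with the most free CPU simply wins.
	if len(candidates) > 0 {
		best := candidates[0]
		for _, n := range candidates[1:] {
			if n.FreeCPU > best.FreeCPU {
				best = n
			}
		}
		return best.Name, true
	}
	// Phase 3: Unit Preempting - fall back to a node whose running work
	// has lower priority; its victims would later be evicted by the Binder.
	for _, n := range nodes {
		if n.Priority < prio {
			return n.Name, true
		}
	}
	return "", false
}

func main() {
	nodes := []Node{{"n1", 500, 10}, {"n2", 0, 3}}
	fmt.Println(scheduleUnit(nodes, 1000, 8)) // n2 true (via preemption)
}
```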

Binder

The Binder plays a crucial role in the final stages of the scheduling process, focusing on conflict detection, preemptive operations, and executing the binding of applications to resources. It comprises three main components: ConflictResolver, PreemptionOperator, and UnitBinder.

  1. ConflictResolver: This component detects concurrent conflicts in the scheduling process and operates in two modes.
  2. PreemptionOperator: In scenarios where no conflict exists but preemption is necessary, this operator takes charge. It executes the preemption by deleting the victims (the application instances that must be terminated to free up resources) and then awaits the final scheduling.
  3. UnitBinder: This part of the Binder handles the preparatory work required before binding, such as dynamically creating storage volumes, and then carries out the actual binding operation, linking applications to the designated resources.
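Putting the three components together, a simplified Binder pass might look like the Go sketch below. Again, the types and logic are illustrative assumptions rather than the real implementation: conflicts are resolved first, victims are preempted next, and only then is the binding committed.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative decision produced by a Scheduler shard; not the real API.
type Decision struct {
	Pod     string
	Node    string
	Victims []string // pods to preempt; empty when none are needed
}

// binder tracks which pod currently holds each node in this toy model.
type binder struct {
	reserved map[string]string
}

// Bind runs the three stages: conflict resolution, preemption, binding.
func (b *binder) Bind(d Decision) error {
	// 1. ConflictResolver: reject if another pod already took the node.
	if owner, ok := b.reserved[d.Node]; ok && owner != d.Pod {
		return errors.New("conflict: node already assigned, pod will be rescheduled")
	}
	// 2. PreemptionOperator: delete the victims before binding.
	for _, v := range d.Victims {
		fmt.Println("evicting", v)
	}
	// 3. UnitBinder: prepare (e.g. volumes) and commit the binding.
	b.reserved[d.Node] = d.Pod
	fmt.Println("bound", d.Pod, "to", d.Node)
	return nil
}

func main() {
	b := &binder{reserved: map[string]string{}}
	fmt.Println(b.Bind(Decision{Pod: "pod-a", Node: "node-1"})) // nil
	fmt.Println(b.Bind(Decision{Pod: "pod-b", Node: "node-1"})) // conflict
}
```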

Notably, the current version of the Binder integrates a PodGroup controller, which manages the state and lifecycle of PodGroups. In a future version, we plan to move this functionality out of the Binder and into an independent controller.
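For readers unfamiliar with PodGroups, the Go sketch below conveys the gang-scheduling state such a controller tracks. The PodGroup type and phases here are simplified stand-ins invented for illustration, not the actual CRD in the repository: a group leaves the Pending phase only once at least MinMember of its pods have been scheduled.

```go
package main

import "fmt"

// Simplified stand-in for a PodGroup object; the real CRD in the
// godel-scheduler repository has more fields and phases.
type PodGroup struct {
	Name      string
	MinMember int    // gang size: pods that must be placed before any run
	Phase     string // only Pending and Scheduled exist in this sketch
}

// reconcile is the kind of step a standalone PodGroup controller would
// perform on each pass: derive the group's phase from the number of
// pods that have been scheduled so far.
func reconcile(pg *PodGroup, scheduledPods int) {
	if scheduledPods >= pg.MinMember {
		pg.Phase = "Scheduled"
	} else {
		pg.Phase = "Pending"
	}
}

func main() {
	pg := &PodGroup{Name: "training-job", MinMember: 4, Phase: "Pending"}
	reconcile(pg, 2)
	fmt.Println(pg.Phase) // Pending: the gang is not yet complete
	reconcile(pg, 4)
	fmt.Println(pg.Phase) // Scheduled: the all-or-nothing condition is met
}
```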

Experience

Over the past two years, the Gödel Scheduler has been a cornerstone within ByteDance, offering a wealth of scheduling features and semantics. It has efficiently and reliably supported the operations of ByteDance’s diverse and complex business workloads.

Building on these architectural enhancements, ByteDance has implemented deep performance optimizations drawing on its experience with the native Kubernetes scheduler. Integrated with ByteDance's internally refined Kubernetes system, the Gödel Scheduler now delivers impressive throughput: 2,000+ pods/s in a single shard and 5,000+ pods/s across multiple shards. ByteDance's ongoing efforts to expand single-cluster capacity have culminated in its largest production cluster exceeding 20,000 nodes and 1,000,000 pods.

After years of thorough internal practice and refinement within ByteDance, the Gödel Scheduler has reached a state of relative stability. In 2023, SoCC, a premier cloud computing conference, accepted our paper on the Gödel Scheduler, which presents ByteDance's unified approach to large-scale resource management and scheduling, and the R&D team was invited to present the work at the conference. For those interested, the paper is available at https://dl.acm.org/doi/10.1145/3620678.3624663.

With a commitment to contributing to the open-source community, the ByteDance team decided to open source the Gödel Scheduler, offering a new scheduling solution that enhances cloud-native experiences for both online and offline services through its outstanding performance and comprehensive scheduling capabilities.

Future Work

Looking ahead, ByteDance is committed to the continual development of the Gödel Scheduler, focusing on enriching its features and enhancing its scalability. A significant area of attention will be optimizing the scheduling throughput in specific challenging scenarios, such as those involving high rates of deployment and frequent preemptions. Through innovative rescheduling strategies, ByteDance aims to tackle the intricate balance between maintaining scheduling performance and enhancing its quality. The overarching goal is to not only preserve the current scheduling throughput but also to substantially elevate the quality of scheduling.

Moreover, ByteDance places a high priority on ecosystem development. Efforts will be made to ensure Gödel Scheduler’s compatibility with leading systems and frameworks used in various business applications. This initiative will include integration with prominent big data and machine learning frameworks, accompanied by practical usage examples and comprehensive documentation.

To keep the community engaged and informed, a detailed roadmap for the Gödel Scheduler will be methodically laid out and made available on the Gödel Scheduler Repository. This will provide an opportunity for interested parties to track progress, contribute, and become active participants in the project.

While the Gödel Scheduler has undergone numerous iterations within ByteDance, been rigorously tested in various scenarios, and demonstrated its effectiveness, ByteDance acknowledges that there is still considerable potential for advancement in terms of generality and standardization. ByteDance warmly invites and encourages members of the community to join in the development of the Gödel Scheduler, believing that collaborative efforts will lead to even greater improvements and innovations.

Gödel Scheduler Project Repository: https://github.com/kubewharf/godel-scheduler