As the de facto standard for cloud-native batch computing, Volcano has been widely adopted across scenarios including AI, Big Data, and High-Performance Computing (HPC). With over 800 contributors from more than 30 countries and tens of thousands of code commits, Volcano has been deployed in production by over 60 enterprises worldwide, providing the industry with proven practices and solutions for cloud-native batch computing.
As user scenarios grow increasingly complex, especially with the rise of large language models (LLMs), demand has intensified for performance, GPU resource utilization, and availability in both training and inference workloads. This has driven Volcano to continuously expand its capabilities and address core user needs. Over the course of 28 releases, Volcano has introduced a series of enhancements and optimizations tailored to batch computing scenarios, helping users migrate their workloads to cloud-native platforms smoothly. These improvements have resolved numerous pain points, earning widespread praise and fostering a vibrant community with over 30 approvers and reviewers, creating a win-win ecosystem.
The new release of Volcano marks a new milestone for 2025: the community is introducing a series of major features that deepen its focus on areas such as Cloud Native AI (CNAI) and Big Data. Key features include:
AI Scenarios:
- Network Topology-Aware Scheduling: Reduces network communication overhead between training tasks, optimizing performance for large AI model training.
- NPU Scheduling and Virtualization: Enhances NPU resource utilization.
- GPU Dynamic Partitioning: Introduces MIG and MPS dynamic partitioning to improve GPU resource utilization.
- Volcano Global for Multi-Cluster AI Job Scheduling: Supports multi-cluster AI job deployment and distribution.
- Checkpointing and Fault Recovery Optimization: Enables finer-grained job restart policies.
- Dynamic Resource Allocation (DRA): Supports flexible and efficient management of heterogeneous resources.
Big Data Scenarios:
- Elastic Hierarchical Queues: Facilitates smooth migration of Big Data workloads to cloud-native platforms.
Microservices Scenarios:
- Online and Offline Workload Colocation with Dynamic Resource Oversubscription: Boosts resource utilization while ensuring QoS for online workloads.
- Load-Aware Scheduling and Descheduling: Provides resource defragmentation and load balancing capabilities.
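To make the network topology-aware scheduling feature above more concrete, the sketch below pairs a HyperNode (describing one tier of the data-center network) with a Volcano Job constrained to schedule within that tier. The field names (`tier`, `members`, `networkTopology`, `highestTierAllowed`) reflect our reading of the v1.11 APIs and are assumptions; consult the release notes for the authoritative schema.

```yaml
# Sketch: a leaf HyperNode grouping two nodes under the same switch
# (field names assumed from the v1.11 HyperNode API; verify against the docs).
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: hypernode-s0
spec:
  tier: 1                      # lowest tier = tightest network affinity
  members:
  - type: Node
    selector:
      exactMatch:
        name: node-0
  - type: Node
    selector:
      exactMatch:
        name: node-1
---
# Sketch: a training job that asks to stay within a single tier-1 HyperNode
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-training
spec:
  schedulerName: volcano
  minAvailable: 2
  networkTopology:
    mode: hard                 # hard constraint: do not cross the allowed tier
    highestTierAllowed: 1
```

Keeping all pods of a job under the lowest feasible tier minimizes cross-switch traffic, which is where the training performance gains come from.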
The official release of Volcano v1.11 marks a new chapter in cloud-native batch computing! This update focuses on the core needs of AI and Big Data, introducing network topology-aware scheduling and multi-cluster AI job scheduling to significantly enhance the performance of AI training and inference. In addition, online and offline workload colocation with dynamic resource oversubscription and load-aware descheduling further improve resource utilization while ensuring high availability for online services, and the new elastic hierarchical queues offer more flexible scheduling strategies for Big Data scenarios.
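For the Big Data side, an elastic hierarchical queue might be declared as in the sketch below, where a child queue names its parent and a deserved share of resources it can reclaim when borrowed capacity is needed back. The `parent`, `reclaimable`, and `deserved` fields reflect our understanding of the v1.11 capacity scheduling API and should be verified against the official documentation.

```yaml
# Sketch: a child queue under the built-in root queue
# (assumed v1.11 fields: parent, deserved)
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: dept-a
spec:
  parent: root            # attach under the root of the queue hierarchy
  reclaimable: true       # idle resources may be lent to sibling queues
  deserved:               # the share this queue is entitled to reclaim back
    cpu: "16"
    memory: 32Gi
```

This mirrors the parent/child queue model familiar from Big Data resource managers such as YARN, which is what eases migration to cloud-native platforms.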
Deep Dive into Key Features
Visit https://volcano.sh/en/blog/volcano-1.11.0-release/ to dive deep into Volcano v1.11 key features.
Experience Volcano v1.11.0 now and step into a new era of efficient computing!
Volcano v1.11.0 release: https://github.com/volcano-sh/volcano/releases/tag/v1.11.0
Acknowledgments
Volcano v1.11.0 includes contributions from 39 community members. Special thanks to all contributors:
@QingyaFan | @JesseStutler | @bogo-y |
@bibibox | @zedongh | @archlitchi |
@dongjiang1989 | @william-wang | @fengruotj |
@SataQiu | @lowang-bh | @Rui-Gan |
@xovoxy | @wangyang0616 | @PigNatovsky |
@Yanping-io | @lishangyuzi | @hwdef |
@bood | @kerthcet | @WY-Dev0 |
@raravena80 | @SherlockShemol | @zhifanggao |
@conghuhu | @MondayCha | @vie-serendipity |
@Prepmachine4 | @Monokaix | @lengrongfu |
@jasondrogba | @sceneryback | @TymonLee |
@liuyuanchun11 | @Vacant2333 | @matbme |
@lekaf974 | @kursataktas | @lut777 |