Member post originally published on SuperOrbital’s blog by Keegan McCallum

Introduction

In the previous parts of this blog post series, we explored the fundamentals of using the NVIDIA Device Plugin to manage GPU resources in Kubernetes clusters (Part 1) and dove into advanced configuration and troubleshooting techniques (Part 2). We emphasized the significance of efficient GPU management in Kubernetes environments to support large-scale machine learning workloads.

In this final part, we are going to address the issue of GPU underutilization in Kubernetes and discuss a few innovative projects and solutions aimed at optimizing resource usage and cost-effectiveness. By maximizing GPU utilization, organizations can unlock the full potential of their GPU investments by getting the most out of each GPU.

The Problem of GPU Underutilization in Kubernetes

After reading this series, you’ve got the nvidia-device-plugin running on your cluster and you’ve spun up some GPU nodes. Your code is recognizing the CUDA devices and working, so things are good, right? Well, not quite. Actually using anywhere near 100% of your GPUs, especially with diverse, heterogeneous workloads, is non-trivial (in fact, OpenAI estimates only about 33% usage). It’s also an important issue to solve: given the extremely high cost of GPUs, you’re leaving a non-trivial amount of money on the table if you aren’t considering utilization.

Okay, so why is it so darn hard to use 100% of your GPUs? There are a few factors:

  1. Static partitioning: GPUs are statically partitioned into fixed-size chunks, which may not align with the varying resource requirements of different workloads.
  2. Workload variability: The dynamic nature of workloads, with fluctuating resource demands, can result in periods of underutilization when GPUs are not fully utilized. Basically, even if each workload perfectly fits a static partition, there will likely be one GPU partially utilized, and this compounds in real-world workloads.
  3. Naive scheduling: If you naively rely on the default Kubernetes scheduler configuration, you will likely face issues with bin-packing. Imagine a scenario with 10 nodes and 8 GPUs on each, for a total of 80 GPUs. If no affinity or taints are specified, the default scoring tends to spread GPU pods across nodes, leading to a situation where there may be a handful of unused GPUs on each node. If you then need to schedule a single pod using 8 GPUs, you may not be able to, even though there are 10 or more GPUs sitting idle. As of Kubernetes 1.24, you can use the MostAllocated scoring strategy to solve this problem: it scores nodes based on resource utilization, favoring the ones with higher allocation. You can set a weight for each resource type to control how much it influences the final score. An example configuration that only considers GPUs would look something like:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated
        resources:
          - name: nvidia.com/gpu
            weight: 1
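To actually use a profile like this, you would typically pass the file to kube-scheduler via its --config flag. On managed control planes where you can’t reconfigure the default scheduler, a common pattern is to run a second scheduler with this profile and point GPU pods at it via spec.schedulerName; the exact setup depends on your distribution.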

Now that we’ve identified the problem and the reasons it might occur, the first step is to set up monitoring to get some visibility when it’s happening.

Monitoring GPU Utilization using dcgm-exporter & Prometheus

We’ve touched on using dcgm-exporter and Prometheus to monitor GPU health in (Part 2); if you haven’t read it, it’s worth a look. Now, let’s talk about how to monitor GPU utilization, because it’s actually not well documented. dcgm-exporter exports a lot of metrics, and it can be difficult to understand exactly which one you are looking for. At first glance, it’d be understandable to reach for DCGM_FI_DEV_GPU_UTIL as a straightforward way to monitor GPU utilization, because, well, it’s documented as being “gpu utilization”. Buuuuut, you’d be wrong: DCGM_FI_DEV_GPU_UTIL is outdated, has several limitations, and isn’t compatible with MIG. DCGM_FI_PROF_GR_ENGINE_ACTIVE is roughly equivalent, supports MIG, and has higher precision.

DCGM_FI_PROF_GR_ENGINE_ACTIVE will give you a high-level understanding of whether there are kernels running on any streaming multiprocessor (SM) of the GPU, which is a great start to understanding whether the GPU is being used at all. You may want to go deeper, especially if you are deploying your own CUDA code and want to squeeze as much performance out of the GPU as possible (this post is a great primer on the CUDA programming model). DCGM_FI_PROF_SM_ACTIVE will tell you what fraction of the GPU’s SMs are active, giving you a lower-level look at the utilization of the actual SMs on a given GPU. You can go one step further if you’re feeling brave and use DCGM_FI_PROF_SM_OCCUPANCY to get the ratio of warps resident on an SM relative to the theoretical maximum the SM supports (for example, 64 warps per SM on A100-class GPUs). If that made your eyes glaze over, start with DCGM_FI_PROF_GR_ENGINE_ACTIVE and dig in further when the need arises.
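To make this actionable, here is a minimal sketch of a Prometheus alerting rule built on DCGM_FI_PROF_GR_ENGINE_ACTIVE. It assumes the prometheus-operator PrometheusRule CRD is available and that dcgm-exporter is configured to export the profiling metrics; the label names (gpu, Hostname), the 20% threshold, and the 30-minute window are illustrative assumptions to adapt to your own deployment.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-underutilization
  namespace: monitoring
spec:
  groups:
  - name: gpu-utilization
    rules:
    # Fires when a GPU's graphics engine has been mostly idle for half an hour.
    # The label names (gpu, Hostname) and the 0.2 threshold are assumptions;
    # check the labels your dcgm-exporter deployment actually emits.
    - alert: GPUUnderutilized
      expr: avg by (gpu, Hostname) (DCGM_FI_PROF_GR_ENGINE_ACTIVE) < 0.2
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} has been under 20% engine activity for 30 minutes"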

Projects and Solutions for Improving GPU Utilization

We’re now aware of GPU underutilization, and have visibility into the problem for our clusters. Let’s take a whirlwind tour through some projects that can help us maximize our GPU utilization. I will note that this list is rather sparse, but the following solutions are emerging to tackle this costly problem.

Nvidia Dynamic Resource Allocation (DRA)

First up is DRA. NVIDIA’s DRA resource driver for GPUs is currently under active development and not yet suitable for production. I’m confident this will be the project to use once it’s ready, but not just yet. As the name suggests, DRA aims to tackle the static partitioning problem by dynamically partitioning GPUs using MIG and/or MPS, based on the workload launched. Taking a slightly different approach than prior work, the configuration is modeled after PersistentVolumeClaims (PVCs) rather than pod resource requests/limits: you first create claims for resources, and then reference those claims from your workloads to use them in a given pod. For MPS-based partitioning, the configuration is as follows:

---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  namespace: sharing-demo
  name: gpu-mps-sharing
spec:
  resourceClassName: gpu.nvidia.com
  parametersRef:
    apiGroup: gpu.resource.nvidia.com
    kind: GpuClaimParameters
    name: gpu-mps-sharing
---
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: sharing-demo
  name: gpu-mps-sharing
spec:
  sharing:
    strategy: MPS
    mpsConfig:
      defaultActiveThreadPercentage: 50
      defaultPinnedDeviceMemoryLimit: 10Gi
      # defaultPerDevicePinnedMemoryLimit:
      #   0: 5Gi
---
apiVersion: v1
kind: Pod
metadata:
  namespace: sharing-demo
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1-ubuntu18.04
    args: ["--benchmark", "--numbodies=4226000"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimName: gpu-mps-sharing
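Note that, at the time of writing, running an example like this requires a cluster with the DynamicResourceAllocation feature gate enabled (the resource.k8s.io API is still alpha) and NVIDIA’s DRA resource driver installed; the exact API versions shown above are likely to change as the feature matures.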

You can also check out this example to see what this would look like using both MIG and MPS. Keep an eye on this project because one day soon it will be the answer to your underutilization woes.

Nebuly OS (nos)

…until then, there’s Nebuly OS to save the day! It has the opposite problem: it’s ready for production and actively being used, but from the looks of the repo and docs (which now point to a completely different project), it’s being put into maintenance mode while the company behind it pivots. Luckily, docs are still available via GitHub, and the project can fill the gaps while we patiently wait for DRA to be ready for prime time.

Nebuly OS takes the more intuitive approach to dynamic GPU partitioning and partitions GPUs based on resource requests. The desired amount of memory is encoded directly in the resource name used for requests/limits, rather than being expressed through a separate key the way typical requests/limits work. There are also a few extra requirements when using MPS. For MPS, the configuration is as follows:

apiVersion: v1
kind: Pod
metadata:
  name: mps-partitioning-example
spec:
  hostIPC: true # NOTE: for MPS hostIPC must be set to true
  securityContext:
    runAsUser: 1000 # NOTE: for MPS containers must run as the same user as the MPS Server
  containers:
    - name: sleepy
      image: "busybox:latest"
      command: ["sleep", "120"]
      resources:
        limits:
          nvidia.com/gpu-10gb: 1 # requests a slice of the GPU with 10 GB of memory

MIG is a little simpler:

apiVersion: v1
kind: Pod
metadata:
  name: mig-partitioning-example
spec:
  containers:
    - name: sleepy
      image: "busybox:latest"
      command: ["sleep", "120"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1

Again, this project is in maintenance mode, but it’s production-ready. So if you need a solution for dynamic GPU partitioning NOW, this is your tool. Just be prepared to migrate over to DRA once it’s ready for prime time.

Conclusion

Throughout this series of blog posts, we’ve explored the importance of efficient GPU management in Kubernetes environments. From the fundamentals of using the NVIDIA Device Plugin to advanced configuration and troubleshooting techniques, we have covered key aspects of GPU resource management.

In this final part, we focused on the critical issue of GPU underutilization and discussed techniques as well as solutions aimed at addressing this challenge. By leveraging tools like dcgm-exporter and Prometheus for monitoring GPU utilization and exploring solutions like DRA and Nebuly OS, organizations can optimize their GPU resources and achieve better performance and cost-efficiency.

As the adoption of GPU-accelerated workloads continues to grow, it is crucial for organizations to stay informed about the latest developments and advancements in GPU management techniques. By embracing innovative solutions and best practices, organizations can unlock the full potential of their GPU investments to drive the success of their AI initiatives.

Further Reading and Resources