Ambassador post by Swastik Gour

Introduction

The emergence of Large Language Models (LLMs) such as GPT-4, Phi-2, BERT, and T5 has revolutionized natural language processing, powering high-end applications including chatbots, recommendation systems, and analytics. Yet the scale and complexity of LLM workloads make it challenging to guarantee performance and reliability. This is why monitoring and observability practices are essential when deploying these workloads with frameworks such as Ray.

Ray is a distributed computing framework that provides a powerful platform for scaling LLM workloads efficiently across clusters, which makes it an excellent choice for hosting, managing, and observing LLMs. Ray's built-in metrics, used in conjunction with Prometheus and Grafana, let users monitor critical metrics efficiently, optimize resource usage, and rapidly diagnose problems in production.

This article explores the importance of observability in Ray-hosted LLM workloads, key metrics to monitor, and a detailed guide to setting up observability using Prometheus and Grafana.

Why Ray for LLM Workloads?

Ray is designed for distributed, scalable applications, making it ideal for hosting and managing LLM workloads. Key features that make Ray an excellent choice include:

  - Distributed execution: tasks and actors are scheduled across a cluster, so inference and fine-tuning can scale beyond a single machine.
  - Ray Serve: a model-serving library with autoscaling, request batching, and model composition for deploying LLM endpoints.
  - Autoscaling: clusters grow and shrink with demand, keeping GPU and CPU costs under control.
  - Built-in observability: Ray exports metrics and ships a dashboard, and it integrates cleanly with Prometheus and Grafana.

These features make Ray not just a compute framework but a foundational tool for running, monitoring, and scaling LLMs in real-world applications.

Key Metrics for Observing Ray-Hosted LLM Workloads

To ensure the smooth operation of Ray-hosted LLM workloads, it’s critical to track a range of performance, resource utilization, and operational metrics. Below are the key categories:

Performance Metrics

  - Request latency (end-to-end and per-token) for inference calls.
  - Throughput: requests served and tokens generated per second.
  - Queue wait times for requests pending in Ray Serve deployments.

Resource Utilization Metrics

  - CPU, GPU, and memory utilization across head and worker nodes.
  - GPU memory consumed by model replicas.
  - Object store usage and spilling to disk.

Operational Metrics

  - Task and actor failures, retries, and restarts.
  - Node availability and autoscaler activity.
  - Health of Ray system components such as the Global Control Store (GCS).
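
To see exactly which metric names your Ray version exposes, you can inspect the head node's Prometheus endpoint directly. A quick check, assuming the KubeRay default metrics export port of 8080 and using placeholder pod and namespace names:

# Port-forward the Ray head's metrics endpoint (8080 is the KubeRay default).
kubectl port-forward <head-pod> -n raysvc 8080:8080

# In another terminal, list the exported Ray metric names.
curl -s http://localhost:8080/metrics | awk -F'[{ ]' '/^ray_/ {print $1}' | sort -u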

Setting Up Observability for Ray-Hosted Workloads

Observability in Ray involves using metrics to understand system performance and diagnose issues. By integrating Ray with Prometheus and Grafana, you can gain deep insights into workload behavior.

Step 1: Setting Up Prometheus Monitoring

Prometheus is an open-source monitoring system that collects metrics from Ray’s endpoints. Follow the guide below to set up Prometheus with Ray on Kubernetes.

Install Prometheus with KubeRay:

# Path: kuberay/
./install/prometheus/install.sh

# Check the installation
kubectl get all -n prometheus-system

Configure Pod and Service Monitors

Set up PodMonitor and ServiceMonitor resources to scrape metrics from Ray head and worker nodes:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-workers-monitor
  namespace: prometheus-system
  labels:
    release: prometheus
    ray.io/cluster: rayservice-sample-raycluster-bpkgv
spec:
  jobLabel: ray-workers
  namespaceSelector:
    matchNames:
      - raysvc
  selector:
    matchLabels:
      ray.io/node-type: worker
  podMetricsEndpoints:
    - port: metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: resume-analyzer-monitor
  namespace: prometheus-system
  labels:
    release: prometheus
spec:
  jobLabel: resume-analyzer
  namespaceSelector:
    matchNames:
      - raysvc
  selector:
    matchLabels:
      ray.io/node-type: head
  endpoints:
    - port: metrics
  targetLabels:
    - ray.io/cluster
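
After the monitors are created, it is worth confirming that Prometheus has discovered the Ray endpoints. A quick check, assuming the manifests above are saved locally (file names are illustrative) and that the install script deployed kube-prometheus-stack under the release name prometheus, whose Prometheus service is typically prometheus-kube-prometheus-prometheus:

# Apply the PodMonitor and ServiceMonitor manifests.
kubectl apply -f ray-pod-monitor.yaml -f ray-service-monitor.yaml

# Port-forward the Prometheus UI and check Status -> Targets for the Ray jobs.
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n prometheus-system 9090:9090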

Step 2: Configure Recording Rules

Recording rules allow you to precompute PromQL expressions for faster queries. For example, calculating the availability of the Ray Global Control Store (GCS):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ray-cluster-gcs-rules
  namespace: prometheus-system
  labels:
    release: prometheus
spec:
  groups:
  - name: ray-cluster-main-staging-gcs.rules
    interval: 30s
    rules:
    - record: ray_gcs_availability_30d
      expr: |
        (
          100 * (
            sum(rate(ray_gcs_update_resource_usage_time_bucket{container="ray-head", le="20.0"}[30d]))
            /
            sum(rate(ray_gcs_update_resource_usage_time_count{container="ray-head"}[30d]))
          )
        )

Explanation of the Expression:

  - ray_gcs_update_resource_usage_time_bucket{container="ray-head", le="20.0"} counts the GCS resource-usage updates whose recorded duration falls into the le="20.0" histogram bucket.
  - ray_gcs_update_resource_usage_time_count counts all such updates, regardless of duration.
  - The ratio of the two rates over a 30-day window, multiplied by 100, gives the percentage of updates that completed within that bucket. It is recorded as ray_gcs_availability_30d and used as a proxy for GCS availability.
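
Once the rule has been applied, the precomputed series can be queried like any other metric, for example through the Prometheus HTTP API (assuming Prometheus is still port-forwarded to localhost:9090 as in the target check above):

# Query the recorded GCS availability series.
curl -s 'http://localhost:9090/api/v1/query?query=ray_gcs_availability_30d'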

Step 3: Set Up Alerting Rules

Alerting rules help identify issues proactively. For example, to detect missing GCS metrics (note the distinct metadata.name below, so this PrometheusRule does not overwrite the recording-rule resource above):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ray-cluster-gcs-alert-rules
  namespace: prometheus-system
  labels:
    release: prometheus
spec:
  groups:
  - name: ray-cluster-main-staging-gcs.rules
    interval: 30s
    rules:
    - alert: MissingMetricRayGlobalControlStore
      expr: |
        absent(ray_gcs_update_resource_usage_time_count)
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Missing Ray GCS metrics"
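
To confirm that both the recording and alerting rules were loaded, you can list them through the Prometheus HTTP API (again assuming the localhost:9090 port-forward):

# List loaded rule groups and their rule names.
curl -s http://localhost:9090/api/v1/rules | grep -o '"name":"[^"]*"' | sort -u

# Show alerts that are currently pending or firing.
curl -s http://localhost:9090/api/v1/alerts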

Setting Up Grafana Dashboards

Grafana provides rich visualizations for metrics. Here’s how to set up dashboards for Ray:

Step 1: Capture Default Dashboards

Copy the default dashboards from the Ray head pod:

kubectl cp <head-pod>:/tmp/ray/session_latest/metrics/grafana/dashboards/ ./dashboards
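
One way to make these dashboards appear in Grafana automatically is to load them through the kube-prometheus-stack Grafana sidecar, which imports any ConfigMap labeled grafana_dashboard=1. A sketch, assuming the sidecar is enabled with its default label (the ConfigMap name is arbitrary):

# Package the copied dashboard JSON files into a ConfigMap.
kubectl create configmap ray-grafana-dashboards --from-file=./dashboards -n prometheus-system

# Label the ConfigMap so the Grafana sidecar discovers and imports it.
kubectl label configmap ray-grafana-dashboards grafana_dashboard=1 -n prometheus-system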

Step 2: Access the Grafana Dashboard

kubectl port-forward deployment/prometheus-grafana -n prometheus-system 3000:3000

Default login credentials (the kube-prometheus-stack defaults):

  - Username: admin
  - Password: prom-operator

After logging in, import the dashboard JSON files captured in Step 1, either through the Grafana UI (Dashboards -> Import) or via the sidecar ConfigMap shown above.

Enable Profiling in Ray Serve Pods

Profiling inference workloads relies on more advanced techniques for monitoring, debugging, and optimizing performance. This section digs into specific tools, configurations, and scenarios that extend your profiling capabilities.

Memory Profiling

Memory profiling is essential for detecting memory leaks and optimizing usage. With Memray, for example, you can trace memory allocations and understand the behavior of inference tasks. To enable memory profiling in Ray Serve pods, update the container's security context to allow tracing:

securityContext:
  capabilities:
    add:
    - SYS_PTRACE

Once configured, Memray can be used to generate memory usage reports, which can help identify high-memory-consuming tasks or bottlenecks in the system.
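
Memray can also attach to an already-running process, which is what the SYS_PTRACE capability above enables, but the simplest workflow is to trace a representative inference script end to end. A minimal sketch, assuming Memray can be pip-installed in the worker pod and that <worker-pod> and inference_script.py are placeholders:

# Install Memray inside the Ray worker pod.
kubectl exec -it <worker-pod> -n raysvc -- pip install memray

# Run a representative inference script under Memray and record allocations.
kubectl exec -it <worker-pod> -n raysvc -- python -m memray run -o /tmp/inference.bin inference_script.py

# Render the recorded trace as an HTML flame graph to spot heavy call paths.
kubectl exec -it <worker-pod> -n raysvc -- python -m memray flamegraph /tmp/inference.bin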

Example Use Case: a Ray Serve replica whose memory grows steadily under sustained load. A Memray trace of a representative inference run shows which call paths allocate the most memory, for instance oversized input batches, so you can cap batch sizes or release cached data before the replica runs out of memory.

CPU Profiling

For CPU profiling, tools such as gdb, lldb, or py-spy can be installed within the worker pods to collect detailed CPU usage data. These tools show which functions consume the most CPU time, enabling targeted optimizations.

To set up CPU profiling:

  1. Install gdb, lldb, or py-spy in the Ray worker pod.
  2. Use profiling scripts or tools to capture CPU usage snapshots during inference tasks (a py-spy sketch follows this list).
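
A minimal py-spy sketch for step 2, assuming py-spy can be pip-installed in the worker pod and that <worker-pod> and <pid> stand in for your pod name and the Serve replica's process ID (sampling another process also relies on the SYS_PTRACE capability configured earlier):

# Install py-spy inside the Ray worker pod.
kubectl exec -it <worker-pod> -n raysvc -- pip install py-spy

# Dump the current Python stacks of the replica process for a quick look.
kubectl exec -it <worker-pod> -n raysvc -- py-spy dump --pid <pid>

# Sample CPU usage for 60 seconds and write a flame graph.
kubectl exec -it <worker-pod> -n raysvc -- py-spy record --pid <pid> --duration 60 -o /tmp/cpu_profile.svg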

Example Use Case: latency spikes during bursts of inference requests. Sampling the Serve replica process with py-spy reveals whether tokenization, pre/post-processing, or the model's forward pass dominates CPU time, pointing you to the function worth optimizing.

End-to-End Profiling Example

Combining memory and CPU profiling gives you an end-to-end view of system performance. To illustrate, consider an LLM inference task that shows latency spikes. Correlating the memory and CPU profiles might reveal that:

  - Excessively large batches of input data are the main driver of memory usage.
  - CPU bottlenecks stem from inefficiencies in the tokenization functions.

Optimizing batch sizes and refactoring the bottleneck functions can then improve performance considerably.

Conclusion

Combining Ray's distributed execution with robust observability tooling helps teams get the performance, reliability, and scalability they need from LLM workloads. This article has walked through a practical way to set up and monitor LLM workloads on Ray. With proper observability in place, developers and operators can catch issues early, optimize resource usage, and keep improving the experience of users of their NLP applications.