Super bot for Kubernetes clusters

Posted on February 14, 2023 by Vishal Anand + Utpal Mangla + Saurabh Agrawal + Luca Marchi

One stop shop messaging bot for monitoring, notifying and debugging anywhere, anytime.

Bots have been around for a while now and are used for a variety of purposes. The most common ones are notification receivers built on Incoming Webhooks, which are now considered legacy.

However, there is a need for a “Super Bot” that is designed exclusively for Kubernetes and serves as a one-stop shop for all requirements. It should be modern, based on open-source technologies (this Bot’s backend is), and not limited to notifications. Nor should it be a collection of Bots, each performing a single task in a different place (channel).

After considering a few options, we came across a Bot that is offered as one of the apps on a well-known messaging platform. Once we familiarized ourselves with its capabilities, it impressed us (we will see why), and hence we called it “Super Bot”. The Bot can monitor events on Kubernetes cluster(s) and notify users in real time. It also allows debugging Kubernetes cluster(s) and enables health checks, e.g., cluster health and connectivity, across multiple Kubernetes clusters (both public and private).

As SRE/DevOps personas, we examined what this Bot can really do.

Architecture

Let us first take a quick look at its architecture and at the Kubernetes resources and commands it supports.

As the high-level architecture diagram shows, the Bot integrates technology for monitoring, notification, and execution, which is the very reason we call it “Super Bot”. It can support multiple Kubernetes clusters, both public and private, at the same time.

The backend communicates with the Kubernetes API server to monitor Kubernetes events (note that it watches events specifically) and forwards them to communication mediums like Slack (or Mattermost). It also reads messages (commands) from users and sends back the response (output). The backend is installed on the Kubernetes cluster(s).
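The watch-and-forward flow described above can be pictured with a small sketch. This is not the Bot's actual code: the `ClusterEvent` type, the `WATCHED_KINDS` set, and the message format are all hypothetical, chosen only to illustrate how a backend might turn a cluster event into a chat notification and drop events for kinds it does not watch.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical subset of the resource kinds the backend watches.
WATCHED_KINDS = frozenset({"Pod", "Node", "Service", "Deployment", "Job"})


@dataclass(frozen=True)
class ClusterEvent:
    """Simplified stand-in for an event read from the Kubernetes API server."""
    cluster: str
    kind: str
    namespace: str
    name: str
    action: str  # e.g. "created", "updated", "deleted"


def to_notification(event: ClusterEvent) -> Optional[str]:
    """Return a chat message for a watched event, or None to drop the event."""
    if event.kind not in WATCHED_KINDS:
        return None
    return (
        f"[{event.cluster}] {event.kind} "
        f"{event.namespace}/{event.name} was {event.action}"
    )


if __name__ == "__main__":
    sample = ClusterEvent("openshift-1", "Pod", "chaos", "mysql-0", "deleted")
    print(to_notification(sample))
```

Including the cluster name in every message is one simple way to keep notifications readable when several clusters report into the same channel.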

It supports the following Kubernetes resources: pod, node, service, namespace, replicationcontroller, persistentvolume, persistentvolumeclaim, secret, configmap, deployment, daemonset, replicaset, ingress, job, role, rolebinding, clusterrole, clusterrolebinding.

It supports the following Kubernetes (debug) commands: get, top, cluster-info, describe, explain, logs, version, auth, api-resources, api-versions, diff.

It also supports Bot-specific commands that give users control over its behavior, as follows: notifier [stop/start/status/showconfig], ping --clustername <clustername>, filters list [enable/disable] <filtername>.
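Because the Bot accepts only a fixed set of read-oriented debug commands, a backend of this kind presumably checks each chat message against an allow list before executing anything. The sketch below illustrates that idea under stated assumptions: `ALLOWED_VERBS` is copied from the list of debug commands above, while `parse_command` and its return convention are hypothetical and not the Bot's real interface.

```python
import shlex
from typing import List, Optional

# Debug verbs listed in the post as supported by the Bot.
ALLOWED_VERBS = frozenset({
    "get", "top", "cluster-info", "describe", "explain", "logs",
    "version", "auth", "api-resources", "api-versions", "diff",
})


def parse_command(message: str) -> Optional[List[str]]:
    """Split a chat message into arguments.

    Returns the argument list when the first word is an allowed verb,
    or None when the message is empty, malformed, or not allowed.
    """
    try:
        args = shlex.split(message)
    except ValueError:
        # Raised by shlex for input such as unbalanced quotes.
        return None
    if not args or args[0] not in ALLOWED_VERBS:
        return None
    return args


if __name__ == "__main__":
    print(parse_command("get pods -n chaos"))
    print(parse_command("delete pod mysql-0"))
```

Rejecting unknown verbs up front means that a mutating command such as `delete` never reaches the cluster, whatever permissions the backend itself holds.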

Use Cases

Let us now set up the Bot and examine some use cases to understand its capabilities.

To start with, we installed and configured the Bot on a test messaging platform and deployed its backend on an OpenShift cluster through Helm.

We ran a command to check the Bot’s connectivity to the OpenShift cluster. It responded to ping with pong.
We ran several debugging commands to check the status of the OpenShift cluster, e.g., component status, the status of the scheduler, controller-manager, and etcd components of the control plane, and the status of pods in a given namespace.
We checked the CPU and memory consumption of each pod.
We checked the details of a deployment (Apache httpd, for example).
The Bot notified us in real time when a deployment was updated automatically and internally in our cluster, and again when a statefulset was updated automatically.
Next, we applied a bit of Chaos Engineering (part of SRE practices) by running one of the ChaosMonkeys we designed, which targets and kills only MySQL pods in a given namespace. These ChaosMonkeys are designed to create chaos at random in order to test and improve reliability, but here we triggered the chaos chart to test the response of the Super Bot.
We triggered the ChaosMonkey on the OpenShift cluster, and the Bot notified us of MySQL pods being deleted and re-created in the chaos namespace of the cluster, all within about 2 seconds (as measured).
This exercise also tested the resiliency and reliability of the MySQL service, and it confirmed that the Bot delivers notifications as expected.
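The target selection in such a ChaosMonkey can be sketched as follows. This is an illustration only, not our actual chart: the pod records are plain dictionaries, and the function name `pick_victim` and the `app` key are hypothetical. A real implementation would list pods through the Kubernetes API and then delete the chosen one.

```python
import random
from typing import Dict, List, Optional


def pick_victim(
    pods: List[Dict[str, str]],
    namespace: str,
    app: str,
    rng: random.Random,
) -> Optional[str]:
    """Pick one pod name at random among pods matching namespace and app.

    Returns None when no pod matches, so nothing outside the target set
    can ever be selected.
    """
    candidates = [
        pod["name"]
        for pod in pods
        if pod["namespace"] == namespace and pod["app"] == app
    ]
    return rng.choice(candidates) if candidates else None


if __name__ == "__main__":
    inventory = [
        {"name": "mysql-0", "namespace": "chaos", "app": "mysql"},
        {"name": "mysql-1", "namespace": "chaos", "app": "mysql"},
        {"name": "httpd-0", "namespace": "web", "app": "httpd"},
    ]
    print(pick_victim(inventory, "chaos", "mysql", random.Random()))
```

Passing the random generator in as a parameter keeps the selection reproducible in tests while remaining random in normal use.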

Summary

We were delighted with our experience as DevOps and SRE personas. This Bot can run on multiple Kubernetes clusters (we tried it with two OpenShift clusters). The Bot’s backend is open source and community-driven, which is great.

It is true that we can debug anywhere, anytime with this Bot. It is indeed a one-stop shop for taking care of Kubernetes clusters in several ways: event monitoring, notification, debugging, running health checks, and testing reliability, all in real time. Its capabilities could also be extended to enable closed-loop automation for corrective actions. We could also stop and start notifications as we chose.

Most importantly, it enables new ways of working.

Disclaimer: The views expressed here are personal.

Hyderabad, India