Guest post originally published on the Fairwinds blog by Sarah Zelechoski, VP of engineering at Fairwinds
The truth is Kubernetes monitoring done right is a fantasy for most. It’s a problem magnified in a dynamic, ever-changing Kubernetes environment. And it is a serious problem.
While organizations commonly want availability insurance, few monitor their environments well for two main reasons:
- It’s hard for monitoring to keep up with changing environments.
- Monitoring configuration is often an afterthought: it isn’t set up until a problem occurs, and monitors are seldom updated as workloads change.
When the average organization finally recognizes its need for application/system monitoring, the team is too overwhelmed just trying to keep infrastructure and applications “up” to have the capacity to look out for issues. Even monitoring the right things to identify the problems the application or infrastructure is facing on a day-to-day basis is beyond the reach of many organizations.
Consequences of Inadequate Monitoring in Kubernetes
You’ll face a number of consequences without adequate monitoring (some are universal, others are amplified in Kubernetes).
- Without the right monitoring, operations can be interrupted.
- Your SRE team may be unable to respond to issues (or the right issues) as fast as needed.
- Monitoring configuration falls out of step with the actual state of clusters and workloads, which it must reflect.
- Manual configuration increases availability and performance risks because monitors may be missing, or not accurate enough to fire when key performance indicators (KPIs) change.
- Undetected issues may cause SLA breaches.
- Incorrect monitor settings can result in noisy pagers.
Insufficient monitoring introduces a lot of manual work because you need to constantly check systems to ensure they are in the state that you want.
Kubernetes Best Practices for Monitoring and Alerting
What’s needed is monitoring and alerting that discovers unknown unknowns – otherwise referred to as observability. Kubernetes best practices start with recognizing that monitoring is key and that it takes the right tools to get the most out of it. What needs to be monitored and why? Here we suggest a few best practices.
Create your Monitoring Standards
With Kubernetes, you have to build monitoring systems and tooling to respond to the dynamic nature of the environment. Thus, you will want to focus on availability and workload performance. One typical approach is to collect all of the metrics you can and then use those metrics to try to solve any problem that occurs. This makes the operators’ jobs more complex, because they need to sift through an excess of information to find what they really need. Open source tools like Prometheus and OpenMetrics help standardize how to collect and display metrics. We suggest that Kubernetes best practices for monitoring include watching for:
- Kubernetes deployment with no replicas
- Horizontal Pod Autoscaler (HPA) scaling issues
- Host disk usage
- High IO wait times
- Increased network errors
- Increase in crashed pods
- Unhealthy Kubelets
- nginx config reload failures
- Nodes that are not ready
- Large number of pods that are not in a Running state
- External-DNS errors registering records
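To make a few of these checks concrete, here is a minimal sketch in Python. It is not tied to Prometheus or any other tool: `ClusterSnapshot`, `find_issues`, and the 20% threshold are hypothetical names and values chosen for illustration, and the snapshot is assumed to be filled in by whatever collector you already run.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ClusterSnapshot:
    """Hypothetical point-in-time view of a cluster, filled in by your own collector."""
    deployment_replicas: dict[str, int] = field(default_factory=dict)  # deployment -> available replicas
    node_ready: dict[str, bool] = field(default_factory=dict)          # node -> Ready condition
    pod_phases: dict[str, str] = field(default_factory=dict)           # pod -> phase, e.g. "Running"


def find_issues(snapshot: ClusterSnapshot, max_not_running_ratio: float = 0.2) -> list[str]:
    """Return human-readable findings for three of the conditions listed above."""
    issues = []

    # Deployments with no replicas
    for name, replicas in sorted(snapshot.deployment_replicas.items()):
        if replicas == 0:
            issues.append(f"deployment {name} has no replicas")

    # Nodes that are not ready
    for node, ready in sorted(snapshot.node_ready.items()):
        if not ready:
            issues.append(f"node {node} is not ready")

    # Large share of pods not in a Running state (threshold is an assumption;
    # note that completed Job pods also count as not Running here)
    total = len(snapshot.pod_phases)
    not_running = sum(1 for phase in snapshot.pod_phases.values() if phase != "Running")
    if total and not_running / total > max_not_running_ratio:
        issues.append(f"{not_running} of {total} pods are not Running")

    return issues
```

Keeping the standard as a short, explicit list of checks like this is the opposite of the collect-everything approach: each finding maps to one item an operator has agreed is worth looking at.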
Implement Monitoring as Code
Part of the genius of Kubernetes is that you can implement infrastructure as code (IaC) – the process of managing your IT infrastructure using config files. At Fairwinds, we take this a step further by implementing monitoring as code. We use Astro, an open source software project built by our team, to help achieve better productivity and cluster performance. Astro was built to work with Datadog. It watches objects in your cluster for defined patterns and manages monitors based on this state. As a controller that runs in a Kubernetes cluster, it subscribes to updates within the cluster. If a new Kubernetes deployment or other object is created in a cluster, Astro knows about it and creates monitors based on that state in your cluster.
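The idea behind monitoring as code can be sketched as a small reconcile step: derive the monitors you want from cluster state, compare them with the monitors that exist, and act on the difference. The Python below is an illustration only, not Astro’s implementation or configuration format; the `monitored` label, the `-no-replicas` naming, and both function names are assumptions made for the example.

```python
from __future__ import annotations


def desired_monitors(deployments: dict[str, dict[str, str]],
                     opt_in_label: str = "monitored") -> set[str]:
    """Derive monitor names from deployments that carry a (hypothetical) opt-in label."""
    return {
        f"{name}-no-replicas"
        for name, labels in deployments.items()
        if labels.get(opt_in_label) == "true"
    }


def reconcile(desired: set[str], existing: set[str]) -> tuple[list[str], list[str]]:
    """Return (monitors to create, monitors to delete) so existing matches desired."""
    to_create = sorted(desired - existing)
    to_delete = sorted(existing - desired)
    return to_create, to_delete


# Example: "api" is new, "old" was removed from the cluster, "batch" never opted in.
deployments = {
    "api": {"monitored": "true"},
    "web": {"monitored": "true"},
    "batch": {},
}
existing = {"web-no-replicas", "old-no-replicas"}
to_create, to_delete = reconcile(desired_monitors(deployments), existing)
```

Because the desired set is recomputed from cluster state every time, monitors follow workloads as they are created and removed, rather than waiting for someone to remember to update them.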
Identify Ownership
Because a diverse set of stakeholders is involved in monitoring cluster workloads, you must determine who is responsible for what from both an infrastructure and a workload standpoint. For instance, you want to make sure the right people are alerted at the right time, and to limit the noise of alerts about things that do not pertain to them.
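One way to write ownership down is as a routing table that every alert passes through. The sketch below assumes alerts arrive as simple dictionaries with a `scope` and a `namespace`; the team names, namespaces, and the `route_alert` function are all hypothetical.

```python
from __future__ import annotations

# Hypothetical ownership map: namespace -> owning team. Replace with your own.
WORKLOAD_OWNERS = {"checkout": "payments-team", "search": "discovery-team"}
INFRA_OWNER = "platform-team"      # assumed owner of node- and cluster-level alerts
FALLBACK_OWNER = "on-call-triage"  # assumed catch-all when no owner is recorded


def route_alert(alert: dict[str, str]) -> str:
    """Return the team that should be notified for this alert."""
    if alert.get("scope") == "infrastructure":
        return INFRA_OWNER
    # Workload alerts go to whoever owns the namespace, if anyone does.
    return WORKLOAD_OWNERS.get(alert.get("namespace", ""), FALLBACK_OWNER)
```

An explicit fallback is a deliberate choice here: an alert with no recorded owner still reaches someone, and its appearance in the triage queue is a prompt to assign ownership.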
Move Beyond Tier 1 to Tier 2 Monitoring
Monitoring tooling must be flexible enough to meet complex demands, yet easy enough to set up quickly so that we can move beyond tier 1 monitoring (e.g., “Is it even working?”). Tier 2 monitoring requires dashboards that reveal where security vulnerabilities are, whether or not compliance standards are being met, and targeted ways to improve.
Define Urgent
Impact and urgency are key criteria that must be identified and assessed on an ongoing basis. Regarding impact, it is critical to be able to determine if an alert is actionable, the severity based on impact, and the number of users or business services that are or will be affected. Urgency also comes into play. For example, does the problem need to be fixed right now, in the next hour, or in the next day?
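These criteria can be captured as a small decision function so that the same questions are asked of every alert. This is a sketch under stated assumptions: the `Response` levels mirror the “right now, next hour, next day” example above, while the `classify` function, its parameters, and the 100-user paging threshold are hypothetical and should be tuned to your own services.

```python
from enum import Enum


class Response(Enum):
    PAGE_NOW = "fix right now"
    NEXT_HOUR = "fix in the next hour"
    NEXT_DAY = "fix in the next day"
    NO_ALERT = "do not alert"


def classify(actionable: bool, users_affected: int, service_down: bool,
             page_threshold: int = 100) -> Response:
    """Map impact (actionability, users, service state) to an urgency level."""
    if not actionable:
        # Nobody can do anything about it, so it should not wake anyone up.
        return Response.NO_ALERT
    if service_down or users_affected >= page_threshold:
        return Response.PAGE_NOW
    if users_affected > 0:
        return Response.NEXT_HOUR
    return Response.NEXT_DAY
```

For example, `classify(actionable=True, users_affected=5, service_down=False)` returns `Response.NEXT_HOUR`, while any non-actionable condition is filtered out before it can reach a pager.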
It is difficult to always know what to monitor ahead of time, so you need at least enough context to figure out what’s going wrong when someone inevitably gets woken up in the middle of the night and needs to bring everything back online. Without this level of understanding, your team cannot work out what should be monitored, or know when to grin and bear it and turn on an alert.
Read in-depth insights into how to optimize monitoring and alerting capabilities in a Kubernetes environment.
Learn more about Kubernetes Best Practices by visiting https://www.fairwinds.com/.