Guest post originally published on Particule’s blog by Kevin Lefevre, CTO & Co-founder at Particule

Introduction

In this article we are going to look at the limitations of a Prometheus-only monitoring stack and see why moving to a Thanos-based stack can improve metrics retention and also reduce overall infrastructure costs.

The content used for this demo is available here and there and is subject to the respective licenses.

Kubernetes Prometheus Stack

When deploying Kubernetes infrastructure for our customers, it is standard to deploy a monitoring stack on each cluster. This stack often comprises several components: the Prometheus Operator, Prometheus itself, Alertmanager, Grafana, node-exporter, and kube-state-metrics.

The simplified architecture is the following:

Kubernetes Prometheus Stack architecture

Caveats

This architecture has some caveats and does not scale out well when increasing the number of clusters from which you want to get metrics.

Multiple Grafana

In this setup, each cluster has its own Grafana instance with its own set of dashboards, which can be a pain to maintain.

Storing metrics is expensive

Prometheus stores metrics on disk, so you have to make a trade-off between storage space and metric retention time. If you want to keep data for a long time and are running on a cloud provider, block storage can get expensive once you store terabytes of data on it. Also, in production environments Prometheus is often run with replication or sharding, or both, which can double or even quadruple your storage needs.
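As a rough, hypothetical sizing sketch (the Prometheus documentation cites on the order of 1-2 bytes per sample on disk), a single Prometheus ingesting 100,000 samples per second with 30 days of retention already needs close to 400 GB of block storage per replica. Expressed with the same kube-prometheus-stack values used later in this article, that would look roughly like this:

# Hypothetical sizing sketch: 100k samples/s * ~1.5 bytes/sample * 30 days ≈ 390 GB
prometheus:
  prometheusSpec:
    retention: 30d            # keep 30 days of metrics on disk
    retentionSize: "400GB"    # hard cap before the oldest blocks are dropped
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 450Gi  # leave some headroom above retentionSize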

Solutions

Multiple Grafana Datasource

It is possible to expose Prometheus endpoints on the external network and to add them as data sources in a single Grafana. You just need to implement security on the external Prometheus endpoints, for example with mutual TLS, or TLS and basic auth. The drawback of this solution is that you cannot make calculations across different data sources.
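For illustration, such a setup can be wired with Grafana's standard data source provisioning format; the sketch below is only an example (hostnames, user names and environment variables are made up) showing basic auth plus a client certificate:

# grafana/provisioning/datasources/clusters.yaml (illustrative values only)
apiVersion: 1
datasources:
  - name: prometheus-cluster-a            # one data source per remote cluster
    type: prometheus
    access: proxy
    url: https://prometheus.cluster-a.example.com
    basicAuth: true
    basicAuthUser: grafana
    jsonData:
      tlsAuth: true                       # present a client certificate (mutual TLS)
      tlsAuthWithCACert: true
    secureJsonData:
      basicAuthPassword: $CLUSTER_A_PASSWORD   # interpolated from the environment
      tlsCACert: $CLUSTER_A_CA                 # PEM content of the CA certificate
      tlsClientCert: $CLUSTER_A_CERT
      tlsClientKey: $CLUSTER_A_KEY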

Prometheus Federation

Prometheus federation allows one Prometheus to scrape other Prometheuses. This solution works well when you are not scraping a lot of metrics, but at scale, if scraping all your Prometheus targets takes longer than the scrape interval, you may run into severe issues.
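As a reminder of what federation looks like, the central Prometheus scrapes the /federate endpoint of the others with match[] selectors; a minimal sketch, where the target hostname and the selectors are placeholders:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 60s
    honor_labels: true                     # keep the original job/instance labels
    metrics_path: /federate
    scheme: https
    params:
      'match[]':
        - '{job="kube-state-metrics"}'     # only pull selected series
        - '{__name__=~"node_.*"}'
    static_configs:
      - targets:
          - prometheus.cluster-a.example.com:443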

Prometheus remote write

While remote write is a solution (and is also implemented by Thanos Receiver), we will not discuss the “pushing metrics” part in this article. You can read about the pros and cons of pushing metrics here. Pushing metrics is recommended as a last resort, or when you do not trust the clusters or tenants (for example when building a Prometheus-as-a-service offering). This might be the topic of a future article, but here we will focus on scraping.
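For completeness, pushing metrics boils down to a remote_write block on each Prometheus pointing at a receiver such as Thanos Receive; a minimal sketch, where the URL and certificate paths are placeholders:

remote_write:
  - url: https://thanos-receive.example.com/api/v1/receive
    tls_config:
      cert_file: /etc/prometheus/secrets/tls/client.crt   # client certificate for mTLS
      key_file: /etc/prometheus/secrets/tls/client.key
    queue_config:
      max_samples_per_send: 2000   # tune batching for larger setups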

Enter Thanos


Thanos is an “open source, highly available Prometheus setup with long term storage capabilities”. It is used by many well-known companies and is part of the CNCF incubating projects.

One of the main features of Thanos is that it allows for “unlimited” storage. How so? By using object storage (such as S3), which almost every cloud provider offers. If running on-premises, object storage can be provided with solutions like Rook or MinIO.
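Thanos components learn about the bucket through a small objstore configuration, usually mounted from a Kubernetes Secret; for S3 it looks roughly like this (bucket name and region are examples):

type: S3
config:
  bucket: thanos-metrics-demo           # example bucket name
  endpoint: s3.eu-west-1.amazonaws.com
  region: eu-west-1
  # credentials can come from the node role or IRSA instead of static keys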

How does it work?

Thanos runs alongside Prometheus. It is common to start with a Prometheus-only setup and upgrade to a Thanos one later.

Thanos is split into several components, each with a single goal (as every service should be 🙂). The components communicate with each other through gRPC.

Thanos Sidecar
Thanos sidecar running alongside Prometheus and exporting metrics to an object storage

Thanos runs alongside Prometheus (as a sidecar) and exports Prometheus metrics to object storage every 2 hours. This allows Prometheus to be almost stateless. Prometheus still keeps 2 hours' worth of metrics in memory, so you might still lose 2 hours' worth of metrics in case of an outage (this is a problem that should be handled by your Prometheus setup, with HA/sharding, not by Thanos).

The Thanos sidecar is available out of the box with the Prometheus Operator and kube-prometheus-stack and can be deployed easily. This component acts as a store for Thanos Query.
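With the Prometheus Operator, enabling the sidecar essentially comes down to filling the thanos section of the Prometheus custom resource; a trimmed-down sketch (the Secret and resource names are examples):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus-stack-prometheus
spec:
  replicas: 1
  retention: 2d
  thanos:
    objectStorageConfig:        # references the objstore.yml shown above
      name: thanos-objstore     # example Secret name
      key: objstore.yml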

Thanos Store

Thanos Store acts as a gateway, translating queries to the remote object storage. It can also cache some information on local storage. Basically, this is the component that allows you to query an object store for metrics. It also acts as a store for Thanos Query.
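Under the hood, the store gateway is just the thanos binary started in store mode, pointed at the same objstore configuration and given a local directory for its cache; a sketch of the container arguments (the image tag is only an example):

containers:
  - name: storegateway
    image: quay.io/thanos/thanos:v0.18.0
    args:
      - store
      - --objstore.config-file=/etc/thanos/objstore.yml  # same bucket config as the sidecar
      - --data-dir=/var/thanos/store                     # local cache directory
      - --grpc-address=0.0.0.0:10901                     # StoreAPI endpoint used by Thanos Query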

Thanos Compactor

Thanos Compactor is a singleton (it is not scalable) responsible for compacting and downsampling the metrics stored in an object store. Downsampling means losing granularity on your metrics over time. For example, you may want to keep your metrics for 2 or 3 years, but you do not need as many data points as for your metrics from yesterday. This is what the compactor is for: it saves you bytes on your object storage and therefore saves you money.
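With the Bitnami Thanos chart used later in this article, retention per resolution is exposed as plain Helm values; for example (the durations below are arbitrary):

compactor:
  enabled: true
  retentionResolutionRaw: 30d   # raw samples kept 30 days
  retentionResolution5m: 90d    # 5-minute downsampled blocks kept 90 days
  retentionResolution1h: 2y     # 1-hour downsampled blocks kept 2 years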

Thanos Query

Thanos Query is the main component of Thanos: it is the central point to which you send PromQL queries. It exposes a Prometheus-compatible endpoint and dispatches the query to all of its “stores”. Keep in mind that a store may be any other Thanos component that serves metrics. Thanos Query can dispatch a query to another Thanos Query (queriers can be stacked on top of each other), to a Thanos Store, or to a Thanos Sidecar.

Thanos Query is also responsible for deduplicating metrics if the same metrics come from different stores or Prometheuses. For example, if a metric is present in a Prometheus and also in an object store, Thanos Query can deduplicate it. Deduplication also works across Prometheus replicas and shards in the case of an HA Prometheus setup.
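Deduplication relies on telling the querier which labels identify a replica, so it can merge series that differ only by those labels. In a raw deployment this is done with --query.replica-label flags; a sketch of the container arguments (the store DNS name is a placeholder, and the label name must match the external label your Prometheus setup adds, here the prometheus_replica label visible in the logs later in this article):

args:
  - query
  - --grpc-address=0.0.0.0:10901
  - --query.replica-label=prometheus_replica   # merge series across Prometheus replicas
  - --store=dnssrv+_grpc._tcp.thanos-storegateway.monitoring.svc   # discover stores via DNS SRV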

Thanos Query Frontend

As its name suggests, Thanos Query Frontend acts as a frontend for Thanos Query. Its goal is to split large queries into multiple smaller queries and to cache the query results (either in memory or in Memcached).
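In practice, the frontend sits in front of the querier and is configured with a split interval and a cache backend; a minimal sketch of its arguments (the downstream URL and cache config file path are placeholders):

args:
  - query-frontend
  - --http-address=0.0.0.0:10902
  - --query-frontend.downstream-url=http://thanos-query:10902        # the querier behind the frontend
  - --query-range.split-interval=24h                                 # split long range queries by day
  - --query-range.response-cache-config-file=/etc/thanos/cache.yml   # in-memory or Memcached cache config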

There are also other components, such as Thanos Receive for remote write, but that is not the topic of this article.

Multi Cluster Architecture

There are multiple ways to deploy these components across multiple Kubernetes clusters; some are better than others depending on the use case, and we cannot be exhaustive here.

Multi cluster architecture

Our example runs on AWS with two clusters deployed with tEKS, our all-in-one solution to deploy production-ready EKS clusters on AWS: an observer cluster in eu-west-1 and an observee cluster in eu-west-3.

Our deployment uses the official kube-prometheus-stack chart and the Bitnami Thanos chart.

Everything is curated inside our terraform-kubernetes-addons repository.

The directory structure is the following inside the thanos demo folder:

 .
├──  env_tags.yaml
├──  eu-west-1
│  ├──  clusters
│  │  └──  observer
│  │     ├──  eks
│  │     │  ├──  kubeconfig
│  │     │  └──  terragrunt.hcl
│  │     ├──  eks-addons
│  │     │  └──  terragrunt.hcl
│  │     └──  vpc
│  │        └──  terragrunt.hcl
│  └──  region_values.yaml
└──  eu-west-3
   ├──  clusters
   │  └──  observee
   │     ├──  cluster_values.yaml
   │     ├──  eks
   │     │  ├──  kubeconfig
   │     │  └──  terragrunt.hcl
   │     ├──  eks-addons
   │     │  └──  terragrunt.hcl
   │     └──  vpc
   │        └──  terragrunt.hcl
   └──  region_values.yaml

This allows for DRY (Don't Repeat Yourself) infrastructure and makes it easy to scale the number of AWS accounts, regions, and clusters.

Observer Cluster

The observer cluster is our primary cluster, from which we are going to query the other clusters.

Prometheus-operator is running with:

kube-prometheus-stack = {
  enabled                     = true
  allowed_cidrs               = dependency.vpc.outputs.private_subnets_cidr_blocks
  thanos_sidecar_enabled      = true
  thanos_bucket_force_destroy = true
  extra_values                = <<-EXTRA_VALUES
    grafana:
      deploymentStrategy:
        type: Recreate
      ingress:
        enabled: true
        annotations:
          kubernetes.io/ingress.class: nginx
          cert-manager.io/cluster-issuer: "letsencrypt"
        hosts:
          - grafana.${local.default_domain_suffix}
        tls:
          - secretName: grafana.${local.default_domain_suffix}
            hosts:
              - grafana.${local.default_domain_suffix}
      persistence:
        enabled: true
        storageClassName: ebs-sc
        accessModes:
          - ReadWriteOnce
        size: 1Gi
    prometheus:
      prometheusSpec:
        replicas: 1
        retention: 2d
        retentionSize: "10GB"
        ruleSelectorNilUsesHelmValues: false
        serviceMonitorSelectorNilUsesHelmValues: false
        podMonitorSelectorNilUsesHelmValues: false
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ebs-sc
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi
    EXTRA_VALUES

A CA is generated for the observer cluster; it is used for the mutual TLS authentication between the observer and the observee clusters.

The standard Thanos components (query, query frontend, store gateway, and compactor) are deployed.

Additional Thanos components are deployed to query the observee cluster:

thanos-tls-querier = {
  "observee" = {
    enabled                 = true
    default_global_requests = true
    default_global_limits   = false
    stores = [
      "thanos-sidecar.${local.default_domain_suffix}:443"
    ]
  }
}

thanos-storegateway = {
  "observee" = {
    enabled                 = true
    default_global_requests = true
    default_global_limits   = false
    bucket                  = "thanos-store-pio-thanos-observee"
    region                  = "eu-west-3"
  }
}

Observee cluster

Observee clusters are Kubernetes clusters with a minimal Prometheus/Thanos installation that are going to be queried by the observer cluster.

Prometheus operator is running with:

 kube-prometheus-stack = {
  enabled                     = true
  allowed_cidrs               = dependency.vpc.outputs.private_subnets_cidr_blocks
  thanos_sidecar_enabled      = true
  thanos_bucket_force_destroy = true
  extra_values                = <<-EXTRA_VALUES
    grafana:
      enabled: false
    prometheus:
      thanosIngress:
        enabled: true
        ingressClassName: nginx
        annotations:
          cert-manager.io/cluster-issuer: "letsencrypt"
          nginx.ingress.kubernetes.io/ssl-redirect: "true"
          nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
          nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
          nginx.ingress.kubernetes.io/auth-tls-secret: "monitoring/thanos-ca"
        hosts:
        - thanos-sidecar.${local.default_domain_suffix}
        paths:
        - /
        tls:
        - secretName: thanos-sidecar.${local.default_domain_suffix}
          hosts:
          - thanos-sidecar.${local.default_domain_suffix}
      prometheusSpec:
        replicas: 1
        retention: 2d
        retentionSize: "6GB"
        ruleSelectorNilUsesHelmValues: false
        serviceMonitorSelectorNilUsesHelmValues: false
        podMonitorSelectorNilUsesHelmValues: false
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ebs-sc
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi
    EXTRA_VALUES

Thanos components deployed:

thanos = {
  enabled = true
  bucket_force_destroy = true
  trusted_ca_content      = dependency.thanos-ca.outputs.thanos_ca
  extra_values = <<-EXTRA_VALUES
    compactor:
      retentionResolution5m: 90d
    query:
      enabled: false
    queryFrontend:
      enabled: false
    storegateway:
      enabled: false
    EXTRA_VALUES
}

Diving a little deeper

Let’s check what is running on our clusters. On the observer we have:

kubectl -n monitoring get pods
NAME                                                        READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0           2/2     Running   0          120m
kube-prometheus-stack-grafana-c8768466b-rd8wm               2/2     Running   0          120m
kube-prometheus-stack-kube-state-metrics-5cf575d8f8-x59rd   1/1     Running   0          120m
kube-prometheus-stack-operator-6856b9bb58-hdrb2             1/1     Running   0          119m
kube-prometheus-stack-prometheus-node-exporter-8hvmv        1/1     Running   0          117m
kube-prometheus-stack-prometheus-node-exporter-cwlfd        1/1     Running   0          120m
kube-prometheus-stack-prometheus-node-exporter-rsss5        1/1     Running   0          120m
kube-prometheus-stack-prometheus-node-exporter-rzgr9        1/1     Running   0          120m
prometheus-kube-prometheus-stack-prometheus-0               3/3     Running   1          120m
thanos-compactor-74784bd59d-vmvps                           1/1     Running   0          119m
thanos-query-7c74db546c-d7bp8                               1/1     Running   0          12m
thanos-query-7c74db546c-ndnx2                               1/1     Running   0          12m
thanos-query-frontend-5cbcb65b57-5sx8z                      1/1     Running   0          119m
thanos-query-frontend-5cbcb65b57-qjhxg                      1/1     Running   0          119m
thanos-storegateway-0                                       1/1     Running   0          119m
thanos-storegateway-1                                       1/1     Running   0          118m
thanos-storegateway-observee-storegateway-0                 1/1     Running   0          12m
thanos-storegateway-observee-storegateway-1                 1/1     Running   0          11m
thanos-tls-querier-observee-query-dfb9f79f9-4str8           1/1     Running   0          29m
thanos-tls-querier-observee-query-dfb9f79f9-xsq24           1/1     Running   0          29m

kubectl -n monitoring get ingress
NAME                            CLASS    HOSTS                                            ADDRESS                                                                         PORTS     AGE
kube-prometheus-stack-grafana   <none>   grafana.thanos.teks-tg.clusterfrak-dynamics.io   k8s-ingressn-ingressn-afa0a48374-f507283b6cd101c5.elb.eu-west-1.amazonaws.com   80, 443   123m

And on the observee:

kubectl -n monitoring get pods
NAME                                                        READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0           2/2     Running   0          39m
kube-prometheus-stack-kube-state-metrics-5cf575d8f8-ct292   1/1     Running   0          39m
kube-prometheus-stack-operator-6856b9bb58-4cngc             1/1     Running   0          39m
kube-prometheus-stack-prometheus-node-exporter-bs4wp        1/1     Running   0          39m
kube-prometheus-stack-prometheus-node-exporter-c57ss        1/1     Running   0          39m
kube-prometheus-stack-prometheus-node-exporter-cp5ch        1/1     Running   0          39m
kube-prometheus-stack-prometheus-node-exporter-tnqvq        1/1     Running   0          39m
kube-prometheus-stack-prometheus-node-exporter-z2p49        1/1     Running   0          39m
kube-prometheus-stack-prometheus-node-exporter-zzqp7        1/1     Running   0          39m
prometheus-kube-prometheus-stack-prometheus-0               3/3     Running   1          39m
thanos-compactor-7576dcbcfc-6pd4v                           1/1     Running   0          38m

kubectl -n monitoring get ingress
NAME                                   CLASS   HOSTS                                                   ADDRESS                                                                         PORTS     AGE
kube-prometheus-stack-thanos-gateway   nginx   thanos-sidecar.thanos.teks-tg.clusterfrak-dynamics.io   k8s-ingressn-ingressn-95903f6102-d2ce9013ac068b9e.elb.eu-west-3.amazonaws.com   80, 443   40m

Our TLS queriers are supposed to be able to query the observee cluster for metrics. Let's check their behavior:


kubectl -n monitoring logs -f thanos-tls-querier-observee-query-687dd88ff5-nzpdh

level=info ts=2021-02-23T15:37:35.692346206Z caller=storeset.go:387 component=storeset msg="adding new storeAPI to query storeset" address=thanos-sidecar.thanos.teks-tg.clusterfrak-dynamics.io:443 extLset="{cluster=\"pio-thanos-observee\", prometheus=\"monitoring/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-0\"}"

So these querier pods can query our other cluster. If we check the web UI, we can see the stores:

kubectl -n monitoring port-forward thanos-tls-querier-observee-query-687dd88ff5-nzpdh 10902
Thanos MTLS UI

Great, but we only have one store! Remember when we said that queriers can stack on top of each other? In our observer cluster, we have a standard HTTP querier that can query the other components, as in the architecture drawing.

kubectl -n monitoring port-forward thanos-query-7c74db546c-d7bp8 10902

Here we can see all the stores that have been added to our central querier:

Thanos query UI

Visualisation in Grafana

Finally, we can head to Grafana and see how the default Kubernetes dashboards have been made multicluster-compatible.

Thanos multicluster UI

Conclusion

Thanos is a complex system with a lot of moving parts; we did not deep dive into the specific custom configuration involved here, as it would take too much time.

We offer a fairly complete implementation for AWS in our tEKS repository that abstracts a lot of the complexity (mostly the mTLS part) and allows for a lot of customization. You can also use our terraform-kubernetes-addons module as a standalone component. We plan to support other cloud providers in the future. Don't hesitate to contact us through GitHub issues on either of these projects.

There are many possible Thanos implementations that might suit you depending on your infrastructure and your requirements.

If you want to dive deeper into Thanos, you can check the official kube-thanos repository and their recommendations about cross-cluster communication.

And of course, we are happy to help you set up your cloud native monitoring stack; contact us at contact@particule.io 🙂

You can also reach us every day on the CNCF/Kubernetes Slack channels.