Published: Feb 22, 2024 by Isaac Johnson
Recently I was looking to fetch some metrics from the Prometheus bundled with Kubecost and found it was down. Today we’ll look at how to fix a stuck Prometheus server, upgrade Kubecost to the latest chart with Helm, and then explore some of the newer alerting features. We’ll use email, Slack and PagerDuty for that.
Fixing
First, I found that the Kubecost Prometheus metrics collector had been down.
Checking into why, I found the Prometheus server was broken due to a stuck lock file:
$ kubectl logs kubecost-prometheus-server-8697c46bc-mwp8z -n kubecost prometheus-server | tail -n 10
ts=2024-01-27T15:29:33.698Z caller=main.go:833 level=info msg="Stopping notify discovery manager..."
ts=2024-01-27T15:29:33.698Z caller=main.go:855 level=info msg="Stopping scrape manager..."
ts=2024-01-27T15:29:33.698Z caller=main.go:829 level=info msg="Notify discovery manager stopped"
ts=2024-01-27T15:29:33.698Z caller=main.go:815 level=info msg="Scrape discovery manager stopped"
ts=2024-01-27T15:29:33.698Z caller=manager.go:950 level=info component="rule manager" msg="Stopping rule manager..."
ts=2024-01-27T15:29:33.698Z caller=manager.go:960 level=info component="rule manager" msg="Rule manager stopped"
ts=2024-01-27T15:29:33.698Z caller=notifier.go:600 level=info component=notifier msg="Stopping notification manager..."
ts=2024-01-27T15:29:33.698Z caller=main.go:1088 level=info msg="Notifier manager stopped"
ts=2024-01-27T15:29:33.698Z caller=main.go:849 level=info msg="Scrape manager stopped"
ts=2024-01-27T15:29:33.698Z caller=main.go:1097 level=error err="opening storage failed: lock DB directory: resource temporarily unavailable"
To fix it, I need to get onto that volume (storage-volume).
$ kubectl get deployments -n kubecost kubecost-prometheus-server -o yaml | tail -n50 | head -n33
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
        - mountPath: /data
          name: storage-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
        runAsGroup: 1001
        runAsNonRoot: true
        runAsUser: 1001
      serviceAccount: kubecost-prometheus-server
      serviceAccountName: kubecost-prometheus-server
      terminationGracePeriodSeconds: 300
      volumes:
      - configMap:
          defaultMode: 420
          name: kubecost-prometheus-server
        name: config-volume
      - name: storage-volume
        persistentVolumeClaim:
          claimName: kubecost-prometheus-server
status:
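As an aside, if you just want the PVC claim name without scanning the YAML, a jsonpath query will pull it out directly (a quick sketch; adjust the volume name if your release differs):
$ kubectl get deployment kubecost-prometheus-server -n kubecost \
  -o jsonpath='{.spec.template.spec.volumes[?(@.name=="storage-volume")].persistentVolumeClaim.claimName}'
kubecost-prometheus-server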
While I can’t exec into a crashing pod to fix it directly
$ kubectl exec -it kubecost-prometheus-server-8697c46bc-mwp8z -n kubecost --container prometheus-server -- /bin/bash
error: unable to upgrade connection: container not found ("prometheus-server")
I can, however, make a debug pod
$ cat debug.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-pod
spec:
  containers:
  - name: ubuntu-container
    image: ubuntu
    command: ["/bin/sleep", "3600"]
    volumeMounts:
    - name: kps-volume
      mountPath: /mnt/kps
    - name: kpsvc-volume
      mountPath: /mnt/kpsvc
  volumes:
  - name: kps-volume
    configMap:
      name: kubecost-prometheus-server
  - name: kpsvc-volume
    persistentVolumeClaim:
      claimName: kubecost-prometheus-server
Launch it into the namespace
$ kubectl apply -f ./debug.yaml -n kubecost
pod/ubuntu-pod created
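If you’d rather not poll kubectl get pods by hand, kubectl wait will block until the debug pod is ready:
$ kubectl wait --for=condition=Ready pod/ubuntu-pod -n kubecost --timeout=60s
pod/ubuntu-pod condition met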
And then I can exec into it and remove the lock
builder@DESKTOP-QADGF36:~$ kubectl exec -it ubuntu-pod -n kubecost -- /bin/bash
root@ubuntu-pod:/# ls -ltra /mnt/kps
total 12
lrwxrwxrwx 1 root root 12 Jan 27 15:39 rules -> ..data/rules
lrwxrwxrwx 1 root root 26 Jan 27 15:39 recording_rules.yml -> ..data/recording_rules.yml
lrwxrwxrwx 1 root root 21 Jan 27 15:39 prometheus.yml -> ..data/prometheus.yml
lrwxrwxrwx 1 root root 13 Jan 27 15:39 alerts -> ..data/alerts
lrwxrwxrwx 1 root root 25 Jan 27 15:39 alerting_rules.yml -> ..data/alerting_rules.yml
lrwxrwxrwx 1 root root 32 Jan 27 15:39 ..data -> ..2024_01_27_15_39_55.3698848498
drwxr-xr-x 2 root root 4096 Jan 27 15:39 ..2024_01_27_15_39_55.3698848498
drwxrwxrwx 3 root root 4096 Jan 27 15:39 .
drwxr-xr-x 1 root root 4096 Jan 27 15:40 ..
root@ubuntu-pod:/# ls -ltra /mnt/kps
kps/ kpsvc/
root@ubuntu-pod:/# ls -ltra /mnt/kpsvc/
total 120
-rw-r--r-- 1 1001 1001 0 Aug 1 20:57 lock
drwxr-xr-x 3 1001 1001 4096 Dec 16 23:00 01HHTDQMSYP3DFG77GAMCSTR4E
drwxr-xr-x 3 1001 1001 4096 Dec 17 17:00 01HHWBH1K67M2YMHXNNBR8G19X
drwxr-xr-x 3 1001 1001 4096 Dec 18 11:00 01HHY9APK63H6KRASAFV37KQC5
drwxr-xr-x 3 1001 1001 4096 Dec 19 05:00 01HJ0745GH4KKR6Z0GHTTXH98H
drwxr-xr-x 3 1001 1001 4096 Dec 19 23:00 01HJ24XRMSP1QZ025K0T4RS4BR
drwxr-xr-x 3 1001 1001 4096 Dec 20 17:00 01HJ42Q73PRYGXWKPB59DV9DE3
drwxr-xr-x 3 1001 1001 4096 Dec 21 11:00 01HJ60GTXNPQYEAQ2PKG767CE3
drwxr-xr-x 3 1001 1001 4096 Dec 22 05:00 01HJ7YA9H8WGHF8J3T3BY1VGTK
drwxr-xr-x 3 1001 1001 4096 Dec 22 23:00 01HJ9W3YPB3K1RHT8CDAA77X8J
drwxr-xr-x 3 1001 1001 4096 Dec 23 17:00 01HJBSXFTNX4MWRD3AZCZBGJ83
drwxr-xr-x 3 1001 1001 4096 Dec 24 11:00 01HJDQQ62R6JRVCNG4J4C8FKYJ
drwxr-xr-x 3 1001 1001 4096 Dec 25 05:00 01HJFNGJ031MNESX2JYAS67R0G
drwxr-xr-x 3 1001 1001 4096 Dec 25 23:00 01HJHKA8VJD4R7Q1NX151P260K
drwxr-xr-x 3 1001 1001 4096 Dec 26 17:00 01HJKH3PA91A8TEMFFN4M0TGRF
drwxr-xr-x 3 1001 1001 4096 Dec 27 11:00 01HJNEXDBT8DQ83YR2ECS32KWK
drwxr-xr-x 3 1001 1001 4096 Dec 28 05:00 01HJQCPQ276G5ENYR82KXM7ZS3
drwxr-xr-x 3 1001 1001 4096 Dec 28 23:00 01HJSAGC54R2AWAXK8Q2CXGP28
drwxr-xr-x 3 1001 1001 4096 Dec 29 17:00 01HJV89SJTXRXGXMPSY5PJBB24
drwxr-xr-x 3 1001 1001 4096 Dec 30 11:00 01HJX63HQ48JYCHWRK7ANS00R3
drwxr-xr-x 3 1001 1001 4096 Dec 30 17:00 01HJXTP8WMSNZZ3MSK4ZXEN0D4
drwxr-xr-x 3 1001 1001 4096 Dec 30 21:00 01HJY8DKG3P63D3T8D3XYQZ26D
drwxr-xr-x 3 1001 1001 4096 Dec 30 23:00 01HJYF9AQPNRWS7RJNVYDJ0BFG
drwxr-xr-x 3 1001 1001 4096 Dec 30 23:00 01HJYF9MGWEMYSB8HPTWGM87KH
drwxr-xr-x 3 1001 1001 4096 Dec 31 01:00 01HJYP520BRZMD7GRWM0H7D1YB
drwxr-xr-x 3 1001 1001 4096 Dec 31 03:00 01HJYX0SANQ018ZWYTAPDAC0K7
drwxrwxrwx 29 root root 4096 Dec 31 03:00 .
drwxr-xr-x 2 1001 1001 4096 Dec 31 03:00 chunks_head
drwxr-xr-x 3 1001 1001 4096 Dec 31 03:00 wal
-rw-r--r-- 1 1001 1001 1001 Jan 27 15:39 queries.active
drwxr-xr-x 1 root root 4096 Jan 27 15:40 ..
root@ubuntu-pod:/# rm -f /mnt/kpsvc/lock
root@ubuntu-pod:/# ls -ltra /mnt/kpsvc/
total 120
drwxr-xr-x 3 1001 1001 4096 Dec 16 23:00 01HHTDQMSYP3DFG77GAMCSTR4E
drwxr-xr-x 3 1001 1001 4096 Dec 17 17:00 01HHWBH1K67M2YMHXNNBR8G19X
drwxr-xr-x 3 1001 1001 4096 Dec 18 11:00 01HHY9APK63H6KRASAFV37KQC5
drwxr-xr-x 3 1001 1001 4096 Dec 19 05:00 01HJ0745GH4KKR6Z0GHTTXH98H
drwxr-xr-x 3 1001 1001 4096 Dec 19 23:00 01HJ24XRMSP1QZ025K0T4RS4BR
drwxr-xr-x 3 1001 1001 4096 Dec 20 17:00 01HJ42Q73PRYGXWKPB59DV9DE3
drwxr-xr-x 3 1001 1001 4096 Dec 21 11:00 01HJ60GTXNPQYEAQ2PKG767CE3
drwxr-xr-x 3 1001 1001 4096 Dec 22 05:00 01HJ7YA9H8WGHF8J3T3BY1VGTK
drwxr-xr-x 3 1001 1001 4096 Dec 22 23:00 01HJ9W3YPB3K1RHT8CDAA77X8J
drwxr-xr-x 3 1001 1001 4096 Dec 23 17:00 01HJBSXFTNX4MWRD3AZCZBGJ83
drwxr-xr-x 3 1001 1001 4096 Dec 24 11:00 01HJDQQ62R6JRVCNG4J4C8FKYJ
drwxr-xr-x 3 1001 1001 4096 Dec 25 05:00 01HJFNGJ031MNESX2JYAS67R0G
drwxr-xr-x 3 1001 1001 4096 Dec 25 23:00 01HJHKA8VJD4R7Q1NX151P260K
drwxr-xr-x 3 1001 1001 4096 Dec 26 17:00 01HJKH3PA91A8TEMFFN4M0TGRF
drwxr-xr-x 3 1001 1001 4096 Dec 27 11:00 01HJNEXDBT8DQ83YR2ECS32KWK
drwxr-xr-x 3 1001 1001 4096 Dec 28 05:00 01HJQCPQ276G5ENYR82KXM7ZS3
drwxr-xr-x 3 1001 1001 4096 Dec 28 23:00 01HJSAGC54R2AWAXK8Q2CXGP28
drwxr-xr-x 3 1001 1001 4096 Dec 29 17:00 01HJV89SJTXRXGXMPSY5PJBB24
drwxr-xr-x 3 1001 1001 4096 Dec 30 11:00 01HJX63HQ48JYCHWRK7ANS00R3
drwxr-xr-x 3 1001 1001 4096 Dec 30 17:00 01HJXTP8WMSNZZ3MSK4ZXEN0D4
drwxr-xr-x 3 1001 1001 4096 Dec 30 21:00 01HJY8DKG3P63D3T8D3XYQZ26D
drwxr-xr-x 3 1001 1001 4096 Dec 30 23:00 01HJYF9AQPNRWS7RJNVYDJ0BFG
drwxr-xr-x 3 1001 1001 4096 Dec 30 23:00 01HJYF9MGWEMYSB8HPTWGM87KH
drwxr-xr-x 3 1001 1001 4096 Dec 31 01:00 01HJYP520BRZMD7GRWM0H7D1YB
drwxr-xr-x 3 1001 1001 4096 Dec 31 03:00 01HJYX0SANQ018ZWYTAPDAC0K7
drwxr-xr-x 2 1001 1001 4096 Dec 31 03:00 chunks_head
drwxr-xr-x 3 1001 1001 4096 Dec 31 03:00 wal
-rw-r--r-- 1 1001 1001 1001 Jan 27 15:39 queries.active
drwxr-xr-x 1 root root 4096 Jan 27 15:40 ..
drwxrwxrwx 29 root root 4096 Jan 27 15:41 .
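With the lock file gone, the debug pod can be cleaned up. The crash-looping Prometheus pod will retry on its own, but deleting it skips the remaining backoff (a sketch of the cleanup; the second delete is optional):
$ kubectl delete pod ubuntu-pod -n kubecost
$ kubectl delete pod kubecost-prometheus-server-8697c46bc-mwp8z -n kubecost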
That fixed it
$ kubectl get pods -n kubecost
NAME READY STATUS RESTARTS AGE
kubecost-prometheus-node-exporter-5nnlx 1/1 Running 5 (72d ago) 323d
kubecost-prometheus-node-exporter-b6gbr 1/1 Running 7 (122d ago) 323d
kubecost-cost-analyzer-6b4bd74fc9-pmrw6 2/2 Running 0 253d
kubecost-prometheus-node-exporter-2wsf2 1/1 Running 2 (253d ago) 323d
kubecost-grafana-8496485545-655mb 2/2 Running 4 (253d ago) 302d
kubecost-kube-state-metrics-59fd4555f4-lw5b4 1/1 Running 2 (253d ago) 302d
kubecost-prometheus-node-exporter-f9w74 1/1 Running 14 (27d ago) 323d
kubecost-prometheus-server-8697c46bc-bs582 2/2 Running 0 10m
With it restored, I can now check costs. For the days Prometheus was down, there is simply no data
I see notifications that updates are available
And I see the same in settings, which shows we are at version 1.100.2
Which lines up with our chart
$ helm list -n kubecost
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
kubecost kubecost 1 2023-03-09 16:31:25.359127784 -0600 CST deployed cost-analyzer-1.100.2 1.100.2
Since I’m not using any custom values
$ helm get values -n kubecost kubecost
USER-SUPPLIED VALUES:
null
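Even with no user-supplied values, it doesn’t hurt to skim the incoming chart defaults before jumping several minor versions (helm show values is built in, nothing extra to install):
$ helm show values kubecost/cost-analyzer | head -n 40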
I’ll just try a simple update and upgrade
$ helm repo update
$ helm upgrade kubecost -n kubecost kubecost/cost-analyzer
W0128 12:03:29.164623 26752 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0128 12:03:30.589349 26752 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0128 12:03:30.665985 26752 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0128 12:03:31.718732 26752 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
Release "kubecost" has been upgraded. Happy Helming!
NAME: kubecost
LAST DEPLOYED: Sun Jan 28 12:03:21 2024
NAMESPACE: kubecost
STATUS: deployed
REVISION: 2
NOTES:
--------------------------------------------------
Kubecost 1.108.1 has been successfully installed.
Please allow 5-10 minutes for Kubecost to gather metrics.
When configured, cost reconciliation with cloud provider billing data will have a 48 hour delay.
When pods are Ready, you can enable port-forwarding with the following command:
kubectl port-forward --namespace kubecost deployment/kubecost-cost-analyzer 9090
Then, navigate to http://localhost:9090 in a web browser.
Having installation issues? View our Troubleshooting Guide at http://docs.kubecost.com/troubleshoot-install
I see this brought us up to version 1.108.1
builder@DESKTOP-QADGF36:~$ kubectl get pods -n kubecost
NAME READY STATUS RESTARTS AGE
kubecost-grafana-8496485545-655mb 2/2 Running 4 (254d ago) 303d
kubecost-prometheus-node-exporter-p9nr6 1/1 Running 0 31s
kubecost-cost-analyzer-657b9686fb-5zr7q 0/2 Running 0 35s
kubecost-grafana-956b98df8-kwpq8 1/2 Running 0 37s
kubecost-prometheus-node-exporter-h8tgd 1/1 Running 0 20s
kubecost-prometheus-server-7c6b558d97-pws4m 1/2 Running 0 31s
kubecost-prometheus-node-exporter-dvqpq 1/1 Running 0 12s
kubecost-prometheus-node-exporter-dtc9x 1/1 Running 0 5s
builder@DESKTOP-QADGF36:~$ helm list -n kubecost
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
kubecost kubecost 2 2024-01-28 12:03:21.461670211 -0600 CST deployed cost-analyzer-1.108.1 1.108.1
This looks pretty similar
Even though this is an on-prem cluster which means the only real cost I have is electricity (and most of the time my solar panels negate that), I do find cost details helpful.
For instance, looking at the “right-sizing” of containers can help us tweak our Requests/Limits to better fit the cluster.
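Acting on a right-sizing recommendation usually just means adjusting the requests/limits on the workload. kubectl set resources is a quick way to do that; the names and numbers below are purely illustrative, not from my cluster:
$ kubectl set resources deployment/wordpress -n wordpress \
  --requests=cpu=100m,memory=256Mi --limits=cpu=500m,memory=512Mi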
Another great report is looking at the Cumulative costs by CPU. This shows me what namespaces are chewing up resources. At least once I used this to find an old “idea” I left running that was burning cycles unnecessarily.
In this latest release they’ve added a button you cannot use without an Enterprise license. I find this rather annoying. It feels like adware
And of course, if I choose to “Upgrade”, I get no prices (boooo!!!!)
Though, if you choose the SaaS option, there are prices. I should add that they have brought features like budgets/alerts to the free tier, which is nice
Alerts
Let’s add an alert for Cluster Health Monitoring
Here, when I test it, I can see outputs to Slack
and email
I was a bit curious how they were sending the emails, and it seems they use SendGrid
Kubecost Health Monitoring is more about notifying you when it cannot connect to its storage. I’m not sure if that means just Prometheus, or Prometheus and its PVCs. Either way, I see that as an infra outage and would treat it as such.
I’ll grab the email for my OnPremAlerts in PagerDuty
And use that in the settings
I was a bit thrown by getting a PagerDuty alert about 15 minutes later. When I went to check, it seems to have suggested Kubecost “added” a new cluster.
I’ll be keeping an eye on this in case it ends up being noisy (in which case it cannot use PagerDuty, which is my ring-my-phone option).
Update
In the weeks that followed, I only got paged a couple of times. For instance, when a cluster came back online
Looking at a few days of data, we can see it’s pretty consistent on usage
However, we can only look at two weeks of data due to a paywall
This does not bother me too much as I also run OpenCost
Doing a port forward to that
$ kubectl port-forward svc/opencost 9090:9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
Handling connection for 9090
Handling connection for 9090
Handling connection for 9090
Handling connection for 9090
Handling connection for 9090
I can fetch data as far back as OpenCost has collected it
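Beyond the UI, OpenCost also serves an allocation API. Assuming the chart’s default API port of 9003 and the /allocation/compute endpoint (both worth confirming against your chart and cost-model versions), something like this returns the allocation data as JSON:
$ kubectl port-forward svc/opencost 9003:9003 &
$ curl -s 'http://localhost:9003/allocation/compute?window=7d&aggregate=namespace'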
Presently, I’m running version 1.103.1 from chart 1.14.0
$ helm list | grep open
opencost default 1 2023-05-17 09:12:38.435624112 -0500 CDT deployed opencost-1.14.0 1.103.1
otelcol default 5 2023-01-11 07:24:33.952482006 -0600 CST deployed opentelemetry-collector-0.43.2 0.67.0
What is clever about it, though, is that it’s really just fetching the data from the Kubecost Prometheus (so I didn’t need two collectors)
$ helm get values opencost
USER-SUPPLIED VALUES:
opencost:
  exporter:
    image:
      registry: gcr.io
      repository: kubecost1/opencost
      tag: kc-eu-2023
  prometheus:
    external:
      enabled: true
      url: http://kubecost-prometheus-server.kubecost.svc
    internal:
      enabled: false
Let’s upgrade to the latest and see if there are new features
$ helm repo add opencost-charts https://opencost.github.io/opencost-helm-chart
"opencost-charts" has been added to your repositories
$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "opencost-charts" chart repository
...Successfully got an update from the "opencost" chart repository
...Successfully got an update from the "azure-samples" chart repository
...Successfully got an update from the "jfelten" chart repository
...Successfully got an update from the "kuma" chart repository
...Successfully got an update from the "confluentinc" chart repository
...Successfully got an update from the "sonarqube" chart repository
...Successfully got an update from the "jetstack" chart repository
...Successfully got an update from the "makeplane" chart repository
...Successfully got an update from the "actions-runner-controller" chart repository
...Successfully got an update from the "gitea-charts" chart repository
...Successfully got an update from the "lifen-charts" chart repository
...Successfully got an update from the "rook-release" chart repository
...Successfully got an update from the "portainer" chart repository
...Successfully got an update from the "crossplane-stable" chart repository
...Successfully got an update from the "akomljen-charts" chart repository
...Successfully got an update from the "elastic" chart repository
...Successfully got an update from the "rancher-latest" chart repository
...Successfully got an update from the "rhcharts" chart repository
...Successfully got an update from the "nginx-stable" chart repository
...Successfully got an update from the "zabbix-community" chart repository
...Successfully got an update from the "sumologic" chart repository
...Successfully got an update from the "adwerx" chart repository
...Successfully got an update from the "harbor" chart repository
...Successfully got an update from the "kube-state-metrics" chart repository
...Successfully got an update from the "nfs" chart repository
...Successfully got an update from the "ngrok" chart repository
...Successfully got an update from the "dapr" chart repository
...Successfully got an update from the "datadog" chart repository
...Unable to get an update from the "freshbrewed" chart repository (https://harbor.freshbrewed.science/chartrepo/library):
failed to fetch https://harbor.freshbrewed.science/chartrepo/library/index.yaml : 404 Not Found
...Unable to get an update from the "myharbor" chart repository (https://harbor.freshbrewed.science/chartrepo/library):
failed to fetch https://harbor.freshbrewed.science/chartrepo/library/index.yaml : 404 Not Found
...Successfully got an update from the "btungut" chart repository
...Successfully got an update from the "novum-rgi-helm" chart repository
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "openfunction" chart repository
...Successfully got an update from the "longhorn" chart repository
...Successfully got an update from the "hashicorp" chart repository
...Successfully got an update from the "kiwigrid" chart repository
...Successfully got an update from the "incubator" chart repository
...Successfully got an update from the "spacelift" chart repository
...Successfully got an update from the "castai-helm" chart repository
...Successfully got an update from the "kubecost" chart repository
...Successfully got an update from the "signoz" chart repository
...Successfully got an update from the "openzipkin" chart repository
...Successfully got an update from the "open-telemetry" chart repository
...Successfully got an update from the "argo-cd" chart repository
...Successfully got an update from the "ananace-charts" chart repository
...Successfully got an update from the "uptime-kuma" chart repository
...Successfully got an update from the "newrelic" chart repository
...Successfully got an update from the "gitlab" chart repository
...Successfully got an update from the "grafana" chart repository
...Successfully got an update from the "bitnami" chart repository
...Successfully got an update from the "prometheus-community" chart repository
...Unable to get an update from the "epsagon" chart repository (https://helm.epsagon.com):
Get "https://helm.epsagon.com/index.yaml": dial tcp: lookup helm.epsagon.com on 172.22.64.1:53: server misbehaving
Update Complete. ⎈Happy Helming!⎈
I did a quick peek at the current Helm values for the chart to be sure I didn’t miss anything; once confirmed, upgrading was as simple as dumping my current values and doing a helm upgrade
$ helm get values opencost -o yaml > opencost.values.yaml
$ helm upgrade opencost opencost-charts/opencost -f ./opencost.values.yaml
Release "opencost" has been upgraded. Happy Helming!
NAME: opencost
LAST DEPLOYED: Wed Feb 7 06:26:09 2024
NAMESPACE: default
STATUS: deployed
REVISION: 2
TEST SUITE: None
I actually caught a quick issue: I had set a specific image in my values
$ helm get values opencost -o yaml
opencost:
  exporter:
    image:
      registry: gcr.io
      repository: kubecost1/opencost
      tag: kc-eu-2023
  prometheus:
    external:
      enabled: true
      url: http://kubecost-prometheus-server.kubecost.svc
    internal:
      enabled: false
The latest is actually over in quay.io, not GCR.
I updated to the proper image:
$ cat opencost.values.yaml
opencost:
  exporter:
    image:
      registry: quay.io
      repository: kubecost1/kubecost-cost-model
      tag: ""
  prometheus:
    external:
      enabled: true
      url: http://kubecost-prometheus-server.kubecost.svc
    internal:
      enabled: false
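Re-applying the fix is the same upgrade command as before, just pointed at the corrected values file:
$ helm upgrade opencost opencost-charts/opencost -f ./opencost.values.yaml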
This time it took a few minutes to come up
$ kubectl get pod -l app.kubernetes.io/name=opencost
NAME READY STATUS RESTARTS AGE
opencost-cb984b766-5n8mn 2/2 Running 0 3m20s
I think it was due to some old data on unmounted PVCs (from a prior Gitea install)
$ kubectl logs opencost-cb984b766-5n8mn
Defaulted container "opencost" out of: opencost, opencost-ui
2024-02-07T12:30:26.11723109Z ??? Log level set to info
2024-02-07T12:30:26.117937899Z INF Starting cost-model version 1.108.0 (25d0064)
2024-02-07T12:30:26.117999347Z INF Prometheus/Thanos Client Max Concurrency set to 5
2024-02-07T12:30:26.133044116Z INF Success: retrieved the 'up' query against prometheus at: http://kubecost-prometheus-server.kubecost.svc
2024-02-07T12:30:26.137764604Z INF Retrieved a prometheus config file from: http://kubecost-prometheus-server.kubecost.svc
2024-02-07T12:30:26.141746906Z INF Using scrape interval of 60.000000
2024-02-07T12:30:26.142502041Z INF NAMESPACE: kubecost
2024-02-07T12:30:27.043696086Z INF Done waiting
2024-02-07T12:30:27.044013503Z INF Starting *v1.Deployment controller
2024-02-07T12:30:27.044253443Z INF Starting *v1.Namespace controller
2024-02-07T12:30:27.044303058Z INF Starting *v1.StatefulSet controller
2024-02-07T12:30:27.044337145Z INF Starting *v1.ReplicaSet controller
2024-02-07T12:30:27.044349672Z INF Starting *v1.Service controller
2024-02-07T12:30:27.044365081Z INF Starting *v1.PersistentVolume controller
2024-02-07T12:30:27.044394376Z INF Starting *v1.Node controller
2024-02-07T12:30:27.044396972Z INF Starting *v1.PersistentVolumeClaim controller
2024-02-07T12:30:27.04442263Z INF Starting *v1.Pod controller
2024-02-07T12:30:27.044424583Z INF Starting *v1.StorageClass controller
2024-02-07T12:30:27.044448724Z INF Starting *v1.ConfigMap controller
2024-02-07T12:30:27.044451017Z INF Starting *v1.Job controller
2024-02-07T12:30:27.044476637Z INF Starting *v1.ReplicationController controller
2024-02-07T12:30:27.04450075Z INF Starting *v1beta1.PodDisruptionBudget controller
2024-02-07T12:30:27.044673376Z INF Starting *v1.DaemonSet controller
2024-02-07T12:30:27.065100155Z INF Unsupported provider, falling back to default
2024-02-07T12:30:27.182644557Z INF No pricing-configs configmap found at install time, using existing configs: configmaps "pricing-configs" not found
2024-02-07T12:30:27.296704924Z INF No metrics-config configmap found at install time, using existing configs: configmaps "metrics-config" not found
2024-02-07T12:30:27.297392041Z INF Init: AggregateCostModel cache warming enabled
2024-02-07T12:30:27.297503955Z INF EXPORT_CSV_FILE is not set, CSV export is disabled
2024-02-07T12:30:27.297839364Z INF aggregation: cache warming defaults: 1d::::::::namespace::::weighted:false:false:true
2024-02-07T12:30:27.29790814Z INF ComputeAggregateCostModel: missed cache: 1d:1m:1.000000h:false (found false, disableAggregateCostModelCache true, noCache false)
2024-02-07T12:30:27.297977894Z WRN Failed to locate default region
2024-02-07T12:30:27.310657258Z WRN Failed to load 'name' field for ClusterInfo
2024-02-07T12:30:27.880785727Z WRN Metric emission: error getting LoadBalancer cost: strconv.ParseFloat: parsing "": invalid syntax
2024-02-07T12:30:27.880861288Z INF ComputeAggregateCostModel: setting L2 cache: 1d:1m:1.000000h:false
2024-02-07T12:30:27.883674424Z ERR Metric emission: failed to delete RAMAllocation with labels: [giteatest unmounted-pvs unmounted-pvs ]
2024-02-07T12:30:27.883844951Z ERR Metric emission: failed to delete CPUAllocation with labels: [giteatest unmounted-pvs unmounted-pvs ]
2024-02-07T12:30:27.883912753Z ERR Metric emission: failed to delete GPUAllocation with labels: [giteatest unmounted-pvs unmounted-pvs ]
2024-02-07T12:30:27.884010662Z ERR Metric emission: failed to delete RAMAllocation with labels: [default unmounted-pvs unmounted-pvs ]
2024-02-07T12:30:27.884064128Z ERR Metric emission: failed to delete CPUAllocation with labels: [default unmounted-pvs unmounted-pvs ]
2024-02-07T12:30:27.884107738Z ERR Metric emission: failed to delete GPUAllocation with labels: [default unmounted-pvs unmounted-pvs ]
2024-02-07T12:30:27.894369853Z INF ComputeAggregateCostModel: setting aggregate cache: 1d::::::::namespace::::weighted:false:false:true
2024-02-07T12:30:28.04384772Z INF caching 1d cluster costs for 11m0s
2024-02-07T12:30:28.043893709Z INF aggregation: warm cache: 1d
2024-02-07T12:30:28.299257787Z INF Flushing cost data caches: :0%
2024-02-07T12:31:27.929712188Z WRN No request or usage data found during CPU allocation calculation. Setting allocation to 0.
2024-02-07T12:31:27.930575686Z WRN No request or usage data found during CPU allocation calculation. Setting allocation to 0.
2024-02-07T12:31:27.939911926Z WRN Metric emission: error getting LoadBalancer cost: strconv.ParseFloat: parsing "": invalid syntax
2024-02-07T12:31:27.941441788Z ERR Metric emission: failed to delete RAMAllocation with labels: [giteatest unmounted-pvs unmounted-pvs ]
2024-02-07T12:31:27.941468369Z ERR Metric emission: failed to delete CPUAllocation with labels: [giteatest unmounted-pvs unmounted-pvs ]
2024-02-07T12:31:27.941478248Z ERR Metric emission: failed to delete GPUAllocation with labels: [giteatest unmounted-pvs unmounted-pvs ]
2024-02-07T12:31:27.941522574Z ERR Metric emission: failed to delete RAMAllocation with labels: [default unmounted-pvs unmounted-pvs ]
2024-02-07T12:31:27.941546939Z ERR Metric emission: failed to delete CPUAllocation with labels: [default unmounted-pvs unmounted-pvs ]
2024-02-07T12:31:27.941569898Z ERR Metric emission: failed to delete GPUAllocation with labels: [default unmounted-pvs unmounted-pvs ]
2024-02-07T12:32:27.970317657Z WRN Metric emission: error getting LoadBalancer cost: strconv.ParseFloat: parsing "": invalid syntax
2024-02-07T12:32:27.972171883Z ERR Metric emission: failed to delete RAMAllocation with labels: [default unmounted-pvs unmounted-pvs ]
2024-02-07T12:32:27.97220194Z ERR Metric emission: failed to delete CPUAllocation with labels: [default unmounted-pvs unmounted-pvs ]
2024-02-07T12:32:27.972213251Z ERR Metric emission: failed to delete GPUAllocation with labels: [default unmounted-pvs unmounted-pvs ]
2024-02-07T12:32:27.972243882Z ERR Metric emission: failed to delete RAMAllocation with labels: [giteatest unmounted-pvs unmounted-pvs ]
2024-02-07T12:32:27.972266024Z ERR Metric emission: failed to delete CPUAllocation with labels: [giteatest unmounted-pvs unmounted-pvs ]
2024-02-07T12:32:27.972286617Z ERR Metric emission: failed to delete GPUAllocation with labels: [giteatest unmounted-pvs unmounted-pvs ]
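To confirm where those unmounted-pvs allocations come from, a quick cluster-wide PVC listing is enough (giteatest being, presumably, the namespace of my old Gitea experiment):
$ kubectl get pvc -A | grep -i gitea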
The new UI isn’t that much different. It does have a faster hover-over and a new “Cloud Costs” section
That section needs to be configured before it’s usable, though.
Recommendations
One of the nice features I noticed (once I had collected some new data) was that the Workload Savings recommendations have a link out to Grafana where we can see the specifics behind the request.
For instance, it recommends I right-size my wordpress instance and when I click “Grafana”, I can see recent stats on it:
I tried to load Grafana directly
$ kubectl port-forward svc/kubecost-grafana -n kubecost 8080:80
Forwarding from 127.0.0.1:8080 -> 3000
Forwarding from [::1]:8080 -> 3000
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
E0207 06:41:58.250350 11112 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:8080->[::1]:48836: write tcp6 [::1]:8080->[::1]:48836: write: broken pipe
E0207 06:41:58.254090 11112 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:8080->[::1]:48838: write tcp6 [::1]:8080->[::1]:48838: write: broken pipe
Handling connection for 8080
Handling connection for 8080
E0207 06:41:58.297065 11112 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:8080->[::1]:48858: write tcp6 [::1]:8080->[::1]:48858: write: broken pipe
E0207 06:41:58.300948 11112 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:8080->[::1]:48850: write tcp6 [::1]:8080->[::1]:48850: write: broken pipe
E0207 06:41:58.319769 11112 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:8080->[::1]:48862: write tcp6 [::1]:8080->[::1]:48862: write: broken pipe
E0207 06:41:58.319819 11112 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:8080->[::1]:48848: write tcp6 [::1]:8080->[::1]:48848: write: broken pipe
E0207 06:42:28.231252 11112 portforward.go:347] error creating error stream for port 8080 -> 3000: Timeout occurred
$ kubectl port-forward kubecost-grafana-956b98df8-kwpq8 -n kubecost 3033:3000
Forwarding from 127.0.0.1:3033 -> 3000
Forwarding from [::1]:3033 -> 3000
Handling connection for 3033
Handling connection for 3033
Handling connection for 3033
Handling connection for 3033
Handling connection for 3033
Handling connection for 3033
Handling connection for 3033
E0207 06:43:30.255917 11438 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:3033->[::1]:41866: write tcp6 [::1]:3033->[::1]:41866: write: broken pipe
Handling connection for 3033
E0207 06:43:30.262927 11438 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:3033->[::1]:41874: write tcp6 [::1]:3033->[::1]:41874: write: broken pipe
Handling connection for 3033
E0207 06:43:30.307733 11438 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:3033->[::1]:41898: write tcp6 [::1]:3033->[::1]:41898: write: broken pipe
E0207 06:43:30.313134 11438 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:3033->[::1]:41900: write tcp6 [::1]:3033->[::1]:41900: write: broken pipe
E0207 06:43:30.316854 11438 portforward.go:381] error copying from remote stream to local connection: readfrom tcp6 [::1]:3033->[::1]:41912: write tcp6 [::1]:3033->[::1]:41912: write: broken pipe
E0207 06:44:00.251398 11438 portforward.go:347] error creating error stream for port 3033 -> 3000: Timeout occurred
E0207 06:44:00.259830 11438 portforward.go:347] error creating error stream for port 3033 -> 3000: Timeout occurred
E0207 06:44:00.305362 11438 portforward.go:347] error creating error stream for port 3033 -> 3000: Timeout occurred
But whether I went to the pod or service, it just hung
However, I could use the forwarder built into Kubecost to get there
$ kubectl port-forward svc/kubecost-cost-analyzer -n kubecost 9090:9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
Handling connection for 9090
Handling connection for 9090
Which actually gets us data beyond the paywalled two weeks
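For reference, the path I take: with the cost-analyzer forward running, Grafana is served under the Kubecost frontend in my install. The exact proxy path may vary by chart version, so treat this as a sketch:
$ kubectl port-forward svc/kubecost-cost-analyzer -n kubecost 9090:9090
# then browse to http://localhost:9090/grafana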
Summary
We started by fixing a stuck Prometheus that was preventing Kubecost from collecting new data. Once that was fixed, we easily upgraded to the latest version. We explored email and Slack alerts and, by virtue of an email integration, PagerDuty alerts as well. Given the minor limitations of the free (community) version, I also touched on OpenCost and upgrading it as well.
I still think having both Kubecost and OpenCost in a cluster is really handy for performance tuning. The fact that they can share a metrics collector is a bonus. I love the Grafana backend to Kubecost; not that the UI is bad, but getting right into Grafana makes it even better for extensibility and graphing.
For example, I could pull Resource Efficiency graphs for a specific namespace and period, which gives a much more detailed view of the data
Or pod utilization metrics for whatever time range I desire, by virtue of the Grafana forwarder on Kubecost
$ kubectl port-forward svc/kubecost-cost-analyzer -n kubecost 9090:9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
Handling connection for 9090