Maintaining A Healthy Kubernetes Cluster with Datadog

Published: Sep 22, 2021 by Isaac Johnson

I make no bones about being a Datadog fan. I use it personally. Recently I had some health issues with my cluster and leveraged Datadog to help debug and solve them. Since I think this is a pretty common use case for Ops and DevOps admins, I wanted to detail how we use Datadog, not as a developer in this case, but as an Infra admin.

How can we pinpoint the choke points and root out the problems that can degrade a cluster or worse, risk putting a cluster into a fail state?

We’ll look into reports, solve some of the issues, then verify our solution. Lastly, we’ll show how to update the DD agent in the cluster to enable new features and set up Anomaly alerts to catch these problems going forward.

Danger Will Robinson, Danger.

The first thing I noticed in my dashboard was the red indicator on my main node. Actually, to be honest, the first thing I noticed was the loud fan cranking away on the MacBook as I went into my home office. After a few days, I thought, “golly, that doesn’t sound too good” and fired up my DD Dashboard to investigate.

/content/images/2021/09/datadog-findings-01.png

We can hop into details of the problematic node and see that it’s rather high on CPU usage. It seems to have really taken off around the start of the month.

/content/images/2021/09/datadog-findings-02.png

If we were managing a production system, we might want to check events or deployments around that time to see if we have a scaling or deployment issue.

My next path for debugging was to check the pods by namespace and the most CPU intensive ones.

/content/images/2021/09/datadog-findings-03.png
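
As a CLI cross-check, metrics-server is running on this cluster, so kubectl top can give a rough equivalent of that view (a sketch, assuming a kubectl recent enough to support --sort-by on top):

$ kubectl top pods --all-namespaces --sort-by=cpu | head -n 15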

Pro-Tip: We can also bring up these dashboards on our phones with the Datadog mobile app when on the go:

/content/images/2021/09/datadog-findings-31.jpg

I wanted to check the error logs. Sometimes a noisy bugger is throwing errors:

/content/images/2021/09/datadog-findings-04.png

unmerged state	{"error": "failed to merge label 'envoy_response_code', values are not equal: '200' != '503'", "errorVerbose": "failed to merge label 'envoy_response_code', values are not equal: '200' != '503'
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/metrics.mergeLabels
\t/home/circleci/project/app/kuma-dp/pkg/dataplane/metrics/merge.go:103
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/metrics.merge
\t/home/circleci/project/app/kuma-dp/pkg/dataplane/metrics/merge.go:63
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/metrics.MergeClusters
\t/home/circleci/project/app/kuma-dp/pkg/dataplane/metrics/merge.go:39
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/metrics.(*Hijacker).ServeHTTP
\t/home/circleci/project/app/kuma-dp/pkg/dataplane/metrics/server.go:109
net/http.serverHandler.ServeHTTP
\t/home/circleci/go/src/net/http/server.go:2887
net/http.(*conn).serve
\t/home/circleci/go/src/net/http/server.go:1952
runtime.goexit
\t/home/circleci/go/src/runtime/asm_amd64.s:1371"}

The more I dug, the more I saw that a large chunk of the noise came from Prometheus, which we had deployed with Kuma mesh. Since I’m not actively using the service mesh and don’t plan to, the best option was to remove it.
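
Before tearing it out, it’s worth a quick sanity check of what the metrics install actually put in the cluster (the kuma-metrics namespace name shows up in the install output below):

$ kubectl get all -n kuma-metrics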

Removing Kuma Mesh

First I needed to redownload the binary.

Downloading kuma-1.2.3 to match our deployment (see blog):

builder@DESKTOP-72D2D9T:~$ mv /mnt/c/Users/isaac/Downloads/kuma-1.2.3-ubuntu-amd64.tar.gz ./
builder@DESKTOP-72D2D9T:~$ tar -xzvf kuma-1.2.3-ubuntu-amd64.tar.gz
./
./kuma-1.2.3/
./kuma-1.2.3/README
./kuma-1.2.3/NOTICE
./kuma-1.2.3/install_missing_crds.sh
./kuma-1.2.3/NOTICE-kumactl
./kuma-1.2.3/LICENSE
./kuma-1.2.3/conf/
./kuma-1.2.3/conf/kuma-cp.conf.yml
./kuma-1.2.3/bin/
./kuma-1.2.3/bin/kuma-dp
./kuma-1.2.3/bin/kuma-prometheus-sd
./kuma-1.2.3/bin/kuma-cp
./kuma-1.2.3/bin/envoy
./kuma-1.2.3/bin/kumactl
./kuma-1.2.3/bin/coredns
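
As an aside, if the tarball isn’t still sitting in Downloads, Kuma’s installer script can fetch a pinned version instead; roughly (check kuma.io for the current download instructions):

$ curl -L https://kuma.io/installer.sh | VERSION=1.2.3 sh -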

Now, as you recall, we installed the metrics portion of Kuma mesh with an apply:

$ ./kuma-1.2.3/bin/kumactl install metrics | kubectl apply -f -
namespace/kuma-metrics created
podsecuritypolicy.policy/grafana created
serviceaccount/prometheus-alertmanager created
serviceaccount/prometheus-kube-state-metrics created
serviceaccount/prometheus-node-exporter created
serviceaccount/prometheus-pushgateway created
serviceaccount/prometheus-server created
serviceaccount/grafana created
configmap/grafana created
configmap/prometheus-alertmanager created
configmap/provisioning-datasource created
configmap/provisioning-dashboards created
configmap/prometheus-server created
configmap/provisioning-dashboards-0 created
configmap/provisioning-dashboards-1 created
configmap/provisioning-dashboards-2 created
configmap/provisioning-dashboards-3 created
configmap/provisioning-dashboards-4 created
persistentvolumeclaim/prometheus-alertmanager created
persistentvolumeclaim/prometheus-server created
clusterrole.rbac.authorization.k8s.io/grafana-clusterrole created
Warning: rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
clusterrole.rbac.authorization.k8s.io/prometheus-alertmanager created
clusterrole.rbac.authorization.k8s.io/prometheus-kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/prometheus-pushgateway created
clusterrole.rbac.authorization.k8s.io/prometheus-server created
clusterrolebinding.rbac.authorization.k8s.io/grafana-clusterrolebinding created
Warning: rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
clusterrolebinding.rbac.authorization.k8s.io/prometheus-alertmanager created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-pushgateway created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-server created
Warning: rbac.authorization.k8s.io/v1beta1 Role is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 Role
role.rbac.authorization.k8s.io/grafana created
Warning: rbac.authorization.k8s.io/v1beta1 RoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 RoleBinding
rolebinding.rbac.authorization.k8s.io/grafana created
service/grafana created
service/prometheus-alertmanager created
service/prometheus-kube-state-metrics created
service/prometheus-node-exporter created
service/prometheus-pushgateway created
service/prometheus-server created
daemonset.apps/prometheus-node-exporter created
deployment.apps/grafana created
deployment.apps/prometheus-alertmanager created
deployment.apps/prometheus-kube-state-metrics created
deployment.apps/prometheus-pushgateway created
deployment.apps/prometheus-server created

Let’s do the reverse:

$   ./kuma-1.2.3/bin/kumactl install metrics | kubectl delete -f -
WARNING: Unable to confirm the server supports this kumactl version
namespace "kuma-metrics" deleted
podsecuritypolicy.policy "grafana" deleted
serviceaccount "prometheus-alertmanager" deleted
serviceaccount "prometheus-kube-state-metrics" deleted
serviceaccount "prometheus-node-exporter" deleted
serviceaccount "prometheus-pushgateway" deleted
serviceaccount "prometheus-server" deleted
serviceaccount "grafana" deleted
configmap "grafana" deleted
configmap "prometheus-alertmanager" deleted
configmap "provisioning-datasource" deleted
configmap "provisioning-dashboards" deleted
configmap "prometheus-server" deleted
configmap "provisioning-dashboards-0" deleted
configmap "provisioning-dashboards-1" deleted
configmap "provisioning-dashboards-2" deleted
configmap "provisioning-dashboards-3" deleted
configmap "provisioning-dashboards-4" deleted
persistentvolumeclaim "prometheus-alertmanager" deleted
persistentvolumeclaim "prometheus-server" deleted
clusterrole.rbac.authorization.k8s.io "grafana-clusterrole" deleted
Warning: rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
clusterrole.rbac.authorization.k8s.io "prometheus-alertmanager" deleted
clusterrole.rbac.authorization.k8s.io "prometheus-kube-state-metrics" deleted
clusterrole.rbac.authorization.k8s.io "prometheus-pushgateway" deleted
clusterrole.rbac.authorization.k8s.io "prometheus-server" deleted
clusterrolebinding.rbac.authorization.k8s.io "grafana-clusterrolebinding" deleted
Warning: rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
clusterrolebinding.rbac.authorization.k8s.io "prometheus-alertmanager" deleted
clusterrolebinding.rbac.authorization.k8s.io "prometheus-kube-state-metrics" deleted
clusterrolebinding.rbac.authorization.k8s.io "prometheus-pushgateway" deleted
clusterrolebinding.rbac.authorization.k8s.io "prometheus-server" deleted
Warning: rbac.authorization.k8s.io/v1beta1 Role is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 Role
role.rbac.authorization.k8s.io "grafana" deleted
Warning: rbac.authorization.k8s.io/v1beta1 RoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 RoleBinding
rolebinding.rbac.authorization.k8s.io "grafana" deleted
service "grafana" deleted
service "prometheus-alertmanager" deleted
service "prometheus-kube-state-metrics" deleted
service "prometheus-node-exporter" deleted
service "prometheus-pushgateway" deleted
service "prometheus-server" deleted
daemonset.apps "prometheus-node-exporter" deleted
deployment.apps "grafana" deleted
deployment.apps "prometheus-alertmanager" deleted
deployment.apps "prometheus-kube-state-metrics" deleted
deployment.apps "prometheus-pushgateway" deleted
deployment.apps "prometheus-server" deleted

We can see the results right away in the graphs on the left: a decrease in pods per namespace and per node:

/content/images/2021/09/datadog-findings-05.png

I checked the main node right after:

$ kubectl describe node isaac-macbookair
Name:               isaac-macbookair
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    k3s.io/hostname=isaac-macbookair
                    k3s.io/internal-ip=192.168.1.77
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=isaac-macbookair
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"22:ee:7b:23:07:17"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.1.77
                    k3s.io/node-args: ["server","--disable","traefik"]
                    k3s.io/node-config-hash: OHODIJ63LAVZLSEKQXPLUWENVVYMVG7OEAXDSRLXOHSZVZ5RV3DQ====
                    k3s.io/node-env:
                      {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/b9574be94e4edbdbb93a39a2cb1f4e4df3ba699171a8b86863d1e8c421c91f63","K3S_TOKEN":"********"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sat, 26 Dec 2020 08:55:04 -0600
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  isaac-macbookair
  AcquireTime:     <unset>
  RenewTime:       Mon, 13 Sep 2021 22:00:06 -0500
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 14 May 2021 17:51:08 -0500   Fri, 14 May 2021 17:51:08 -0500   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   Mon, 13 Sep 2021 21:59:19 -0500   Wed, 01 Sep 2021 21:01:26 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 13 Sep 2021 21:59:19 -0500   Wed, 01 Sep 2021 21:01:26 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 13 Sep 2021 21:59:19 -0500   Wed, 01 Sep 2021 21:01:26 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 13 Sep 2021 21:59:19 -0500   Thu, 02 Sep 2021 03:54:21 -0500   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.1.77
  Hostname:    isaac-macbookair
Capacity:
  cpu:                4
  ephemeral-storage:  115587104Ki
  hugepages-2Mi:      0
  memory:             8057256Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  112443134684
  hugepages-2Mi:      0
  memory:             8057256Ki
  pods:               110
System Info:
  Machine ID:                 6c625fc1db184a97b858c78a095cef50
  System UUID:                4a9a013c-6cfe-6857-95bf-2be65a68f669
  Boot ID:                    29fdb914-29ea-4b5e-a9ae-cf3bd37ff5af
  Kernel Version:             5.4.0-58-generic
  OS Image:                   Ubuntu 20.04.1 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.4.3-k3s1
  Kubelet Version:            v1.19.5+k3s2
  Kube-Proxy Version:         v1.19.5+k3s2
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
ProviderID:                   k3s://isaac-macbookair
Non-terminated Pods:          (53 in total)
  Namespace                   Name                                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                     ------------  ----------  ---------------  -------------  ---
  default                     svclb-react-form-dpklg                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         158d
  default                     perl-debugger-5967f99ff6-c28zr                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     nodesqswatcher-deployment-67b65f645f-c8vd6               0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     mongo-x86-release-mongodb-7cb86d48f8-lcvb2               0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  standuptime                 vote-back-myrelease1346-5b56967cd4-d6gkt                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     dapr-dashboard-6ff6f44778-qrdnd                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  standuptime                 vote-front-myrelease1344-5f74d8468d-4rv29                250m (6%)     500m (12%)  0 (0%)           0 (0%)         112d
  default                     cm-acme-http-solver-6smb9                                10m (0%)      100m (2%)   64Mi (0%)        64Mi (0%)      98d
  testing                     azdo-nomad-deployment-5478fd77c5-r58tc                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     vote-back-azure-vote-1608995981-5df9f78fd8-fxpsp         0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  standuptime                 azdo-chefdemo-deployment-568fb9f75f-np5sz                0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  default                     harbor-registry-harbor-notary-server-779c6bddd5-rcdwz    0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  default                     node-subscriber-5c75bfc99d-h5cg4                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     daprtweeter-deployment-68b49d6856-mg7bq                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     perl-subscriber-57778545dc-bhlrg                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     python-subscriber-7d7694f594-9b8zm                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     kubewatch-5d466cffc8-wlkng                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         89d
  standuptime                 vote-front-myrelease1346-7d5b47c944-c8v8v                250m (6%)     500m (12%)  0 (0%)           0 (0%)         112d
  default                     busybox-6c446876c6-qvpfv                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
  standuptime                 vote-front-mybluerelease-5c86c7bcd8-7qc5l                250m (6%)     500m (12%)  0 (0%)           0 (0%)         49d
  default                     vote-front-azure-vote-1608995981-588948447b-w4s4q        250m (6%)     500m (12%)  0 (0%)           0 (0%)         49d
  kube-system                 metrics-server-7b4f8b595-b4h2q                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     redis-master-0                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
  standuptime                 vote-front-myrelease1347-584bdb996d-vzlmr                250m (6%)     500m (12%)  0 (0%)           0 (0%)         43d
  standuptime                 vote-back-mybluerelease-589c6cdb8d-fsvwk                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     harbor-registry-harbor-notary-signer-c97648889-gr6rd     0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  standuptime                 vote-back-myrelease1347-b5484dbc4-j5hc8                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  standuptime                 vote-back-myrelease1344-5f675b5fb4-q75rj                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
  default                     dapr-workflows-host-85d9d74777-mcq7n                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     docker-registry-6d9dc74c67-8vp68                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  kube-system                 coredns-66c464876b-qkk6t                                 100m (2%)     0 (0%)      70Mi (0%)        170Mi (2%)     112d
  istio-system                kiali-5bfbd64d59-m5sfj                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         42d
  default                     nfs-client-provisioner-d54c468c4-vwzln                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  kube-system                 local-path-provisioner-7ff9579c6-rbrqb                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
  default                     dapr-sidecar-injector-56b8954855-sqn8n                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  cert-manager                cert-manager-86548b886-vhtwv                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     dapr-sentry-958fdd984-7cwrr                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         99d
  cert-manager                cert-manager-webhook-578954cdd-9mhqz                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  cert-manager                cert-manager-cainjector-6d59c8d4f7-tbp6k                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  default                     dapr-operator-7867c79bf9-59ntm                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     otel-collector-67f645b9b7-8mkr6                          200m (5%)     1 (25%)     400Mi (5%)       2Gi (26%)      49d
  default                     k0s-master-76d596b48f-cq4gw                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  default                     harbor-registry-harbor-exporter-655dd658bb-q5jhn         0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  default                     my-release-ingress-nginx-controller-7978f85c6f-jcqhs     100m (2%)     0 (0%)      90Mi (1%)        0 (0%)         43d
  default                     harbor-registry-harbor-portal-76bdcc7969-2q8xv           0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     redis-slave-1                                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
  default                     harbor-registry-harbor-core-7b4594d78d-9x22l             0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     react-form-79c7989844-xj728                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         99d
  default                     nodeeventwatcher-deployment-6dddc4858c-vjffp             0 (0%)        0 (0%)      0 (0%)           0 (0%)         85d
  default                     datadogrelease-c49d5                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         20d
  default                     echoapp-c56bfd446-m8t9x                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         96d
  local-path-storage          local-path-provisioner-5696dbb894-4n2wj                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
  standuptime                 azdo-newagent-deployment-5b6b489cb6-nls4b                0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1660m (41%)  3600m (90%)
  memory             624Mi (7%)   2282Mi (29%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
Events:              <none>

And then again, five hours later:

$ kubectl describe node isaac-macbookair
Name:               isaac-macbookair
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    k3s.io/hostname=isaac-macbookair
                    k3s.io/internal-ip=192.168.1.77
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=isaac-macbookair
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"22:ee:7b:23:07:17"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.1.77
                    k3s.io/node-args: ["server","--disable","traefik"]
                    k3s.io/node-config-hash: OHODIJ63LAVZLSEKQXPLUWENVVYMVG7OEAXDSRLXOHSZVZ5RV3DQ====
                    k3s.io/node-env:
                      {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/b9574be94e4edbdbb93a39a2cb1f4e4df3ba699171a8b86863d1e8c421c91f63","K3S_TOKEN":"********"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sat, 26 Dec 2020 08:55:04 -0600
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  isaac-macbookair
  AcquireTime:     <unset>
  RenewTime:       Tue, 14 Sep 2021 06:40:16 -0500
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 14 May 2021 17:51:08 -0500   Fri, 14 May 2021 17:51:08 -0500   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   Tue, 14 Sep 2021 06:38:43 -0500   Wed, 01 Sep 2021 21:01:26 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 14 Sep 2021 06:38:43 -0500   Wed, 01 Sep 2021 21:01:26 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 14 Sep 2021 06:38:43 -0500   Wed, 01 Sep 2021 21:01:26 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 14 Sep 2021 06:38:43 -0500   Thu, 02 Sep 2021 03:54:21 -0500   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.1.77
  Hostname:    isaac-macbookair
Capacity:
  cpu:                4
  ephemeral-storage:  115587104Ki
  hugepages-2Mi:      0
  memory:             8057256Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  112443134684
  hugepages-2Mi:      0
  memory:             8057256Ki
  pods:               110
System Info:
  Machine ID:                 6c625fc1db184a97b858c78a095cef50
  System UUID:                4a9a013c-6cfe-6857-95bf-2be65a68f669
  Boot ID:                    29fdb914-29ea-4b5e-a9ae-cf3bd37ff5af
  Kernel Version:             5.4.0-58-generic
  OS Image:                   Ubuntu 20.04.1 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.4.3-k3s1
  Kubelet Version:            v1.19.5+k3s2
  Kube-Proxy Version:         v1.19.5+k3s2
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
ProviderID:                   k3s://isaac-macbookair
Non-terminated Pods:          (53 in total)
  Namespace                   Name                                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                     ------------  ----------  ---------------  -------------  ---
  default                     svclb-react-form-dpklg                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         158d
  default                     perl-debugger-5967f99ff6-c28zr                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     nodesqswatcher-deployment-67b65f645f-c8vd6               0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     mongo-x86-release-mongodb-7cb86d48f8-lcvb2               0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  standuptime                 vote-back-myrelease1346-5b56967cd4-d6gkt                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     dapr-dashboard-6ff6f44778-qrdnd                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  standuptime                 vote-front-myrelease1344-5f74d8468d-4rv29                250m (6%)     500m (12%)  0 (0%)           0 (0%)         112d
  default                     cm-acme-http-solver-6smb9                                10m (0%)      100m (2%)   64Mi (0%)        64Mi (0%)      98d
  testing                     azdo-nomad-deployment-5478fd77c5-r58tc                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     vote-back-azure-vote-1608995981-5df9f78fd8-fxpsp         0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  standuptime                 azdo-chefdemo-deployment-568fb9f75f-np5sz                0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  default                     harbor-registry-harbor-notary-server-779c6bddd5-rcdwz    0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  default                     node-subscriber-5c75bfc99d-h5cg4                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     daprtweeter-deployment-68b49d6856-mg7bq                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     perl-subscriber-57778545dc-bhlrg                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     python-subscriber-7d7694f594-9b8zm                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     kubewatch-5d466cffc8-wlkng                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         89d
  standuptime                 vote-front-myrelease1346-7d5b47c944-c8v8v                250m (6%)     500m (12%)  0 (0%)           0 (0%)         112d
  default                     busybox-6c446876c6-qvpfv                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
  standuptime                 vote-front-mybluerelease-5c86c7bcd8-7qc5l                250m (6%)     500m (12%)  0 (0%)           0 (0%)         49d
  default                     vote-front-azure-vote-1608995981-588948447b-w4s4q        250m (6%)     500m (12%)  0 (0%)           0 (0%)         49d
  kube-system                 metrics-server-7b4f8b595-b4h2q                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     redis-master-0                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
  standuptime                 vote-front-myrelease1347-584bdb996d-vzlmr                250m (6%)     500m (12%)  0 (0%)           0 (0%)         43d
  standuptime                 vote-back-mybluerelease-589c6cdb8d-fsvwk                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     harbor-registry-harbor-notary-signer-c97648889-gr6rd     0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  standuptime                 vote-back-myrelease1347-b5484dbc4-j5hc8                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  standuptime                 vote-back-myrelease1344-5f675b5fb4-q75rj                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
  default                     dapr-workflows-host-85d9d74777-mcq7n                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     docker-registry-6d9dc74c67-8vp68                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  kube-system                 coredns-66c464876b-qkk6t                                 100m (2%)     0 (0%)      70Mi (0%)        170Mi (2%)     112d
  istio-system                kiali-5bfbd64d59-m5sfj                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         42d
  default                     nfs-client-provisioner-d54c468c4-vwzln                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  kube-system                 local-path-provisioner-7ff9579c6-rbrqb                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
  default                     dapr-sidecar-injector-56b8954855-sqn8n                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  cert-manager                cert-manager-86548b886-vhtwv                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     dapr-sentry-958fdd984-7cwrr                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         99d
  cert-manager                cert-manager-webhook-578954cdd-9mhqz                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  cert-manager                cert-manager-cainjector-6d59c8d4f7-tbp6k                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  default                     dapr-operator-7867c79bf9-59ntm                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     otel-collector-67f645b9b7-8mkr6                          200m (5%)     1 (25%)     400Mi (5%)       2Gi (26%)      49d
  default                     k0s-master-76d596b48f-cq4gw                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  default                     harbor-registry-harbor-exporter-655dd658bb-q5jhn         0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  default                     my-release-ingress-nginx-controller-7978f85c6f-jcqhs     100m (2%)     0 (0%)      90Mi (1%)        0 (0%)         43d
  default                     harbor-registry-harbor-portal-76bdcc7969-2q8xv           0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     redis-slave-1                                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
  default                     harbor-registry-harbor-core-7b4594d78d-9x22l             0 (0%)        0 (0%)      0 (0%)           0 (0%)         43d
  default                     react-form-79c7989844-xj728                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         99d
  default                     nodeeventwatcher-deployment-6dddc4858c-vjffp             0 (0%)        0 (0%)      0 (0%)           0 (0%)         85d
  default                     datadogrelease-c49d5                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         20d
  default                     echoapp-c56bfd446-m8t9x                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         96d
  standuptime                 azdo-newagent-deployment-5b6b489cb6-nls4b                0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  local-path-storage          local-path-provisioner-5696dbb894-4n2wj                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         112d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1660m (41%)  3600m (90%)
  memory             624Mi (7%)   2282Mi (29%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
Events:              <none>

I am still worried about that 90% CPU limit allocation.

This is a MacBook Air. It doesn’t exactly have “server class cooling”.
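
Note that describe shows requests and limits, not live usage. Since metrics-server is running, a quick spot-check of actual consumption on the hot node looks like this:

$ kubectl top node isaac-macbookair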

We can see from the list all the pods that carry non-trivial CPU requests and limits. This is my concern more than anything:

  Namespace                   Name                                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                     ------------  ----------  ---------------  -------------  ---
  standuptime                 vote-front-myrelease1344-5f74d8468d-4rv29                250m (6%)     500m (12%)  0 (0%)           0 (0%)         112d
  default                     cm-acme-http-solver-6smb9                                10m (0%)      100m (2%)   64Mi (0%)        64Mi (0%)      98d
  standuptime                 vote-front-myrelease1346-7d5b47c944-c8v8v                250m (6%)     500m (12%)  0 (0%)           0 (0%)         112d
  standuptime                 vote-front-mybluerelease-5c86c7bcd8-7qc5l                250m (6%)     500m (12%)  0 (0%)           0 (0%)         49d
  default                     vote-front-azure-vote-1608995981-588948447b-w4s4q        250m (6%)     500m (12%)  0 (0%)           0 (0%)         49d
  standuptime                 vote-front-myrelease1347-584bdb996d-vzlmr                250m (6%)     500m (12%)  0 (0%)           0 (0%)         43d
  kube-system                 coredns-66c464876b-qkk6t                                 100m (2%)     0 (0%)      70Mi (0%)        170Mi (2%)     112d
  default                     otel-collector-67f645b9b7-8mkr6                          200m (5%)     1 (25%)     400Mi (5%)       2Gi (26%)      49d
  default                     my-release-ingress-nginx-controller-7978f85c6f-jcqhs     100m (2%)     0 (0%)      90Mi (1%)        0 (0%)         43d

Some of these are important and I would rather not restart them. But the “vote” apps are of little importance and can be rotated.
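
If you just want to pull the rows with non-zero CPU requests out of the describe output, a crude grep works (a sketch; it keys off the “m (” millicore pattern, so it also catches the totals line):

$ kubectl describe node isaac-macbookair | grep "m ("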

More Cleanup

Let’s first try cycling the larger CPU consuming pods.

As an aside, we can get all the pods running just on this node with a field selector kubectl query:

$ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=isaac-macbookair
NAMESPACE            NAME                                                    READY   STATUS             RESTARTS   AGE    IP            NODE               NOMINATED NODE   READINESS GATES
default              svclb-react-form-dpklg                                  1/1     Running            0          164d   10.42.0.110   isaac-macbookair   <none>           <none>
default              perl-debugger-5967f99ff6-c28zr                          1/1     Running            0          50d    10.42.0.165   isaac-macbookair   <none>           <none>
default              nodesqswatcher-deployment-67b65f645f-c8vd6              1/1     Running            0          50d    10.42.0.163   isaac-macbookair   <none>           <none>
default              mongo-x86-release-mongodb-7cb86d48f8-lcvb2              1/1     Running            0          56d    10.42.0.147   isaac-macbookair   <none>           <none>
standuptime          vote-back-myrelease1346-5b56967cd4-d6gkt                1/1     Running            0          50d    10.42.0.160   isaac-macbookair   <none>           <none>
default              dapr-dashboard-6ff6f44778-qrdnd                         1/1     Running            0          50d    10.42.0.161   isaac-macbookair   <none>           <none>
standuptime          vote-front-myrelease1344-5f74d8468d-4rv29               1/1     Running            0          119d   10.42.0.105   isaac-macbookair   <none>           <none>
default              cm-acme-http-solver-6smb9                               1/1     Running            0          104d   10.42.0.119   isaac-macbookair   <none>           <none>
testing              azdo-nomad-deployment-5478fd77c5-r58tc                  1/1     Running            0          50d    10.42.0.178   isaac-macbookair   <none>           <none>
default              vote-back-azure-vote-1608995981-5df9f78fd8-fxpsp        1/1     Running            0          56d    10.42.0.145   isaac-macbookair   <none>           <none>
standuptime          azdo-chefdemo-deployment-568fb9f75f-np5sz               1/1     Running            0          56d    10.42.0.141   isaac-macbookair   <none>           <none>
default              harbor-registry-harbor-notary-server-779c6bddd5-rcdwz   1/1     Running            128        56d    10.42.0.146   isaac-macbookair   <none>           <none>
default              node-subscriber-5c75bfc99d-h5cg4                        1/1     Running            0          50d    10.42.0.162   isaac-macbookair   <none>           <none>
default              daprtweeter-deployment-68b49d6856-mg7bq                 1/1     Running            0          50d    10.42.0.174   isaac-macbookair   <none>           <none>
default              perl-subscriber-57778545dc-bhlrg                        1/1     Running            0          50d    10.42.0.176   isaac-macbookair   <none>           <none>
default              python-subscriber-7d7694f594-9b8zm                      1/1     Running            0          50d    10.42.0.177   isaac-macbookair   <none>           <none>
default              kubewatch-5d466cffc8-wlkng                              1/1     Running            0          95d    10.42.0.128   isaac-macbookair   <none>           <none>
standuptime          vote-front-myrelease1346-7d5b47c944-c8v8v               1/1     Running            0          119d   10.42.0.96    isaac-macbookair   <none>           <none>
default              busybox-6c446876c6-qvpfv                                1/1     Running            0          119d   10.42.0.104   isaac-macbookair   <none>           <none>
standuptime          vote-front-mybluerelease-5c86c7bcd8-7qc5l               1/1     Running            0          56d    10.42.0.151   isaac-macbookair   <none>           <none>
default              vote-front-azure-vote-1608995981-588948447b-w4s4q       1/1     Running            0          56d    10.42.0.143   isaac-macbookair   <none>           <none>
kube-system          metrics-server-7b4f8b595-b4h2q                          1/1     Running            21         50d    10.42.0.156   isaac-macbookair   <none>           <none>
default              redis-master-0                                          1/1     Running            0          119d   10.42.0.107   isaac-macbookair   <none>           <none>
standuptime          vote-front-myrelease1347-584bdb996d-vzlmr               1/1     Running            0          50d    10.42.0.170   isaac-macbookair   <none>           <none>
standuptime          vote-back-mybluerelease-589c6cdb8d-fsvwk                1/1     Running            0          50d    10.42.0.158   isaac-macbookair   <none>           <none>
default              harbor-registry-harbor-notary-signer-c97648889-gr6rd    1/1     Running            128        56d    10.42.0.153   isaac-macbookair   <none>           <none>
standuptime          vote-back-myrelease1347-b5484dbc4-j5hc8                 1/1     Running            0          56d    10.42.0.142   isaac-macbookair   <none>           <none>
standuptime          vote-back-myrelease1344-5f675b5fb4-q75rj                1/1     Running            0          119d   10.42.0.98    isaac-macbookair   <none>           <none>
default              dapr-workflows-host-85d9d74777-mcq7n                    1/1     Running            0          50d    10.42.0.171   isaac-macbookair   <none>           <none>
default              docker-registry-6d9dc74c67-8vp68                        1/1     Running            19         50d    10.42.0.168   isaac-macbookair   <none>           <none>
kube-system          coredns-66c464876b-qkk6t                                1/1     Running            16         119d   10.42.0.99    isaac-macbookair   <none>           <none>
istio-system         kiali-5bfbd64d59-m5sfj                                  1/1     Running            15         49d    10.42.0.179   isaac-macbookair   <none>           <none>
default              nfs-client-provisioner-d54c468c4-vwzln                  1/1     Running            10         56d    10.42.0.148   isaac-macbookair   <none>           <none>
kube-system          local-path-provisioner-7ff9579c6-rbrqb                  1/1     Running            54         119d   10.42.0.103   isaac-macbookair   <none>           <none>
default              dapr-sidecar-injector-56b8954855-sqn8n                  1/1     Running            47         50d    10.42.0.164   isaac-macbookair   <none>           <none>
cert-manager         cert-manager-86548b886-vhtwv                            1/1     Running            11         50d    10.42.0.172   isaac-macbookair   <none>           <none>
default              dapr-sentry-958fdd984-7cwrr                             1/1     Running            20         106d   10.42.0.116   isaac-macbookair   <none>           <none>
cert-manager         cert-manager-webhook-578954cdd-9mhqz                    1/1     Running            22         50d    10.42.0.173   isaac-macbookair   <none>           <none>
cert-manager         cert-manager-cainjector-6d59c8d4f7-tbp6k                1/1     Running            55         56d    10.42.0.144   isaac-macbookair   <none>           <none>
default              dapr-operator-7867c79bf9-59ntm                          1/1     Running            59         50d    10.42.0.157   isaac-macbookair   <none>           <none>
default              otel-collector-67f645b9b7-8mkr6                         1/1     Running            23         56d    10.42.0.155   isaac-macbookair   <none>           <none>
default              k0s-master-76d596b48f-cq4gw                             1/1     Running            1          56d    10.42.0.152   isaac-macbookair   <none>           <none>
default              harbor-registry-harbor-exporter-655dd658bb-q5jhn        1/1     Running            127        56d    10.42.0.149   isaac-macbookair   <none>           <none>
default              my-release-ingress-nginx-controller-7978f85c6f-jcqhs    1/1     Running            44         50d    10.42.0.169   isaac-macbookair   <none>           <none>
default              harbor-registry-harbor-portal-76bdcc7969-2q8xv          1/1     Running            12         50d    10.42.0.175   isaac-macbookair   <none>           <none>
default              redis-slave-1                                           1/1     Running            4          119d   10.42.0.109   isaac-macbookair   <none>           <none>
default              harbor-registry-harbor-core-7b4594d78d-9x22l            1/1     Running            1988       50d    10.42.0.166   isaac-macbookair   <none>           <none>
default              react-form-79c7989844-xj728                             2/2     Running            24         106d   10.42.0.117   isaac-macbookair   <none>           <none>
default              nodeeventwatcher-deployment-6dddc4858c-vjffp            2/2     Running            24         92d    10.42.0.129   isaac-macbookair   <none>           <none>
default              datadogrelease-c49d5                                    3/3     Running            4          27d    10.42.0.184   isaac-macbookair   <none>           <none>
default              echoapp-c56bfd446-m8t9x                                 2/2     Running            769        102d   10.42.0.122   isaac-macbookair   <none>           <none>
standuptime          azdo-newagent-deployment-5b6b489cb6-nls4b               0/1     CrashLoopBackOff   15735      56d    10.42.0.150   isaac-macbookair   <none>           <none>
local-path-storage   local-path-provisioner-5696dbb894-4n2wj                 0/1     CrashLoopBackOff   33587      119d   10.42.0.101   isaac-macbookair   <none>           <none>

Cleaning Failed Pods/Deployments

The other thing I see as an issue is the pair of failing pods:

standuptime          azdo-newagent-deployment-5b6b489cb6-nls4b               0/1     CrashLoopBackOff   15735      56d    10.42.0.150   isaac-macbookair   <none>           <none>
local-path-storage   local-path-provisioner-5696dbb894-4n2wj                 0/1     CrashLoopBackOff   33587      119d   10.42.0.101   isaac-macbookair   <none>           <none>
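
Before deleting anything, it’s worth a quick look at why they keep crashing; the usual checks (pod names from above):

$ kubectl describe pod local-path-provisioner-5696dbb894-4n2wj -n local-path-storage
$ kubectl logs local-path-provisioner-5696dbb894-4n2wj -n local-path-storage --previous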

We can see we already have a local-path-provisioner running in kube-system:

$ kubectl get pods --all-namespaces | grep local-path
kube-system          local-path-provisioner-7ff9579c6-rbrqb                  1/1     Running             54         119d
local-path-storage   local-path-provisioner-5696dbb894-4n2wj                 0/1     CrashLoopBackOff    33589      119d

Looks like there is an app label on the failing one:

$ kubectl describe pod local-path-provisioner-5696dbb894-4n2wj -n local-path-storage | grep app
Labels:       app=local-path-provisioner

Let’s clean up that one, since it is clearly unused and just failing. I like to back up deployments before deleting them.

$ kubectl get deployments -n local-path-storage
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
local-path-provisioner   0/1     1            0           261d

$ kubectl get deployment local-path-provisioner -n local-path-storage -o yaml > lpp-dep-local-path-storage.yaml
$ kubectl get cm local-path-config -n local-path-storage -o yaml > cm-lpc-local-path-storage.yaml

$ kubectl delete deployment  local-path-provisioner -n local-path-storage
deployment.apps "local-path-provisioner" deleted

Let’s remove that AzDO agent. It’s not working and not needed right now. When it comes time to redeploy, I’ll likely use a fresh PAT and a Helm deploy.

A quick check on Helm deploys there first:

$ helm list -n standuptime
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                                                                                                                           APP VERSION
mybluerelease   standuptime     4               2021-03-03 02:46:19.788962887 +0000 UTC deployed        azure-vote-0.1.1
myrelease1344   standuptime     1               2021-03-03 02:58:52.295009216 +0000 UTC deployed        azure-vote-0.1.1
myrelease1346   standuptime     1               2021-03-03 03:11:30.336257423 +0000 UTC deployed        azure-vote-0.1.1
myrelease1347   standuptime     1               2021-03-03 03:13:25.38505445 +0000 UTC  deployed        azure-vote-0.1.1

It must have been done as a YAML deployment:

$ kubectl get deployment -n standuptime | grep azdo
azdo-chefdemo-deployment   1/1     1            1           216d
azdo-newagent-deployment   0/1     1            0           256d

Then back up and delete:

$ kubectl get deployment azdo-newagent-deployment -n standuptime -o yaml > anad-standuptime.yaml
$ kubectl delete deployment azdo-newagent-deployment -n standuptime
deployment.apps "azdo-newagent-deployment" deleted

Checking the pods on the overloaded node, we see the failed ones are gone now:

$ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=isaac-macbookair
NAMESPACE      NAME                                                    READY   STATUS    RESTARTS   AGE    IP            NODE               NOMINATED NODE   READINESS GATES
default        svclb-react-form-dpklg                                  1/1     Running   0          164d   10.42.0.110   isaac-macbookair   <none>           <none>
default        perl-debugger-5967f99ff6-c28zr                          1/1     Running   0          50d    10.42.0.165   isaac-macbookair   <none>           <none>
default        nodesqswatcher-deployment-67b65f645f-c8vd6              1/1     Running   0          50d    10.42.0.163   isaac-macbookair   <none>           <none>
default        mongo-x86-release-mongodb-7cb86d48f8-lcvb2              1/1     Running   0          56d    10.42.0.147   isaac-macbookair   <none>           <none>
standuptime    vote-back-myrelease1346-5b56967cd4-d6gkt                1/1     Running   0          50d    10.42.0.160   isaac-macbookair   <none>           <none>
default        dapr-dashboard-6ff6f44778-qrdnd                         1/1     Running   0          50d    10.42.0.161   isaac-macbookair   <none>           <none>
standuptime    vote-front-myrelease1344-5f74d8468d-4rv29               1/1     Running   0          119d   10.42.0.105   isaac-macbookair   <none>           <none>
default        cm-acme-http-solver-6smb9                               1/1     Running   0          104d   10.42.0.119   isaac-macbookair   <none>           <none>
testing        azdo-nomad-deployment-5478fd77c5-r58tc                  1/1     Running   0          50d    10.42.0.178   isaac-macbookair   <none>           <none>
default        vote-back-azure-vote-1608995981-5df9f78fd8-fxpsp        1/1     Running   0          56d    10.42.0.145   isaac-macbookair   <none>           <none>
standuptime    azdo-chefdemo-deployment-568fb9f75f-np5sz               1/1     Running   0          56d    10.42.0.141   isaac-macbookair   <none>           <none>
default        harbor-registry-harbor-notary-server-779c6bddd5-rcdwz   1/1     Running   128        56d    10.42.0.146   isaac-macbookair   <none>           <none>
default        node-subscriber-5c75bfc99d-h5cg4                        1/1     Running   0          50d    10.42.0.162   isaac-macbookair   <none>           <none>
default        daprtweeter-deployment-68b49d6856-mg7bq                 1/1     Running   0          50d    10.42.0.174   isaac-macbookair   <none>           <none>
default        perl-subscriber-57778545dc-bhlrg                        1/1     Running   0          50d    10.42.0.176   isaac-macbookair   <none>           <none>
default        python-subscriber-7d7694f594-9b8zm                      1/1     Running   0          50d    10.42.0.177   isaac-macbookair   <none>           <none>
default        kubewatch-5d466cffc8-wlkng                              1/1     Running   0          95d    10.42.0.128   isaac-macbookair   <none>           <none>
standuptime    vote-front-myrelease1346-7d5b47c944-c8v8v               1/1     Running   0          119d   10.42.0.96    isaac-macbookair   <none>           <none>
default        busybox-6c446876c6-qvpfv                                1/1     Running   0          119d   10.42.0.104   isaac-macbookair   <none>           <none>
standuptime    vote-front-mybluerelease-5c86c7bcd8-7qc5l               1/1     Running   0          56d    10.42.0.151   isaac-macbookair   <none>           <none>
default        vote-front-azure-vote-1608995981-588948447b-w4s4q       1/1     Running   0          56d    10.42.0.143   isaac-macbookair   <none>           <none>
kube-system    metrics-server-7b4f8b595-b4h2q                          1/1     Running   21         50d    10.42.0.156   isaac-macbookair   <none>           <none>
default        redis-master-0                                          1/1     Running   0          119d   10.42.0.107   isaac-macbookair   <none>           <none>
standuptime    vote-front-myrelease1347-584bdb996d-vzlmr               1/1     Running   0          50d    10.42.0.170   isaac-macbookair   <none>           <none>
standuptime    vote-back-mybluerelease-589c6cdb8d-fsvwk                1/1     Running   0          50d    10.42.0.158   isaac-macbookair   <none>           <none>
default        harbor-registry-harbor-notary-signer-c97648889-gr6rd    1/1     Running   128        56d    10.42.0.153   isaac-macbookair   <none>           <none>
standuptime    vote-back-myrelease1347-b5484dbc4-j5hc8                 1/1     Running   0          56d    10.42.0.142   isaac-macbookair   <none>           <none>
standuptime    vote-back-myrelease1344-5f675b5fb4-q75rj                1/1     Running   0          119d   10.42.0.98    isaac-macbookair   <none>           <none>
default        dapr-workflows-host-85d9d74777-mcq7n                    1/1     Running   0          50d    10.42.0.171   isaac-macbookair   <none>           <none>
default        docker-registry-6d9dc74c67-8vp68                        1/1     Running   19         50d    10.42.0.168   isaac-macbookair   <none>           <none>
kube-system    coredns-66c464876b-qkk6t                                1/1     Running   16         119d   10.42.0.99    isaac-macbookair   <none>           <none>
istio-system   kiali-5bfbd64d59-m5sfj                                  1/1     Running   15         49d    10.42.0.179   isaac-macbookair   <none>           <none>
default        nfs-client-provisioner-d54c468c4-vwzln                  1/1     Running   10         56d    10.42.0.148   isaac-macbookair   <none>           <none>
kube-system    local-path-provisioner-7ff9579c6-rbrqb                  1/1     Running   54         119d   10.42.0.103   isaac-macbookair   <none>           <none>
default        dapr-sidecar-injector-56b8954855-sqn8n                  1/1     Running   47         50d    10.42.0.164   isaac-macbookair   <none>           <none>
cert-manager   cert-manager-86548b886-vhtwv                            1/1     Running   11         50d    10.42.0.172   isaac-macbookair   <none>           <none>
default        dapr-sentry-958fdd984-7cwrr                             1/1     Running   20         106d   10.42.0.116   isaac-macbookair   <none>           <none>
cert-manager   cert-manager-webhook-578954cdd-9mhqz                    1/1     Running   22         50d    10.42.0.173   isaac-macbookair   <none>           <none>
cert-manager   cert-manager-cainjector-6d59c8d4f7-tbp6k                1/1     Running   55         56d    10.42.0.144   isaac-macbookair   <none>           <none>
default        dapr-operator-7867c79bf9-59ntm                          1/1     Running   59         50d    10.42.0.157   isaac-macbookair   <none>           <none>
default        otel-collector-67f645b9b7-8mkr6                         1/1     Running   23         56d    10.42.0.155   isaac-macbookair   <none>           <none>
default        k0s-master-76d596b48f-cq4gw                             1/1     Running   1          56d    10.42.0.152   isaac-macbookair   <none>           <none>
default        harbor-registry-harbor-exporter-655dd658bb-q5jhn        1/1     Running   127        56d    10.42.0.149   isaac-macbookair   <none>           <none>
default        my-release-ingress-nginx-controller-7978f85c6f-jcqhs    1/1     Running   44         50d    10.42.0.169   isaac-macbookair   <none>           <none>
default        harbor-registry-harbor-portal-76bdcc7969-2q8xv          1/1     Running   12         50d    10.42.0.175   isaac-macbookair   <none>           <none>
default        redis-slave-1                                           1/1     Running   4          119d   10.42.0.109   isaac-macbookair   <none>           <none>
default        harbor-registry-harbor-core-7b4594d78d-9x22l            1/1     Running   1988       50d    10.42.0.166   isaac-macbookair   <none>           <none>
default        react-form-79c7989844-xj728                             2/2     Running   24         106d   10.42.0.117   isaac-macbookair   <none>           <none>
default        nodeeventwatcher-deployment-6dddc4858c-vjffp            2/2     Running   24         92d    10.42.0.129   isaac-macbookair   <none>           <none>
default        datadogrelease-c49d5                                    3/3     Running   4          27d    10.42.0.184   isaac-macbookair   <none>           <none>
default        echoapp-c56bfd446-m8t9x                                 2/2     Running   769        102d   10.42.0.122   isaac-macbookair   <none>           <none>

Rotating Pods

We already determined the heavy load comes from these:

  standuptime                 vote-front-myrelease1344-5f74d8468d-4rv29                250m (6%)     500m (12%)  0 (0%)           0 (0%)         112d
  default                     cm-acme-http-solver-6smb9                                10m (0%)      100m (2%)   64Mi (0%)        64Mi (0%)      98d
  standuptime                 vote-front-myrelease1346-7d5b47c944-c8v8v                250m (6%)     500m (12%)  0 (0%)           0 (0%)         112d
  standuptime                 vote-front-mybluerelease-5c86c7bcd8-7qc5l                250m (6%)     500m (12%)  0 (0%)           0 (0%)         49d
  default                     vote-front-azure-vote-1608995981-588948447b-w4s4q        250m (6%)     500m (12%)  0 (0%)           0 (0%)         49d
  standuptime                 vote-front-myrelease1347-584bdb996d-vzlmr                250m (6%)     500m (12%)  0 (0%)           0 (0%)         43d
  kube-system                 coredns-66c464876b-qkk6t                                 100m (2%)     0 (0%)      70Mi (0%)        170Mi (2%)     112d
  default                     otel-collector-67f645b9b7-8mkr6                          200m (5%)     1 (25%)     400Mi (5%)       2Gi (26%)      49d
  default                     my-release-ingress-nginx-controller-7978f85c6f-jcqhs     100m (2%)     0 (0%)      90Mi (1%)        0 (0%)         43d
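
For reference, that per-pod requests/limits view comes straight from describing the node; a rough sketch of how to pull it (the -A 40 window is just an approximation of the section length):

$ kubectl describe node isaac-macbookair | grep -A 40 "Non-terminated Pods"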

We can easily craft a one-liner to cycle the pods that aren't critical.

We can see the full set of pods in the standuptime namespace on this host like so:

$ for pod in $(kubectl get pods -n standuptime --field-selector spec.nodeName=isaac-macbookair --output=jsonpath={.items..metadata.name}); do echo $pod; done
vote-back-myrelease1346-5b56967cd4-d6gkt
vote-front-myrelease1344-5f74d8468d-4rv29
azdo-chefdemo-deployment-568fb9f75f-np5sz
vote-front-myrelease1346-7d5b47c944-c8v8v
vote-front-mybluerelease-5c86c7bcd8-7qc5l
vote-front-myrelease1347-584bdb996d-vzlmr
vote-back-mybluerelease-589c6cdb8d-fsvwk
vote-back-myrelease1347-b5484dbc4-j5hc8
vote-back-myrelease1344-5f675b5fb4-q75rj

First, show all the vote-front pods in the standuptime namespace that are on this master node:

$ for pod in $(kubectl get pods -n standuptime --field-selector spec.nodeName=isaac-macbookair --output=jsonpath={.items..metadata.name}); do if [[ "$pod" == *"vote-front"* ]]; then echo $pod; fi; done
vote-front-myrelease1344-5f74d8468d-4rv29
vote-front-myrelease1346-7d5b47c944-c8v8v
vote-front-mybluerelease-5c86c7bcd8-7qc5l
vote-front-myrelease1347-584bdb996d-vzlmr

Now, in a similar fashion, bounce them all:

$ for pod in $(kubectl get pods -n standuptime --field-selector spec.nodeName=isaac-macbookair --output=jsonpath={.items..metadata.name}); do if [[ "$pod" == *"vote-front"* ]]; then kubectl delete pod $pod -n standuptime; fi; done
pod "vote-front-myrelease1344-5f74d8468d-4rv29" deleted
pod "vote-front-myrelease1346-7d5b47c944-c8v8v" deleted
pod "vote-front-mybluerelease-5c86c7bcd8-7qc5l" deleted
pod "vote-front-myrelease1347-584bdb996d-vzlmr" deleted

Now when we check the pods running in standuptime, it's just the backend pool and the chef demo:

$ for pod in $(kubectl get pods -n standuptime --field-selector spec.nodeName=isaac-macbookair --output=jsonpath={.items..metadata.name}); do echo $pod; done
vote-back-myrelease1346-5b56967cd4-d6gkt
azdo-chefdemo-deployment-568fb9f75f-np5sz
vote-back-mybluerelease-589c6cdb8d-fsvwk
vote-back-myrelease1347-b5484dbc4-j5hc8
vote-back-myrelease1344-5f675b5fb4-q75rj
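
As an aside, since these frontends are backed by Deployments, a rollout restart per Deployment is a less surgical but simpler way to cycle them. A sketch, assuming the Deployment names share the vote-front prefix seen in the pod names:

$ for dep in $(kubectl get deployments -n standuptime -o jsonpath='{.items[*].metadata.name}'); do if [[ "$dep" == *"vote-front"* ]]; then kubectl rollout restart deployment "$dep" -n standuptime; fi; done

Note that this restarts the Deployments cluster-wide rather than only the pods on one node, and the scheduler may well place the replacements right back on the same host unless it is cordoned first.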

Verification: Checking Health

How did we do? Did we release the pressure valve on our primary host?

We can see that we did indeed bring the CPU requests down to a much more reasonable level:

$ kubectl describe node isaac-macbookair | tail -n 8
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                660m (16%)  1600m (40%)
  memory             624Mi (7%)  2282Mi (29%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:              <none>
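
The describe output reflects requests and limits rather than live usage; since metrics-server is running in this cluster, kubectl top gives the actual numbers to cross-check:

$ kubectl top node isaac-macbookair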

Another thing I noticed in the Infra window was k0s running:

/content/images/2021/09/datadog-findings-06.png

I SSHed to the host thinking I must have installed it directly, but no service or binary existed on the host itself. A find on the name hinted that, months ago, I had apparently gone mad and tried to run k0s inside k3s:

$ sudo find / -name k0s -print
/run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/c52a1007f647a2a0dd1292fa4b155a2a7e1d15693c00ecb9a0dbaf547b4421f6/rootfs/run/k0s
/run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/c52a1007f647a2a0dd1292fa4b155a2a7e1d15693c00ecb9a0dbaf547b4421f6/rootfs/usr/local/bin/k0s
find: ‘/run/user/1000/doc’: Permission denied
find: ‘/run/user/1000/gvfs’: Permission denied
/var/lib/kubelet/pods/e4da904e-ffd6-4014-ae22-eb7382009132/volumes/kubernetes.io~empty-dir/k0smount/k0s
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/29409/fs/run/k0s
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/57514/fs/usr/local/bin/k0s
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/57523/fs/run/k0s
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/255/fs/usr/local/bin/k0s
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/29404/fs/usr/local/bin/k0s
find: ‘/proc/2330753’: No such file or directory
find: ‘/proc/2330990’: No such file or directory

And indeed I found that deployment in the default namespace:

$ kubectl get deployments | grep k0s
k0s-master                             1/1     1            1           178d

Since that is wasting a lot of resources (and I know I’m not using it), let’s clean that up too.

$ kubectl get deployment k0s-master -o yaml > k0sm-default.yaml
$ kubectl delete deployment k0s-master
deployment.apps "k0s-master" deleted

Updating DD Agent

I noticed that I was not collecting process details on my hosts.

/content/images/2021/09/datadog-findings-07.png

This is easily remedied by following the steps here.

$ helm get values datadogrelease -o yaml
USER-SUPPLIED VALUES: null
clusterAgent:
  enabled: true
  metricsProvider:
    enabled: true
datadog:
  apiKeyExistingSecret: dd-secret
  apm:
    enabled: true
    port: 8126
  appKey: asdfasdfasdfasdfasdfasdfasdfasdfasdf
  logs:
    containerCollectAll: true
    enabled: true

$ helm get values datadogrelease -o yaml > ddvalues.yaml

# add to the end
$ cat ddvalues.yaml | tail -n6
  logs:
    containerCollectAll: true
    enabled: true
  processAgent:
    processCollection: true
    enabled: true
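
If you prefer not to hand-edit the file, one way to append that block is a heredoc (this assumes the values file ends with the datadog: section shown above, so the two-space indent nests processAgent under it):

$ cat <<'EOF' >> ddvalues.yaml
  processAgent:
    enabled: true
    processCollection: true
EOF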

Our former chart version was 2.20.1:

$ helm list | grep datadog
datadogrelease          default         3               2021-08-17 13:50:12.866989275 -0500 CDT deployed        datadog-2.20.1    

We might as well upgrade the agents while we are at it:

$ helm repo add datadog https://helm.datadoghq.com
"datadog" already exists with the same configuration, skipping

$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "kuma" chart repository
...Successfully got an update from the "datadog" chart repository
Update Complete. ⎈Happy Helming!⎈

Now upgrade:

$ helm upgrade -f ddvalues.yaml datadogrelease datadog/datadog
Release "datadogrelease" has been upgraded. Happy Helming!
NAME: datadogrelease
LAST DEPLOYED: Tue Sep 14 07:44:49 2021
NAMESPACE: default
STATUS: deployed
REVISION: 4
TEST SUITE: None
NOTES:
Datadog agents are spinning up on each node in your cluster. After a few
minutes, you should see your agents starting in your event stream:
    https://app.datadoghq.com/event/stream
You disabled creation of Secret containing API key, therefore it is expected
that you create Secret named 'dd-secret' which includes a key called 'api-key' containing the API key.

The Datadog Agent is listening on port 8126 for APM service.

#################################################################
####               WARNING: Deprecation notice               ####
#################################################################

The option `datadog.apm.enabled` is deprecated, please use `datadog.apm.portEnabled` to enable TCP communication to the trace-agent.
The option `datadog.apm.socketEnabled` is enabled by default and can be used to rely on unix socket or name-pipe communication.
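
Per that deprecation notice, the next time we touch the values we could swap the APM settings over; a sketch of what that change to ddvalues.yaml might look like (keeping the same port):

datadog:
  apm:
    portEnabled: true
    port: 8126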

Verification:

$ helm list | grep datadog
datadogrelease          default         4               2021-09-14 07:44:49.162285587 -0500 CDT deployed        datadog-2.22.1  

Checking up on updates:

$ kubectl get pods --all-namespaces | grep datadog
default        datadogrelease-kube-state-metrics-6b5956746b-ndcg9      1/1     Running             0          27d
default        datadogrelease-c49d5                                    3/3     Running             4          27d
default        datadogrelease-cluster-agent-5f5f84dd9b-cmc6r           1/1     Running             0          2m10s
default        datadogrelease-d7zc5                                    3/3     Running             0          119s
default        datadogrelease-5jsmh                                    0/3     Init:0/2            0          49s

$ kubectl get pods --all-namespaces | grep datadog
default        datadogrelease-kube-state-metrics-6b5956746b-ndcg9      1/1     Running             0          27d
default        datadogrelease-c49d5                                    3/3     Running             4          27d
default        datadogrelease-cluster-agent-5f5f84dd9b-cmc6r           1/1     Running             0          2m35s
default        datadogrelease-d7zc5                                    3/3     Running             0          2m24s
default        datadogrelease-5jsmh                                    2/3     Running             0          74s
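
Rather than polling with grep, we can also wait on the DaemonSet rollout directly (assuming the DaemonSet shares the release name, which the pod names above suggest):

$ kubectl rollout status daemonset/datadogrelease -n default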

The process explorer is now available to us:

/content/images/2021/09/datadog-findings-12.png

We can now see process details are being gathered:

/content/images/2021/09/datadog-findings-08.png

And if we hop on the host, we can indeed see our Dapr Workflows dotnet service is dominating the CPU:

/content/images/2021/09/datadog-findings-09.png

$ ps -ef | grep dotnet
root      907065  903511 99 Jul25 ?        88-20:16:34 dotnet app/Dapr.Workflows.dll --workflows-path /workflows
isaac    2400949 2399891  0 07:49 pts/2    00:00:00 grep --color=auto dotnet

And our host list shows the same picture:

/content/images/2021/09/datadog-findings-10.png

I can bounce that pod and see if it just needs a quick cycle:

$ kubectl delete pod dapr-workflows-host-85d9d74777-mcq7n
pod "dapr-workflows-host-85d9d74777-mcq7n" deleted

And I can see we are now below 90%, which was my goal.

/content/images/2021/09/datadog-findings-11.png

This is a bit outside the scope of cleanup; we do use that Workflow component regularly, so for now we have cleaned up all we can.

What if we wanted to set an alert? Perhaps we clean up the CPU issue and would like to know when our nodes get pegged again.

DD Monitor (for future detection)

We can set up a new Anomaly detection monitor:

Here we set the metric to system.cpu.system. There are several similar metrics to choose from; I used the “Evaluation” preview to home in on the detail I was seeking.

/content/images/2021/09/datadog-findings-13.png

We then want to set some conditions. I might narrow the host selector later; for now, on all hosts, if the max exceeds the anomaly detection bounds for more than 15 minutes, send the alert:

/content/images/2021/09/datadog-findings-14.png
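
For reference, the query the UI builds behind the scenes should look roughly like the following; this is an illustrative sketch of Datadog's anomalies() monitor syntax rather than a copy of my exact monitor:

max(last_15m):anomalies(max:system.cpu.system{*} by {host}, 'basic', 2) >= 1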

Lastly, we set the alert details and audience:

/content/images/2021/09/datadog-findings-15.png

A test triggering shows what an alert will look like:

/content/images/2021/09/datadog-findings-16.png

Epilogue

Hours later I could hear fans again, but upon inspection it was a different laptop. I had to mute the DD alert since it kept going off, and only later did I realize it was not a false alarm.

/content/images/2021/09/datadog-findings-20.png

I could see the details of the host in question including CPU and memory details.

Next, I was able to go to the Infrastructure Dashboard and look at the Processes tab on the host with high CPU. From there I could sort on CPU usage and see it was the same problematic process: the Dapr Workflow engine.

/content/images/2021/09/datadog-findings-21.png

I could then get details on that process on that host and see that, indeed, after I had cycled the pod, it spiked usage on this host.

/content/images/2021/09/datadog-findings-22.png

Logging into the host shows the dotnet process is indeed dominating the CPU:

builder@anna-MacBookAir:~$ top
top - 20:57:08 up 42 days, 12:13,  1 user,  load average: 6.22, 5.62, 5.23
Tasks: 312 total,   2 running, 310 sleeping,   0 stopped,   0 zombie
%Cpu(s): 71.1 us, 20.5 sy,  0.0 ni,  8.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7867.8 total,    119.0 free,   1579.2 used,   6169.7 buff/cache
MiB Swap:   2048.0 total,   1964.0 free,     84.0 used.   6119.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3423403 root      20   0 2356504  97064  51388 S 351.3   1.2   2541:33 dotnet
3407742 root      20   0 1519296  90964  42816 S   4.0   1.1  10:18.26 process-agent
   1590 root      20   0  896744 151132  29452 S   2.6   1.9   6980:32 k3s-agent
3407662 root      20   0 2068672 165816  75396 S   1.7   2.1  37:26.54 agent
3407301 root      20   0 1523488  94708  52728 S   1.0   1.2   6:36.49 datadog-cluster
    935 gdm       20   0 4012732 103432  42976 S   0.7   1.3  47:10.45 gnome-shell
 562482 builder   20   0   12460   4208   3400 R   0.7   0.1   0:00.03 top
 745160 6789      20   0  850924 115836  46896 S   0.7   1.4 769:53.92 kuma-cp
3890605 5678      20   0 2298480  54904  25760 S   0.7   0.7   1022:01 envoy
    286 root      19  -1  314976 208984 207476 S   0.3   2.6  18:57.78 systemd-journal
   1626 root      20   0  900608 106900  31140 S   0.3   1.3   1025:35 containerd

I tried to renice the PID:

builder@anna-MacBookAir:~$ sudo renice -n 10 -p 3423403
[sudo] password for builder:
3423403 (process ID) old priority 0, new priority 10

That didn't help: the priority changed, but the CPU usage stayed pegged.

Next, I tried setting resource limits on the deployment:

builder@DESKTOP-72D2D9T:~$ kubectl get deployments dapr-workflows-host -o yaml > dapr-wf.dep.yaml
builder@DESKTOP-72D2D9T:~$ kubectl get deployments dapr-workflows-host -o yaml > dapr-wf.dep.yaml.backup
builder@DESKTOP-72D2D9T:~$ vi dapr-wf.dep.yaml
builder@DESKTOP-72D2D9T:~$ diff dapr-wf.dep.yaml dapr-wf.dep.yaml.backup
193,199d192
<         resources:
<           limits:
<             cpu: "1"
<             memory: 512Mi
<           requests:
<             cpu: "1"
<             memory: 512Mi
builder@DESKTOP-72D2D9T:~$ kubectl apply -f dapr-wf.dep.yaml
deployment.apps/dapr-workflows-host configured
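
An equivalent, lighter-touch way to apply the same limits without editing the exported YAML would be kubectl set resources; a sketch (add -c <container-name> if the pod carries sidecars you want to leave alone):

$ kubectl set resources deployment dapr-workflows-host --requests=cpu=1,memory=512Mi --limits=cpu=1,memory=512Mi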

I even cycled the pod, which scheduled it on a different host, but that did not fix the CPU usage:

isaac@isaac-MacBookPro:~$ top
top - 21:54:23 up 42 days, 13:08,  2 users,  load average: 5.01, 3.34, 1.66
Tasks: 325 total,   2 running, 323 sleeping,   0 stopped,   0 zombie
%Cpu(s): 67.2 us, 23.7 sy,  0.0 ni,  8.2 id,  0.9 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :   7871.9 total,   2834.6 free,   1687.5 used,   3349.8 buff/cache
MiB Swap:   2048.0 total,   1463.1 free,    584.9 used.   6305.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2800732 root      20   0 2356768  88792  50532 S 340.5   1.1  13:08.12 dotnet
    992 root      20   0  895500 144356  29224 S   6.6   1.8   5933:57 k3s-agent
1231762 root      20   0 1519040  90948  43520 S   5.0   1.1  12:33.11 process-agent
   1558 root      20   0  927916 107160  36200 S   3.7   1.3   1009:12 containerd
 307704 5678      20   0 2298480  32860  11836 S   0.7   0.4 850:19.16 envoy
1230818 root      20   0  713572  13128   7392 S   0.7   0.2   2:05.58 containerd-shim
1231585 root      20   0 2428180 207572  75440 S   0.7   2.6  67:01.21 agent

More than a week has passed since then, and the fans have died down.

Checking the cluster today, I see:

/content/images/2021/09/datadog-findings-37.png

While it's more than I would like, it doesn't seem to be pushing the host over the top just yet:

/content/images/2021/09/datadog-findings-38.png

The usage is still high, but not critical.

Issues in Datadog (Incident Management)

A feature we haven't explored to date is Datadog Incident Management. With Incident Management, we can create tickets to address issues and assign them to a team or person.

Here I'll create one to address the high CPU later, starting from the “Incidents” tab. In this example, I'll do it entirely from my phone.

/content/images/2021/09/datadog-findings-25.jpg

I can set the severity and tie it to APM by picking an application if I desire:

/content/images/2021/09/datadog-findings-24.jpg

Once saved, I can see details of this incident:

/content/images/2021/09/datadog-findings-23.jpg

I can then share it in a variety of ways:

/content/images/2021/09/datadog-findings-29.jpg

Such as copying the link and sharing it by SMS or to a Teams or Slack channel.

/content/images/2021/09/datadog-findings-28.jpg

/content/images/2021/09/datadog-findings-26.jpg

We can always look up details later under incidents:

/content/images/2021/09/datadog-findings-30.jpg

/content/images/2021/09/datadog-findings-27.jpg

There is not much you can modify by way of process flow; the three states are Active, Resolved, and Stable:

/content/images/2021/09/datadog-findings-32.jpg

We can also find Incident Management in Dashboards

/content/images/2021/09/datadog-findings-33.png

As well as “Incidents” under the Monitors heading

/content/images/2021/09/datadog-findings-34.png

From the new Notifications pane, we can send the incident from the web portal:

/content/images/2021/09/datadog-findings-35.png

which, if using email, looks like:

/content/images/2021/09/datadog-findings-36.png

Summary

We dug into the issues on the cluster. After highlighting the over-capacity host, we cleaned up unused services and failing deployments. We cycled some front-end pods, which immediately rescheduled to new nodes. Lastly, we removed a cluster-in-a-cluster (k0s) and bounced the Dapr workflow host, which was the root cause of the high CPU usage.

We updated the Datadog agent with a fresh Helm chart and enabled process metrics. We explored those and then set up a Datadog monitor for anomalies in CPU usage and showed what that alert would look like.

A quick pro-tip: to see our cluster by size, we can sort our node pool by memory capacity:

$ kubectl get no -o json | jq -r '.items | sort_by(.status.capacity.memory)[]|[.metadata.name,.status.capacity.memory]| @tsv'
anna-macbookair 8056668Ki
isaac-macbookair        8057256Ki
isaac-macbookpro        8060876Ki

or by CPU capacity:

$ kubectl get no -o json | jq -r '.items | sort_by(.status.capacity.cpu)[]|[.metadata.name,.status.capacity.cpu]| @tsv'
anna-macbookair 4
isaac-macbookair        4
isaac-macbookpro        4
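
Or combine both into a single view with a small variation on the same jq approach:

$ kubectl get no -o json | jq -r '.items[] | [.metadata.name, .status.capacity.cpu, .status.capacity.memory] | @tsv'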

In the Epilogue, we looked at some remediation ideas, then created a Datadog incident and showed how we can manage incidents inside Datadog for basic work-item tracking.

I know I come across as a fanboy, but everything I've demoed has stayed within the free tier of Datadog, which shows how useful that offering can be.

datadog kubernetes k3s
