Published: Aug 16, 2023 by Isaac Johnson
Two distinct things happened:
First, the NAS below, SassyNassy, into which my then 4-year-old daughter was installing one of its two 6TB drives back in Aug 2016, finally died.
It died after yet another power outage. One of the two NAS drives was already showing SMART errors and needed replacement regardless.
I did fix it, running down to Microcenter to get a new drive and archiving all the photos to my new NAS (SirNasilot).
Second, within days of the power outage, the A/C in our house died. It was not fun. I have no right to complain about the heat; there are many places in the world seeing triple-digit temps. But it was into the high 80s (F) in the basement, and my laptop (MacBook Air) was screaming hot and dying.
I had every feeling it would not survive, and I could only revive the primary cluster by force-killing the power and rebooting. At one point I even grabbed the laptop and put it in my beer fridge for half an hour.
Let’s dig into the real root causes and some remediations. We will also set up MySQL (MariaDB) on the NAS in preparation for a new cluster, as well as set up an NFS volume for testing before moving our primary cluster over.
Sonarqube
Say your SonarQube pod just keeps dying. Maybe it’s because you had a power outage and/or no proper cooling, or the PostgreSQL pod’s PVC is on a NAS that just went AWOL; regardless of the why, let’s focus on how your test-result-aggregation and code-metrics app is endlessly taking a dump in your cluster…
builder@DESKTOP-QADGF36:~/Workspaces/AzureDevOpsAgent/deploy$ kubectl get pod sonarqube-ce-7f4d8997cb-kzk5n
NAME READY STATUS RESTARTS AGE
sonarqube-ce-7f4d8997cb-kzk5n 0/1 CrashLoopBackOff 3 (6s ago) 85s
Basically, it continues to die because of a stuck Elasticsearch node lock; something it cannot unfoo on its own:
builder@DESKTOP-QADGF36:~/Workspaces/AzureDevOpsAgent/deploy$ kubectl logs sonarqube-ce-7f4d8997cb-kzk5n sonarqube
sonarqube 11:21:16.54
sonarqube 11:21:16.55 Welcome to the Bitnami sonarqube container
sonarqube 11:21:16.55 Subscribe to project updates by watching https://github.com/bitnami/containers
sonarqube 11:21:16.56 Submit issues and feature requests at https://github.com/bitnami/containers/issues
sonarqube 11:21:16.56
sonarqube 11:21:16.57 INFO ==> Validating settings in POSTGRESQL_CLIENT_* env vars
sonarqube 11:21:16.63 INFO ==> Creating SonarQube configuration
sonarqube 11:21:16.72 INFO ==> Trying to connect to the database server
sonarqube 11:21:16.77 INFO ==> Restoring persisted SonarQube installation
sonarqube 11:21:16.87 INFO ==> Setting heap size to -Xmx2048m -Xms1024m
sonarqube 11:21:16.91 INFO ==> ** SonarQube setup finished! **
sonarqube 11:21:16.95 INFO ==> ** Starting SonarQube **
/opt/bitnami/java/bin/java
Running SonarQube...
2023.08.01 11:21:17 INFO app[][o.s.a.AppFileSystem] Cleaning or creating temp directory /opt/bitnami/sonarqube/temp
2023.08.01 11:21:17 INFO app[][o.s.a.es.EsSettings] Elasticsearch listening on [HTTP: 127.0.0.1:9001, TCP: 127.0.0.1:44519]
2023.08.01 11:21:17 INFO app[][o.s.a.ProcessLauncherImpl] Launch process[ELASTICSEARCH] from [/opt/bitnami/sonarqube/elasticsearch]: /opt/bitnami/sonarqube/elasticsearch/bin/elasticsearch
2023.08.01 11:21:17 INFO app[][o.s.a.SchedulerImpl] Waiting for Elasticsearch to be up and running
2023.08.01 11:21:21 INFO es[][o.e.n.Node] version[7.17.4], pid[145], build[default/tar/79878662c54c886ae89206c685d9f1051a9d6411/2022-05-18T18:04:20.964345128Z], OS[Linux/5.15.0-72-generic/amd64], JVM[BellSoft/OpenJDK 64-Bit Server VM/11.0.15.1/11.0.15.1+2-LTS]
2023.08.01 11:21:21 INFO es[][o.e.n.Node] JVM home [/opt/bitnami/java]
2023.08.01 11:21:21 INFO es[][o.e.n.Node] JVM arguments [-XX:+UseG1GC, -Djava.io.tmpdir=/opt/bitnami/sonarqube/temp, -XX:ErrorFile=../logs/es_hs_err_pid%p.log, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djna.tmpdir=/opt/bitnami/sonarqube/temp, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j2.formatMsgNoLookups=true, -Djava.locale.providers=COMPAT, -Dcom.redhat.fips=false, -Des.enforce.bootstrap.checks=true, -Xmx2048m, -Xms2048m, -XX:+HeapDumpOnOutOfMemoryError, -Des.path.home=/opt/bitnami/sonarqube/elasticsearch, -Des.path.conf=/opt/bitnami/sonarqube/temp/conf/es, -Des.distribution.flavor=default, -Des.distribution.type=tar, -Des.bundled_jdk=false]
2023.08.01 11:21:21 INFO es[][o.e.p.PluginsService] loaded module [analysis-common]
2023.08.01 11:21:21 INFO es[][o.e.p.PluginsService] loaded module [lang-painless]
2023.08.01 11:21:21 INFO es[][o.e.p.PluginsService] loaded module [parent-join]
2023.08.01 11:21:21 INFO es[][o.e.p.PluginsService] loaded module [reindex]
2023.08.01 11:21:21 INFO es[][o.e.p.PluginsService] loaded module [transport-netty4]
2023.08.01 11:21:21 INFO es[][o.e.p.PluginsService] no plugins loaded
2023.08.01 11:21:21 ERROR es[][o.e.b.ElasticsearchUncaughtExceptionHandler] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/opt/bitnami/sonarqube/data/es7]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170) ~[elasticsearch-7.17.4.jar:7.17.4]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:157) ~[elasticsearch-7.17.4.jar:7.17.4]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:77) ~[elasticsearch-7.17.4.jar:7.17.4]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112) ~[elasticsearch-cli-7.17.4.jar:7.17.4]
at org.elasticsearch.cli.Command.main(Command.java:77) ~[elasticsearch-cli-7.17.4.jar:7.17.4]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:122) ~[elasticsearch-7.17.4.jar:7.17.4]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:80) ~[elasticsearch-7.17.4.jar:7.17.4]
Caused by: java.lang.IllegalStateException: failed to obtain node locks, tried [[/opt/bitnami/sonarqube/data/es7]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:328) ~[elasticsearch-7.17.4.jar:7.17.4]
at org.elasticsearch.node.Node.<init>(Node.java:429) ~[elasticsearch-7.17.4.jar:7.17.4]
at org.elasticsearch.node.Node.<init>(Node.java:309) ~[elasticsearch-7.17.4.jar:7.17.4]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:234) ~[elasticsearch-7.17.4.jar:7.17.4]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:234) ~[elasticsearch-7.17.4.jar:7.17.4]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:434) ~[elasticsearch-7.17.4.jar:7.17.4]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:166) ~[elasticsearch-7.17.4.jar:7.17.4]
... 6 more
uncaught exception in thread [main]
java.lang.IllegalStateException: failed to obtain node locks, tried [[/opt/bitnami/sonarqube/data/es7]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:328)
at org.elasticsearch.node.Node.<init>(Node.java:429)
at org.elasticsearch.node.Node.<init>(Node.java:309)
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:234)
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:234)
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:434)
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:166)
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:157)
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:77)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112)
at org.elasticsearch.cli.Command.main(Command.java:77)
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:122)
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:80)
For complete error details, refer to the log at /opt/bitnami/sonarqube/logs/sonarqube.log
2023.08.01 11:21:22 WARN app[][o.s.a.p.AbstractManagedProcess] Process exited with exit value [ElasticSearch]: 1
2023.08.01 11:21:22 INFO app[][o.s.a.SchedulerImpl] Process[ElasticSearch] is stopped
2023.08.01 11:21:22 INFO app[][o.s.a.SchedulerImpl] SonarQube is stopped
The only way off this bus to nowhere is to scrub the messed-up es7 directory.
To do that, I needed to add a “fix-elastic-error-lock-issue” init container.
In the initContainers block of the pod spec, I added a quick step to scrub the busted es7 (Elasticsearch) data that had gotten corrupted on a crash.
initContainers:
- command:
- rm
- -rf
- /opt/bitnami/sonarqube/data/es7
image: busybox
imagePullPolicy: IfNotPresent
name: fix-elastic-error-lock-issue
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /opt/bitnami/sonarqube/data
name: data
subPath: es7
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-4lsnm
readOnly: true
- command:
(note: you’ll find the volume mounts in your deploy of Sonar - they have changed over time, so just look up the data mount on the app container and use those values)
Since the lock lived in the PVC, no amount of pod restarts would have cleared it on its own.
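An alternative that avoids touching the Deployment spec at all is a one-off cleanup Job against the same PVC. This is just a sketch: the claim name (sonarqube-ce) is an assumption - check kubectl get pvc for yours - and you would scale the SonarQube deployment to 0 first so the ReadWriteOnce volume is free, then scale it back up afterwards.

apiVersion: batch/v1
kind: Job
metadata:
  name: scrub-sonar-es7
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: scrub
        image: busybox
        # wipe the corrupted Elasticsearch index; SonarQube rebuilds it on next start
        command: ["rm", "-rf", "/sonardata/es7"]
        volumeMounts:
        - mountPath: /sonardata
          name: data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: sonarqube-ce # assumption - use your actual PVC name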
However, it continued to crash, though only because the Kubernetes main node itself kept crashing.
We will revisit this later in the post…
NAS and Power
The biggest reason I had to drive across town in the middle of the day and fetch a new 6TB SATA drive was power outages finally killing my NAS.
I have saved the NAS, I might add. It is (at the time of writing) between 6 and 12 days from fully syncing the second replacement drive.
However, this is something I very well could have avoided by simply using the USB monitoring feature most UPSes include.
I added it to the larger UPS backing my cluster compute.
And the smaller one dedicated to my older NAS and Wi-Fi.
We can then go to “Hardware & Power” in the Control Panel, choose “UPS”, and select “USB UPS” as the UPS Type.
Clicking “Information” will show you the reported power levels.
What should now happen is that, in the event of a sustained outage, the low-power signal (which will arrive after 5 to 15 minutes, depending on the UPS) will trigger the NAS to power down gracefully.
However, for those mid-summer blips where my power just cuts out for a second, nothing will be affected.
AWX Backup
Another concern, when faced with the idea that my primary K8s cluster was FUBAR, was losing some of my AWX configurations. I really do use AWX a lot now, and some stuff is hardcoded or painstakingly teased out (like some complicated variable blocks).
To back things up, we first determine the namespace and deployment:
builder@DESKTOP-QADGF36:~/Workspaces/jekyll-blog$ kubectl get deployment -n adwerx
NAME READY UP-TO-DATE AVAILABLE AGE
adwerxawx 1/1 1 1 376d
Then I can try to create a backup object:
$ cat backup-awx.yaml
---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
name: awxbackup-2023-08-06
namespace: adwerx
spec:
deployment_name: adwerxawx
Since, in my case, I used the AWX Helm chart and not the Operator (with its CRDs), I can’t use the CRD-based backups:
$ kubectl apply -f backup-awx.yaml
error: resource mapping not found for name: "awxbackup-2023-08-06" namespace: "adwerx" from "backup-awx.yaml": no matches for kind "AWXBackup" in version "awx.ansible.com/v1beta1"
ensure CRDs are installed first
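A quick way to confirm the operator’s CRDs really aren’t present (and that the AWXBackup route is a dead end) is:

kubectl get crd | grep -i awx || echo "no AWX CRDs installed"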
However, I can create a Job that installs the AWX CLI and uses it to dump all the objects to files.
I could use a PVC, but to keep it simple, I’ll save them to a /tmp folder on the local host via a hostPath volume.
$ cat backup2.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: awxbackup4
namespace: adwerx
spec:
backoffLimit: 4
template:
spec:
containers:
- name: test
image: alpine
envFrom:
- secretRef:
name: adwerxawx
volumeMounts:
- mountPath: /backup
name: tmpvolume
command:
- bin/sh
- -c
- |
apk --no-cache add curl
apk --no-cache add jq
apk --no-cache add python
apk --no-cache add py3-pip
pip3 install awxkit
export DSTR=`date +"%m-%d-%Y"`
mkdir -p /backup/$DSTR
cd /backup/$DSTR
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --users -f json > awxusers.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --organizations -f json > awxorganizations.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --teams -f json > awxteams.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --credential_types -f json > awxcredentialtypes.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --credentials -f json > awxcredentials.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --notification_templates -f json > awxnotificationtemplates.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --projects -f json > awxprojects.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --inventory -f json > awxinventory.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --inventory_sources -f json > awxinventorysources.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --job_templates -f json > awxjobtemplates.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --workflow_job_templates -f json > awxwfjobtemplates.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --execution_environments -f json > awsexecutionenvironments.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --applications -f json > awsapplications.json
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP export --conf.insecure --schedules -f json > awxschedules.json
volumes:
- name: tmpvolume
hostPath:
path: /tmp
restartPolicy: Never
I can then launch it:
$ kubectl apply -f backup2.yaml
job.batch/awxbackup4 created
I’ll want to hop onto that node to view /tmp, so first I check where the Job’s pod landed:
$ kubectl describe pod awxbackup4-cnnth -n adwerx | grep Node
Node: hp-hp-elitebook-850-g2/192.168.1.57
Node-Selectors: <none>
hp@hp-HP-EliteBook-850-G2:~$ ls -l /tmp/08-06-2023/
total 192
-rw-r--r-- 1 root root 28 Aug 6 17:17 awsapplications.json
-rw-r--r-- 1 root root 0 Aug 6 17:17 awsexecutionenvironments.json
-rw-r--r-- 1 root root 13265 Aug 6 17:17 awxcredentials.json
-rw-r--r-- 1 root root 48824 Aug 6 17:17 awxcredentialtypes.json
-rw-r--r-- 1 root root 15515 Aug 6 17:17 awxinventory.json
-rw-r--r-- 1 root root 33 Aug 6 17:17 awxinventorysources.json
-rw-r--r-- 1 root root 53032 Aug 6 17:17 awxjobtemplates.json
-rw-r--r-- 1 root root 38 Aug 6 17:17 awxnotificationtemplates.json
-rw-r--r-- 1 root root 673 Aug 6 17:17 awxorganizations.json
-rw-r--r-- 1 root root 4828 Aug 6 17:17 awxprojects.json
-rw-r--r-- 1 root root 9681 Aug 6 17:18 awxschedules.json
-rw-r--r-- 1 root root 2297 Aug 6 17:17 awxteams.json
-rw-r--r-- 1 root root 15188 Aug 6 17:16 awxusers.json
-rw-r--r-- 1 root root 38 Aug 6 17:17 awxwfjobtemplates.json
I can now view the exported objects, such as teams:
$ cat /tmp/08-06-2023/awxteams.json
{
"teams": [
{
"name": "MyTeam",
"description": "",
"organization": {
"name": "onprem",
"type": "organization"
},
"related": {
"roles": [],
"object_roles": [
{
"name": "Admin",
"type": "role",
"content_object": {
"organization": {
"name": "onprem",
"type": "organization"
},
"name": "MyTeam",
"type": "team"
}
},
{
"name": "Member",
"type": "role",
"content_object": {
"organization": {
"name": "onprem",
"type": "organization"
},
"name": "MyTeam",
"type": "team"
}
},
{
"name": "Read",
"type": "role",
"content_object": {
"organization": {
"name": "onprem",
"type": "organization"
},
"name": "MyTeam",
"type": "team"
}
}
]
},
"natural_key": {
"organization": {
"name": "onprem",
"type": "organization"
},
"name": "MyTeam",
"type": "team"
}
}
]
}
If I wanted to import, I could do that as well with a Job:
/ # awx --conf.username xxxxxx --conf.password xxxxxxx --conf.host https://awx.freshbrewed.science/ import -h
usage: awx import < exportfile
import resources from stdin
options:
-h, --help show this help message and exit
input/output formatting:
-f {json,yaml}, --conf.format {json,yaml}
specify a format for the input and output
-v, --verbose print debug-level logs, including requests made
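For example, restoring just the teams from the earlier export - run from the same in-cluster environment the backup Job used, so the service env vars and the /backup mount resolve - might look like this sketch:

# restore teams from the export created above (paths are assumptions)
awx --conf.username $AWX_ADMIN_USER --conf.password $AWX_ADMIN_PASSWORD \
  --conf.host http://$ADWERXAWX_SERVICE_HOST:$ADWERXAWX_SERVICE_PORT_HTTP \
  import --conf.insecure < /backup/08-06-2023/awxteams.json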
NFS
As we have moved to a new NAS, it’s time to create an NFS provisioner setup.
I’ll set this up first on the test cluster.
On my new NAS (SirNasilot), I go to File Station and “Create shared folder”.
I use the main volume and give it a description.
I’ll set no guest access, but read/write for others.
I’ll now edit the share and go to NFS Permissions. Here I can click Create.
I’ll enable my wider on-prem network with 192.168.0.0/16. I could have narrowed it to 192.168.1.* by using 192.168.1.0/24.
I can now use it.
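Before wiring it into Kubernetes, a quick sanity check from any Linux box with the NFS client tools installed confirms the export is actually visible (192.168.1.117 is the NAS IP used below):

# list the exports the NAS is advertising (showmount ships with nfs-common on Debian/Ubuntu)
showmount -e 192.168.1.117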
We can use a HelmChart object to pull and install a Helm chart using YAML:
builder@anna-MacBookAir:~$ cat enable_nfs.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
name: nfs
namespace: default
spec:
chart: nfs-subdir-external-provisioner
repo: https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
targetNamespace: default
set:
nfs.server: 192.168.1.117
nfs.path: /volume1/k3snfs81b230805
storageClass.name: nfs
builder@anna-MacBookAir:~$ kubectl apply -f enable_nfs.yaml
helmchart.helm.cattle.io/nfs created
For some reason, the helm-install job hung when going that route - cluster DNS lookups were timing out:
builder@DESKTOP-QADGF36:~$ kubectl logs helm-install-nfs-np9x8 | tail -n10
+ [[ ! -f /chart/nfs.tgz.base64 ]]
+ return
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ nfs-subdir-external-provisioner == stable/* ]]
+ [[ -n https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner ]]
+ helm_v3 repo add nfs https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
Error: looks like "https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner" is not a valid chart repository or cannot be reached: Get "https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/index.yaml": dial tcp: lookup kubernetes-sigs.github.io on 10.43.0.10:53: read udp 10.42.2.24:40311->10.43.0.10:53: i/o timeout
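That error points at cluster DNS rather than the chart repo itself; a one-shot lookup from inside the cluster can confirm it - a quick sketch:

# run a throwaway pod and resolve the chart repo host against cluster DNS
kubectl run dns-test --rm -it --restart=Never --image=busybox -- \
  nslookup kubernetes-sigs.github.io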
So I just did it the traditional way:
$ helm repo add nfs https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
"nfs" has been added to your repositories
$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "kube-state-metrics" chart repository
...Successfully got an update from the "nfs" chart repository
...Successfully got an update from the "longhorn" chart repository
...Successfully got an update from the "adwerx" chart repository
...Successfully got an update from the "kuma" chart repository
...Successfully got an update from the "confluentinc" chart repository
...Successfully got an update from the "zabbix-community" chart repository
...Successfully got an update from the "novum-rgi-helm" chart repository
...Successfully got an update from the "azure-samples" chart repository
...Successfully got an update from the "opencost" chart repository
...Successfully got an update from the "akomljen-charts" chart repository
...Successfully got an update from the "btungut" chart repository
...Successfully got an update from the "dapr" chart repository
...Successfully got an update from the "rhcharts" chart repository
...Successfully got an update from the "myharbor" chart repository
...Successfully got an update from the "freshbrewed" chart repository
...Successfully got an update from the "portainer" chart repository
...Successfully got an update from the "jfelten" chart repository
...Successfully got an update from the "actions-runner-controller" chart repository
...Successfully got an update from the "open-telemetry" chart repository
...Successfully got an update from the "epsagon" chart repository
...Successfully got an update from the "kubecost" chart repository
...Successfully got an update from the "lifen-charts" chart repository
...Successfully got an update from the "castai-helm" chart repository
...Successfully got an update from the "sonarqube" chart repository
...Successfully got an update from the "elastic" chart repository
...Successfully got an update from the "nginx-stable" chart repository
...Successfully got an update from the "datadog" chart repository
...Successfully got an update from the "argo-cd" chart repository
...Successfully got an update from the "signoz" chart repository
...Successfully got an update from the "rook-release" chart repository
...Successfully got an update from the "harbor" chart repository
...Successfully got an update from the "kiwigrid" chart repository
...Successfully got an update from the "grafana" chart repository
...Successfully got an update from the "crossplane-stable" chart repository
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "rancher-latest" chart repository
...Successfully got an update from the "sumologic" chart repository
...Successfully got an update from the "ngrok" chart repository
...Successfully got an update from the "hashicorp" chart repository
...Successfully got an update from the "gitea-charts" chart repository
...Successfully got an update from the "uptime-kuma" chart repository
...Successfully got an update from the "newrelic" chart repository
...Successfully got an update from the "openzipkin" chart repository
...Successfully got an update from the "bitnami" chart repository
...Successfully got an update from the "incubator" chart repository
...Successfully got an update from the "gitlab" chart repository
Update Complete. ⎈Happy Helming!⎈
$ helm install nfs-subdir-external-provisioner --set nfs.server=192.168.1.117 --set nfs.path=/volume1/k3nfs81b230805 --set storageClass.name=nfs nfs/nfs-subdir-external-provisioner
NAME: nfs-subdir-external-provisioner
LAST DEPLOYED: Sat Aug 5 18:12:54 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
I can now see the StorageClass:
builder@DESKTOP-QADGF36:~$ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path (default) rancher.io/local-path Delete WaitForFirstConsumer false 26d
nfs cluster.local/nfs-subdir-external-provisioner Delete Immediate true 49s
I realized a bit later that I had a typo in the nfs.path (I forgot the “s”), so I upgraded the release:
$ helm upgrade --install nfs-subdir-external-provisioner --set nfs.server=192.168.1.117 --set nfs.path=/volume1/k3snfs81b230805 --set storageClass.name=nfs nfs/nfs-subdir-external-provisioner
Release "nfs-subdir-external-provisioner" has been upgraded. Happy Helming!
NAME: nfs-subdir-external-provisioner
LAST DEPLOYED: Sat Aug 5 18:32:56 2023
NAMESPACE: default
STATUS: deployed
REVISION: 3
TEST SUITE: None
builder@DESKTOP-QADGF36:~$
I can now test with a small PVC:
$ cat testNFS.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: nfsclaim
spec:
accessModes:
- ReadWriteOnce
storageClassName: nfs
resources:
requests:
storage: 110Mi
And see that it binds:
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-gitea-0 Bound pvc-bffada16-6a38-4968-be79-604ede967b13 10Gi RWO local-path 19d
data-gitea-postgresql-0 Bound pvc-1c317391-7d2c-405d-a505-d8fe42c44f03 10Gi RWO local-path 19d
act-runner-vol Bound pvc-b15a377c-8d9f-427f-9e52-d9b6cf457dc0 1Gi RWO local-path 19d
configs-pvc Bound pvc-63a24130-ae57-41d6-a158-3f96fa2df2d0 1Gi RWO local-path 12d
icons-pvc Bound pvc-4fd445e3-0fc3-42a5-a604-d3f9855d8736 1Gi RWO local-path 12d
nfsclaim Bound pvc-a5b7daa8-03e2-40b5-b626-79f8b4b0652c 110Mi RWO nfs 19m
One can also test the mount locally from another machine:
builder@builder-MacBookPro2:~$ sudo mkdir /tmptest
builder@builder-MacBookPro2:~$ sudo mount -t nfs 192.168.1.117:/volume1/k3snfs81b230805 /tmptest
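To prove writes from inside the cluster actually land on the share (not just that the claim binds), a throwaway pod mounting the claim works - a minimal sketch using the nfsclaim PVC from above:

apiVersion: v1
kind: Pod
metadata:
  name: nfs-write-test
spec:
  restartPolicy: Never
  containers:
  - name: writer
    image: busybox
    # drop a timestamp file onto the NFS-backed claim and echo it back
    command: ["sh", "-c", "date > /data/hello.txt && cat /data/hello.txt"]
    volumeMounts:
    - mountPath: /data
      name: nfsvol
  volumes:
  - name: nfsvol
    persistentVolumeClaim:
      claimName: nfsclaim

After a kubectl apply, kubectl logs nfs-write-test should echo the timestamp, and hello.txt should show up under the share on the NAS.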
Lastly, I can see the test PVC’s folder on the NAS.
I can also make this StorageClass the default:
builder@DESKTOP-QADGF36:~$ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path (default) rancher.io/local-path Delete WaitForFirstConsumer false 26d
nfs cluster.local/nfs-subdir-external-provisioner Delete Immediate true 23m
builder@DESKTOP-QADGF36:~$ kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
storageclass.storage.k8s.io/local-path patched
builder@DESKTOP-QADGF36:~$ kubectl patch storageclass nfs -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
storageclass.storage.k8s.io/nfs patched
builder@DESKTOP-QADGF36:~$ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path rancher.io/local-path Delete WaitForFirstConsumer false 26d
nfs (default) cluster.local/nfs-subdir-external-provisioner Delete Immediate true 24m
MySQL (MariaDB) HA mode for K3s
We need to create a database. First, I’ll want to set the root password.
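The Synology package keeps its client tools under /usr/local/mariadb10/bin (the same path as the mysql client used below); assuming mysqladmin ships alongside it, setting or changing the root password is a one-liner - a sketch, with the new password as a placeholder:

# set or change the MariaDB root password; -p prompts for the current one, if any
/usr/local/mariadb10/bin/mysqladmin -u root -p password 'S0me-N3w-R00t-Pw'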
I can check the active users that can log in today:
ijohnson@sirnasilot:~$ /usr/local/mariadb10/bin/mysql -u root -p
Enter password:
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 122
Server version: 10.3.37-MariaDB Source distribution
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> SELECT User, Host FROM mysql.user WHERE Host <> 'localhost';
+-------+-----------+
| User | Host |
+-------+-----------+
| gitea | % |
| root | 127.0.0.1 |
| root | ::1 |
+-------+-----------+
3 rows in set (0.001 sec)
I now want a K3s user/password and a database to use:
MariaDB [(none)]> CREATE USER 'k3s77'@'%' IDENTIFIED BY 'Kubeapi-77';
Query OK, 0 rows affected (0.028 sec)
MariaDB [(none)]> GRANT ALL PRIVILEGES ON *.* TO 'k3s77'@'%' IDENTIFIED BY 'Kubeapi-77' WITH GRANT OPTION;
Query OK, 0 rows affected (0.027 sec)
MariaDB [(none)]> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.002 sec)
I’ll want to test this, as I can see grants for “%” but not for a specific host:
MariaDB [(none)]> SHOW GRANTS FOR 'k3s77'@'192.168.1.78';
ERROR 1141 (42000): There is no such grant defined for user 'k3s77' on host '192.168.1.78'
MariaDB [(none)]> SHOW GRANTS FOR 'k3s77'@'%';
+---------------------------------------------------------------------------------------------------------------------------------+
| Grants for k3s77@% |
+---------------------------------------------------------------------------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO `k3s77`@`%` IDENTIFIED BY PASSWORD '*09E15244EF0FD0E02B85AC9EDA91E4810F6C15D6' WITH GRANT OPTION |
+---------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.000 sec)
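The whole point of this user is to serve as the external datastore for the rebuilt K3s cluster. When that time comes, pointing a new K3s server at it would look roughly like this sketch (k3sdb is an assumed database name; 192.168.1.117 is the NAS):

# install a K3s server backed by the external MariaDB datastore
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint="mysql://k3s77:Kubeapi-77@tcp(192.168.1.117:3306)/k3sdb"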
Saving Configs and Secrets
If I’m going to wipe a server, I had better stash some secrets. I had already copied down some key containers.
builder@DESKTOP-QADGF36:~$ kubectl get secrets -o yaml > ~/k3s-77-2023-08-05-secrets.yaml
builder@DESKTOP-QADGF36:~$ kubectl get cm -o yaml > ~/k3s-77-2023-08-05-cm.yaml
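Those two commands only cover the default namespace; adding -A covers everything in one shot:

kubectl get secrets -A -o yaml > ~/k3s-77-2023-08-05-secrets-all.yaml
kubectl get cm -A -o yaml > ~/k3s-77-2023-08-05-cm-all.yaml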
WTF
Convinced I had a host going to pot, I was planning on destroying and re-creating it.
I looked at btop and indeed k3s was pegged, but then I noticed in Datadog the number of containers listed as active on one host… wait, that does not make sense.
That is WAY too many AzDO runners.
I popped in and saw that it had scaled up to 97!
builder@DESKTOP-QADGF36:~/Workspaces$ kubectl describe runnerdeployment my-azdo-deployment
Name: my-azdo-deployment
Namespace: default
Labels: <none>
Annotations: <none>
API Version: actions.summerwind.dev/v1alpha1
Kind: RunnerDeployment
Metadata:
Creation Timestamp: 2023-08-01T11:32:29Z
Generation: 1
Resource Version: 194448313
UID: 219d4f44-21a7-4c3c-b6bb-db5a54b8c521
Spec:
Effective Time: <nil>
Selector: <nil>
Template:
Metadata:
Spec:
Docker Enabled: true
Dockerd Container Resources:
Env:
Name: AZP_POOL
Value: k3s77-self-hosted
Name: AZP_URL
Value: https://dev.azure.com/princessking
Name: AZP_TOKEN
Value From:
Secret Key Ref:
Key: azdopat
Name: azdopat
Optional: false
Image: idjohnson/azdoagent:0.1.0
Image Pull Policy: IfNotPresent
Labels:
my-azdo-deployment
Repository: idjohnson/ansible-playbooks
Resources:
Status:
Available Replicas: 1
Desired Replicas: 1
Ready Replicas: 1
Replicas: 97
Updated Replicas: 97
Events: <none>
My agent count was super high.
And btop was showing a pegged host.
I clearly did something really stoopid.
builder@DESKTOP-QADGF36:~/Workspaces$ kubectl get runnerdeployment
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
new-jekyllrunner-deployment 5 5 5 5 192d
my-azdo-deployment 97 97 1 4d12h
I tried to force-set replicas to 2 and it still kept scaling up!
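(For reference, one way to force the replica count down is a merge patch on the CR - a sketch of the idea, not necessarily the exact invocation used here.)

# try to pin the RunnerDeployment at 2 replicas
kubectl patch runnerdeployment my-azdo-deployment --type merge -p '{"spec":{"replicas":2}}'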
$ kubectl describe runnerdeployment my-azdo-deployment
Name: my-azdo-deployment
Namespace: default
Labels: <none>
Annotations: <none>
API Version: actions.summerwind.dev/v1alpha1
Kind: RunnerDeployment
Metadata:
Creation Timestamp: 2023-08-01T11:32:29Z
Generation: 2
Resource Version: 194452014
UID: 219d4f44-21a7-4c3c-b6bb-db5a54b8c521
Spec:
Effective Time: <nil>
Replicas: 2
Selector: <nil>
Template:
Metadata:
Spec:
Docker Enabled: true
Dockerd Container Resources:
Env:
Name: AZP_POOL
Value: k3s77-self-hosted
Name: AZP_URL
Value: https://dev.azure.com/princessking
Name: AZP_TOKEN
Value From:
Secret Key Ref:
Key: azdopat
Name: azdopat
Optional: false
Image: idjohnson/azdoagent:0.1.0
Image Pull Policy: IfNotPresent
Labels:
my-azdo-deployment
Repository: idjohnson/ansible-playbooks
Resources:
Status:
Available Replicas: 2
Desired Replicas: 2
Ready Replicas: 2
Replicas: 99
Updated Replicas: 99
Events: <none>
It was going to scale until the host fell over (again).
I saved this mistake aside and then deleted the RunnerDeployment:
builder@DESKTOP-QADGF36:~/Workspaces$ kubectl get runnerdeployment my-azdo-deployment -o yaml > runnerdeployment.azdo.bigissue.yaml
builder@DESKTOP-QADGF36:~/Workspaces$ kubectl delete runnerdeployment my-azdo-deployment
runnerdeployment.actions.summerwind.dev "my-azdo-deployment" deleted
There were still WAY too many pods:
builder@DESKTOP-QADGF36:~/Workspaces$ kubectl get pods -l runner-deployment-name=my-azdo-deployment
NAME READY STATUS RESTARTS AGE
my-azdo-deployment-ghmfn-hfrlg 2/2 Running 0 4d11h
my-azdo-deployment-ghmfn-fddh7 2/2 Running 0 4d6h
my-azdo-deployment-ghmfn-7f2mq 2/2 Running 0 4d4h
my-azdo-deployment-ghmfn-qdtnn 2/2 Running 0 4d10h
my-azdo-deployment-ghmfn-qcbp9 2/2 Running 0 4d6h
my-azdo-deployment-ghmfn-pqdx8 2/2 Running 0 4d10h
my-azdo-deployment-ghmfn-l2xpw 2/2 Running 0 4d8h
my-azdo-deployment-ghmfn-7wclm 2/2 Running 0 4d7h
my-azdo-deployment-ghmfn-bhxhj 2/2 Running 0 4d12h
my-azdo-deployment-ghmfn-9xrm4 2/2 Running 0 4d5h
my-azdo-deployment-ghmfn-6k6h6 2/2 Running 0 4d10h
my-azdo-deployment-ghmfn-5cml2 2/2 Running 0 4d11h
my-azdo-deployment-ghmfn-xw654 2/2 Running 0 4d10h
my-azdo-deployment-ghmfn-7jb9r 2/2 Running 0 4d6h
my-azdo-deployment-ghmfn-r5xwg 2/2 Running 0 4d6h
my-azdo-deployment-ghmfn-nmsdf 2/2 Running 0 4d5h
my-azdo-deployment-ghmfn-f5vqr 2/2 Running 0 4d9h
my-azdo-deployment-ghmfn-n9j6v 2/2 Running 0 4d9h
my-azdo-deployment-ghmfn-xqqsd 2/2 Running 0 4d7h
my-azdo-deployment-ghmfn-7x6tj 2/2 Running 0 4d7h
my-azdo-deployment-ghmfn-nsvbj 2/2 Running 0 4d9h
my-azdo-deployment-ghmfn-7wbxv 2/2 Running 0 4d6h
my-azdo-deployment-ghmfn-rf84l 2/2 Running 0 4d5h
my-azdo-deployment-ghmfn-47zwh 2/2 Running 0 4d2h
my-azdo-deployment-ghmfn-vjqzl 2/2 Running 0 4d2h
my-azdo-deployment-ghmfn-v4l5b 2/2 Running 0 4d2h
my-azdo-deployment-ghmfn-jcfrc 2/2 Running 0 4d2h
my-azdo-deployment-ghmfn-8ddkd 2/2 Running 0 4d2h
my-azdo-deployment-ghmfn-8bjtz 2/2 Running 0 4d1h
my-azdo-deployment-ghmfn-zjs26 2/2 Running 0 4d1h
my-azdo-deployment-ghmfn-hwcmd 2/2 Running 0 4d1h
my-azdo-deployment-ghmfn-2tzbs 2/2 Running 0 4d1h
my-azdo-deployment-ghmfn-xrtbp 2/2 Running 0 4d1h
my-azdo-deployment-ghmfn-4klj6 2/2 Running 0 4d1h
my-azdo-deployment-ghmfn-fv2z2 2/2 Running 0 4d1h
my-azdo-deployment-ghmfn-2vsx7 2/2 Running 0 4d1h
my-azdo-deployment-ghmfn-jsgqb 2/2 Running 0 4d1h
my-azdo-deployment-ghmfn-qg7bp 2/2 Running 0 4d
my-azdo-deployment-ghmfn-bc2pz 2/2 Running 0 4d5h
my-azdo-deployment-ghmfn-7wntl 2/2 Running 0 4d10h
my-azdo-deployment-ghmfn-kdwcw 2/2 Running 0 4d11h
my-azdo-deployment-ghmfn-s5j2n 2/2 Running 0 4d11h
my-azdo-deployment-ghmfn-w4r87 2/2 Running 0 4d9h
my-azdo-deployment-ghmfn-5khs6 2/2 Running 0 4d9h
my-azdo-deployment-ghmfn-d9jhp 2/2 Running 0 4d12h
my-azdo-deployment-ghmfn-pmdg4 2/2 Running 0 4d8h
my-azdo-deployment-ghmfn-jmgpz 2/2 Running 0 4d
my-azdo-deployment-ghmfn-xtw9v 2/2 Running 0 4d8h
my-azdo-deployment-ghmfn-mdrj6 2/2 Running 0 4d4h
my-azdo-deployment-ghmfn-rpvbj 2/2 Running 0 4d10h
my-azdo-deployment-ghmfn-jbngp 2/2 Running 0 4d8h
my-azdo-deployment-ghmfn-mqx8x 2/2 Running 0 4d7h
my-azdo-deployment-ghmfn-rwkng 2/2 Running 0 4d8h
my-azdo-deployment-ghmfn-wtqm2 2/2 Running 0 4d11h
my-azdo-deployment-ghmfn-fl79k 2/2 Running 0 4d5h
my-azdo-deployment-ghmfn-qd68b 2/2 Running 0 4d8h
my-azdo-deployment-ghmfn-5t87x 2/2 Running 0 4d4h
my-azdo-deployment-ghmfn-2j7n6 2/2 Running 0 4d4h
my-azdo-deployment-ghmfn-hd5cs 2/2 Running 0 4d10h
my-azdo-deployment-ghmfn-b6glw 2/2 Running 0 2d10h
my-azdo-deployment-ghmfn-xkfqg 2/2 Running 0 2d10h
my-azdo-deployment-ghmfn-h8jqj 2/2 Running 0 2d9h
my-azdo-deployment-ghmfn-fzn4z 2/2 Running 0 70m
my-azdo-deployment-ghmfn-rq5kq 2/2 Running 0 70m
my-azdo-deployment-ghmfn-s7548 2/2 Running 0 81m
my-azdo-deployment-ghmfn-9ts2s 2/2 Running 0 49m
my-azdo-deployment-ghmfn-lrc2c 2/2 Running 0 49m
my-azdo-deployment-ghmfn-7krnd 2/2 Running 0 60m
my-azdo-deployment-ghmfn-bbjph 2/2 Running 0 39m
my-azdo-deployment-ghmfn-cmh72 2/2 Running 0 39m
my-azdo-deployment-ghmfn-7lth7 2/2 Running 0 29m
my-azdo-deployment-ghmfn-dsncv 2/2 Running 0 29m
my-azdo-deployment-ghmfn-thtn8 2/2 Running 0 8m5s
my-azdo-deployment-ghmfn-tr2bb 2/2 Running 0 18m
my-azdo-deployment-ghmfn-w442v 0/2 Terminating 0 4d12h
my-azdo-deployment-ghmfn-xswpf 0/2 Terminating 0 4d8h
my-azdo-deployment-ghmfn-49hkd 0/2 Terminating 0 4d7h
my-azdo-deployment-ghmfn-pdg4d 2/2 Terminating 0 4d5h
my-azdo-deployment-ghmfn-k74cm 2/2 Terminating 0 4d8h
my-azdo-deployment-ghmfn-jz9ls 0/2 Terminating 0 4d9h
my-azdo-deployment-ghmfn-6z9ff 2/2 Terminating 0 4d11h
my-azdo-deployment-ghmfn-h8wmt 2/2 Terminating 0 4d4h
my-azdo-deployment-ghmfn-z528t 0/2 Terminating 0 4d
my-azdo-deployment-ghmfn-78jhs 0/2 Terminating 0 4d8h
my-azdo-deployment-ghmfn-lmrfv 2/2 Terminating 0 4d5h
my-azdo-deployment-ghmfn-fn5bm 0/2 Terminating 0 4d12h
my-azdo-deployment-ghmfn-78zr4 0/2 Terminating 0 4d2h
my-azdo-deployment-ghmfn-q4kv8 2/2 Terminating 0 8m9s
my-azdo-deployment-ghmfn-2r4qp 2/2 Terminating 0 4d6h
my-azdo-deployment-ghmfn-2nzzl 2/2 Terminating 0 4d11h
my-azdo-deployment-ghmfn-q44ql 0/2 Terminating 0 4d12h
my-azdo-deployment-ghmfn-4d49l 2/2 Terminating 0 4d7h
my-azdo-deployment-ghmfn-vnc9j 2/2 Terminating 0 4d3h
my-azdo-deployment-ghmfn-zttxz 0/2 Terminating 0 4d9h
Time to go nuclear:
builder@DESKTOP-QADGF36:~/Workspaces$ kubectl delete pods -l runner-deployment-name=my-azdo-deployment
pod "my-azdo-deployment-ghmfn-hfrlg" deleted
pod "my-azdo-deployment-ghmfn-fddh7" deleted
pod "my-azdo-deployment-ghmfn-7f2mq" deleted
When that was done, I finally started to see my laptop’s load come down.
By the next morning, all was well again.
Sonarqube (part 2)
I promised earlier we would revisit SonarQube. After the insane crashing of the system was brought under control, SonarQube came back.
I could see the new pod held up.
The logs also showed the active pod was working.
It took a bit, but on reboot it did clear and start to run again.
Summary
Today we learned some lessons, right friends? First - don’t yank the power out of a NAS that is already in trouble. You’ll have a bad day.
Second, if a host just keeps burning up, even if you assume it’s because of shoddy A/C, check whether you have a ridonculous number of pods running on it. If you have a replica set that grows without bound, you’re going to have a rough go.
We started prep work for a new cluster - namely setting up MariaDB and NFS. We showed how to back up AWX and, lastly, a quick fix for SonarQube.