Published: Nov 25, 2023 by Isaac Johnson
Recently, I hopped on my Immich site and found that a crashing pod was holding the whole system up. I decided to write up how to find and fix the issue, as this has happened to me before on other containerized services with local file dependencies.
It was also a good opportunity to build out a basic site watcher using scheduled jobs in AWX (with bash, PagerDuty, and Resend.dev).
The Problem
Simply put, I found the site was down.
I panicked that it was due to a cluster crash, which happens at times.
Looking into the pods in the namespace:
$ kubectl get pods -n immich
NAME READY STATUS RESTARTS AGE
immichprod-redis-master-0 1/1 Running 0 5d18h
immichprod-web-dd4b87954-g2n56 1/1 Running 0 21d
immichprod-machine-learning-75d84dd96f-rzswr 1/1 Running 0 5d18h
immichprod-postgresql-0 1/1 Running 0 21d
immichprod-proxy-764698ffb4-ncmst 1/1 Running 12 (4d16h ago) 5d18h
immichprod-typesense-55bf4bd8d6-z77zc 0/1 CrashLoopBackOff 1633 (109s ago) 5d18h
immichprod-server-c6698b848-lxs59 0/1 Running 2132 (5m56s ago) 5d18h
immichprod-microservices-5b969c856d-2wnq8 1/1 Running 917 (5m43s ago) 5d18h
I could see two were stuck in a crashloop.
My first move is always to just rotate them and see if they come back up:
$ kubectl delete pod immichprod-typesense-55bf4bd8d6-z77zc -n immich && kubectl delete pod immichprod-server-c6698b848-lxs59 -n immich
pod "immichprod-typesense-55bf4bd8d6-z77zc" deleted
pod "immichprod-server-c6698b848-lxs59" deleted
$ kubectl get pods -n immich
NAME READY STATUS RESTARTS AGE
immichprod-redis-master-0 1/1 Running 0 5d18h
immichprod-web-dd4b87954-g2n56 1/1 Running 0 21d
immichprod-machine-learning-75d84dd96f-rzswr 1/1 Running 0 5d18h
immichprod-postgresql-0 1/1 Running 0 21d
immichprod-proxy-764698ffb4-ncmst 1/1 Running 12 (4d16h ago) 5d18h
immichprod-microservices-5b969c856d-2wnq8 1/1 Running 917 (7m10s ago) 5d18h
immichprod-server-c6698b848-7nlj4 0/1 Running 0 46s
immichprod-typesense-55bf4bd8d6-sg49p 0/1 CrashLoopBackOff 2 (15s ago) 48s
The logs pointed to a local filesystem DB left locked when a pod crashed (likely during a forced reboot):
$ kubectl logs immichprod-typesense-55bf4bd8d6-sg49p -n immich
I20231121 11:49:56.728648 1 typesense_server_utils.cpp:357] Starting Typesense 0.24.0
I20231121 11:49:56.728678 1 typesense_server_utils.cpp:360] Typesense is using jemalloc.
I20231121 11:49:56.728938 1 typesense_server_utils.cpp:409] Thread pool size: 32
I20231121 11:49:56.732545 1 store.h:63] Initializing DB by opening state dir: /tsdata/db
E20231121 11:49:56.744969 1 store.h:67] Error while initializing store: IO error: While lock file: /tsdata/db/LOCK: Resource temporarily unavailable
E20231121 11:49:56.744992 1 store.h:69] It seems like the data directory /tsdata/db is already being used by another Typesense server.
E20231121 11:49:56.744998 1 store.h:71] If you are SURE that this is not the case, delete the LOCK file in the data db directory and try again.
I20231121 11:49:56.745041 1 store.h:63] Initializing DB by opening state dir: /tsdata/meta
E20231121 11:49:56.768613 1 store.h:67] Error while initializing store: IO error: While lock file: /tsdata/meta/LOCK: Resource temporarily unavailable
E20231121 11:49:56.768626 1 store.h:69] It seems like the data directory /tsdata/meta is already being used by another Typesense server.
E20231121 11:49:56.768631 1 store.h:71] If you are SURE that this is not the case, delete the LOCK file in the data db directory and try again.
E20231121 11:49:56.894925 1 backward.hpp:4199] Stack trace (most recent call last):
E20231121 11:49:56.894938 1 backward.hpp:4199] #5 Object "/opt/typesense-server", at 0x4c8f50, in _start
E20231121 11:49:56.894943 1 backward.hpp:4199] #4 Object "/usr/lib/x86_64-linux-gnu/libc-2.31.so", at 0x7fe50f528082, in __libc_start_main
E20231121 11:49:56.894946 1 backward.hpp:4199] #3 Source "/typesense/src/main/typesense_server.cpp", line 160, in main [0x4999b2]
E20231121 11:49:56.894950 1 backward.hpp:4199] #2 Source "/typesense/src/typesense_server_utils.cpp", line 452, in run_server [0x6f2d8c]
E20231121 11:49:56.894954 1 backward.hpp:4199] #1 Source "/typesense/src/ratelimit_manager.cpp", line 429, in init [0x6cc5a8]
E20231121 11:49:56.894968 1 backward.hpp:4199] #0 Source "/typesense/include/store.h", line 154, in get [0x4ec6ef]
Segmentation fault (Address not mapped to object [(nil)])
E20231121 11:49:57.017424 1 typesense_server.cpp:102] Typesense 0.24.0 is terminating abruptly.
I’ve had this happen before - never fun. But this is solvable.
First, let’s get the YAML for the crashing pod:
$ kubectl get pod immichprod-typesense-55bf4bd8d6-sg49p -n immich -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-11-21T11:49:24Z"
  generateName: immichprod-typesense-55bf4bd8d6-
  labels:
    app.kubernetes.io/instance: immichprod
    app.kubernetes.io/name: typesense
    pod-template-hash: 55bf4bd8d6
  name: immichprod-typesense-55bf4bd8d6-sg49p
  namespace: immich
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: immichprod-typesense-55bf4bd8d6
    uid: f3e9941f-61f1-41eb-aafa-d5eff526f4a6
  resourceVersion: "270227029"
  uid: 4e7d3022-468e-4855-800d-75d45863a8bc
spec:
  automountServiceAccountToken: true
  containers:
  - env:
    - name: DB_DATABASE_NAME
      value: immich
    - name: DB_HOSTNAME
      value: immichprod-postgresql
    - name: DB_PASSWORD
      value: notthepassword
    - name: DB_USERNAME
      value: immich
    - name: IMMICH_MACHINE_LEARNING_URL
      value: http://immichprod-machine-learning:3003
    - name: IMMICH_SERVER_URL
      value: http://immichprod-server:3001
    - name: IMMICH_WEB_URL
      value: http://immichprod-web:3000
    - name: REDIS_HOSTNAME
      value: immichprod-redis-master
    - name: TYPESENSE_API_KEY
      value: typesense
    - name: TYPESENSE_DATA_DIR
      value: /tsdata
    - name: TYPESENSE_ENABLED
      value: "true"
    - name: TYPESENSE_HOST
      value: immichprod-typesense
    image: docker.io/typesense/typesense:0.24.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: http
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: immichprod-typesense
    ports:
    - containerPort: 8108
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: http
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tsdata
      name: tsdata
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-896hw
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: builder-hp-elitebook-850-g2
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: tsdata
    persistentVolumeClaim:
      claimName: immichprod-typesense-tsdata
  - name: kube-api-access-896hw
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-11-21T11:49:27Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-11-21T11:49:27Z"
    message: 'containers with unready status: [immichprod-typesense]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-11-21T11:49:27Z"
    message: 'containers with unready status: [immichprod-typesense]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-11-21T11:49:26Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://b9665cc80044ef1efa538949e8cf0277587c7ad999ff3cb72d26e4acabf132a6
    image: docker.io/typesense/typesense:0.24.0
    imageID: docker.io/typesense/typesense@sha256:3cc1251f09ef6c75a5b1f2751c04e7265c770c0f2b69cba1f9a9f20da57cfa28
    lastState:
      terminated:
        containerID: containerd://b9665cc80044ef1efa538949e8cf0277587c7ad999ff3cb72d26e4acabf132a6
        exitCode: 1
        finishedAt: "2023-11-21T11:51:10Z"
        reason: Error
        startedAt: "2023-11-21T11:51:09Z"
    name: immichprod-typesense
    ready: false
    restartCount: 4
    started: false
    state:
      waiting:
        message: back-off 1m20s restarting failed container=immichprod-typesense pod=immichprod-typesense-55bf4bd8d6-sg49p_immich(4e7d3022-468e-4855-800d-75d45863a8bc)
        reason: CrashLoopBackOff
  hostIP: 192.168.1.215
  phase: Running
  podIP: 10.42.3.148
  podIPs:
  - ip: 10.42.3.148
  qosClass: BestEffort
  startTime: "2023-11-21T11:49:27Z"
The parts we care about are the volumeMounts in the spec:
    volumeMounts:
    - mountPath: /tsdata
      name: tsdata
and the source of those volumeMounts, the volumes:
  volumes:
  - name: tsdata
    persistentVolumeClaim:
      claimName: immichprod-typesense-tsdata
I’ll create a quick utility pod that we can use to fix this:
$ cat immichpod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-pod
spec:
  containers:
  - name: ubuntu-container
    image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    volumeMounts:
    - name: immichprod-typesense-tsdata
      mountPath: /tsdata
  volumes:
  - name: immichprod-typesense-tsdata
    persistentVolumeClaim:
      claimName: immichprod-typesense-tsdata
$ kubectl apply -f immichpod.yaml -n immich
pod/ubuntu-pod created
Next, I’ll confirm the utility pod is up and exec into it:
$ kubectl get pods -n immich
NAME READY STATUS RESTARTS AGE
immichprod-redis-master-0 1/1 Running 0 5d18h
immichprod-web-dd4b87954-g2n56 1/1 Running 0 21d
immichprod-machine-learning-75d84dd96f-rzswr 1/1 Running 0 5d19h
immichprod-postgresql-0 1/1 Running 0 22d
immichprod-proxy-764698ffb4-ncmst 1/1 Running 12 (4d16h ago) 5d19h
immichprod-typesense-55bf4bd8d6-sg49p 0/1 CrashLoopBackOff 7 (68s ago) 12m
ubuntu-pod 1/1 Running 0 23s
immichprod-microservices-5b969c856d-2wnq8 0/1 CrashLoopBackOff 918 (21s ago) 5d19h
immichprod-server-c6698b848-7nlj4 0/1 CrashLoopBackOff 7 (6s ago) 12m
$ kubectl exec -it ubuntu-pod -n immich -- /bin/bash
root@ubuntu-pod:/#
I’ll remove the LOCK files:
$ kubectl exec -it ubuntu-pod -n immich -- /bin/bash
root@ubuntu-pod:/# ls /tsdata/db/
000975.log CURRENT LOCK LOG.old.1700567469760409 LOG.old.1700567729831999 MANIFEST-000004 archive
000977.sst IDENTITY LOG LOG.old.1700567558745219 LOG.old.1700568039807357 OPTIONS-000007
root@ubuntu-pod:/# rm -f /tsdata/db/LOCK
root@ubuntu-pod:/# rm -f /tsdata/meta/LOCK
root@ubuntu-pod:/# exit
exit
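With the stale LOCK files gone, I rotated the crash-looping pods one more time, reusing the same delete pattern as before (something like the following; the pod names are the ones from the listing above and will differ in your cluster):
$ kubectl delete pod immichprod-typesense-55bf4bd8d6-sg49p immichprod-server-c6698b848-7nlj4 immichprod-microservices-5b969c856d-2wnq8 -n immich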
Then I watched until everything came back online:
$ kubectl get pods -n immich
NAME READY STATUS RESTARTS AGE
immichprod-redis-master-0 1/1 Running 0 5d18h
immichprod-web-dd4b87954-g2n56 1/1 Running 0 21d
immichprod-machine-learning-75d84dd96f-rzswr 1/1 Running 0 5d19h
immichprod-postgresql-0 1/1 Running 0 22d
immichprod-proxy-764698ffb4-ncmst 1/1 Running 12 (4d16h ago) 5d19h
ubuntu-pod 1/1 Running 0 5m28s
immichprod-microservices-5b969c856d-7mcbn 1/1 Running 0 2m44s
immichprod-typesense-55bf4bd8d6-c67dk 1/1 Running 0 65s
immichprod-server-c6698b848-sh2nr 1/1 Running 2 (32s ago) 2m38s
Now we are back to good!
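One last bit of housekeeping: the temporary utility pod isn’t needed anymore, so it can be removed:
$ kubectl delete pod ubuntu-pod -n immich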
Ansible Checks with Bash
I didn’t want to just leave it there. I really should keep an eye out for failing systems.
I could write a check in Datadog on pods; that would work.
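As a sketch, a monitor query along these lines should do it, assuming the Datadog Agent is already shipping kube-state metrics (the exact metric name and tags can vary by Agent setup):
max(last_10m):max:kubernetes_state.container.status_report.count.waiting{reason:crashloopbackoff} by {kube_namespace,pod_name} >= 1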
The other, more generic approach might be to use a scheduled job in AWX.
First, I’ll need a script that can watch for 4xx and 5xx responses from a website. I added some email notifications using Resend.dev (see the Tools Roundup post on Resend.dev).
$ cat ./checkWebsiteAndPage.sh
#!/bin/bash
#
# checkWebsiteAndPage.sh
# Usage: ./checkWebsiteAndPage.sh <URL> <Resend API token> <from email> <to email>
# Checks a URL for 5xx/4xx responses and emails an alert via Resend.dev if one is found.

URL=$1
# Resend.dev
RESTKN=$2
FROME=$3
TOE=$4

# Build up the Resend email payload, leaving the HTML body open until we know the error class
echo '{"from":"' | tr -d '\n' > payload.json
echo "$FROME" | tr -d '\n' >> payload.json
echo '", "to": "' | tr -d '\n' >> payload.json
echo "$TOE" | tr -d '\n' >> payload.json
echo '", "subject": "' | tr -d '\n' >> payload.json
echo "$URL" | tr -d '\n' >> payload.json
echo ' in error", "html": "<h1>' | tr -d '\n' >> payload.json
echo "$URL" | tr -d '\n' >> payload.json

# Note: the grep assumes the endpoint answers with an HTTP/1.1 status line
if curl -k -I "$URL" 2>&1 | grep -q "HTTP/1.1 5"; then
  echo "$URL Website is returning a 5xx error"
  echo ' generated a 5xx code</h1>"' >> payload.json
  echo '}' >> payload.json
  cat payload.json
  curl -X POST -H "Authorization: Bearer $RESTKN" -H 'Content-Type: application/json' -d @payload.json 'https://api.resend.com/emails'
  exit
else
  echo "$URL Website is not returning a 5xx error"
fi

if curl -k -I "$URL" 2>&1 | grep -q "HTTP/1.1 4"; then
  echo "$URL Website is returning a 4xx error"
  echo ' generated a 4xx code</h1>"' >> payload.json
  echo '}' >> payload.json
  cat payload.json
  curl -X POST -H "Authorization: Bearer $RESTKN" -H 'Content-Type: application/json' -d @payload.json 'https://api.resend.com/emails'
  exit
else
  echo "$URL Website is not returning a 4xx error"
fi
I can do a quick test of a dead service:
$ ./checkWebsiteAndPage.sh https://gbwebui.freshbrewed.science re_xxxxxxxxxxxxxxxxxxxx onboarding@resend.dev isaac.johnson@gmail.com
https://gbwebui.freshbrewed.science Website is returning a 5xx error
{"from":"onboarding@resend.dev", "to": "isaac.johnson@gmail.com", "subject": "https://gbwebui.freshbrewed.science in error", "html": "<h1>https://gbwebui.freshbrewed.science generated a 5xx code</h1>"
}
This fired an email as expected.
Next, I’ll make a new playbook that uses that script:
$ cat checkForBustedInternalSites.yaml
- name: Check for Broken Services
  hosts: all
  tasks:
  - name: Transfer the script
    copy: src=checkWebsiteAndPage.sh dest=/tmp mode=0755
  - name: Check Immich
    ansible.builtin.shell: |
      ./checkWebsiteAndPage.sh https://photos.freshbrewed.science
    args:
      chdir: /tmp
  - name: Check GBWebUI
    ansible.builtin.shell: |
      # Known fail - to test
      ./checkWebsiteAndPage.sh https://gbwebui.freshbrewed.science
    args:
      chdir: /tmp
I created a Job Template in AWX to use it, then launched it. It successfully sent an email on the check.
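For what it’s worth, the same launch can be kicked off from the awx CLI instead of the UI; something like the following, where the template name is whatever you called yours (the name here is hypothetical):
$ awx job_templates launch "checkForBustedInternalSites" --monitor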
Maintaining a large hardcoded list seems fraught with problems. Let’s change this to a loop:
- name: Check for Broken Services
  hosts: all
  tasks:
  - name: Transfer the script
    copy: src=checkWebsiteAndPage.sh dest=/tmp mode=0755
  - name: Check Immich
    ansible.builtin.shell: |
      ./checkWebsiteAndPage.sh {{ item }} {{ RESENDTOKEN }} {{ FROMEMAIL }} {{ TOEMAIL }}
    loop: "{{ hostlist.split(',') }}"
    args:
      chdir: /tmp
Then I updated the template vars. Note that I needed to include the protocol in that variable to make it work:
hostlist: https://photos.freshbrewed.science,https://gbwebui.freshbrewed.science
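For reference, the extra variables on the template end up looking something like this (values pulled from the test run above; the token is redacted):
hostlist: https://photos.freshbrewed.science,https://gbwebui.freshbrewed.science
RESENDTOKEN: re_xxxxxxxxxxxxxxxxxxxx
FROMEMAIL: onboarding@resend.dev
TOEMAIL: isaac.johnson@gmail.com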
Next, I want PagerDuty to alert me if there is an issue, which means fetching the integration email address from the PagerDuty service.
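Once I have that integration address, it simply replaces TOEMAIL in the template vars (the address below is illustrative, not my real one):
TOEMAIL: my-immich-service@mysubdomain.pagerduty.com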
Next, I did one more test with the gbwebui endpoint I knew would fail, which triggered alerts.
Now I want to finish by narrowing the list to just the services I care to monitor and switching to my monitoring email.
I’ll then add a schedule, set to daily, which I can see once saved.
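If you’d rather script that than click through the UI, an AWX schedule is just an iCal RRULE attached to the template; a POST along these lines should be equivalent (host, credentials, and template ID are placeholders):
$ curl -s -k -u admin:$AWXPASS -H 'Content-Type: application/json' \
  -d '{"name": "Daily site check", "rrule": "DTSTART;TZID=America/Chicago:20231126T060000 RRULE:FREQ=DAILY;INTERVAL=1"}' \
  https://awx.example.com/api/v2/job_templates/42/schedules/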
You can get both the script and playbook in my public Ansible library.
I should point out that you do need to verify a domain in Resend to send to recipients beyond your own address (like the PagerDuty email integration).
I personally have been using Gandi for the last 15+ years for cheap domains.
One could also send emails using SendGrid or any other outgoing SMTP service.
Summary
My services took a dive, which happens from time to time. But it made me feel a bit silly calling myself an SRE Architect when I’m not actually doing the Site Reliability Engineering work on my own stuff.
Fixing the problem was a must, and that meant quickly mounting the affected PVC (or local volume) into a temporary pod to clear the stale LOCK files. But beyond just getting the stuck system up and running, I should have had better monitoring in place to catch this in the first place.
To catch it sooner next time, I created the script and playbook to run on a schedule tied to PagerDuty. It’s low-tech but a reliable way to check (provided, of course, AWX itself doesn’t go down).