Immich Fix

Published: Nov 25, 2023 by Isaac Johnson

Recently, I hopped on my Immich site and found that a crashing pod was holding the whole system up. I decided to write up how to find and fix it, since this has happened to me before on other containerized services that have local file dependencies.

It was also a good opportunity to build out a basic site watcher using scheduled jobs in AWX (with bash, PagerDuty and Resend.dev).

The Problem

Simply put, I found the site was down.

/content/images/2023/11/immich-01.png

I panicked that it was due to a cluster crash, which happens at times.

Looking into the pods in the namespace:

$ kubectl get pods -n immich
NAME                                           READY   STATUS             RESTARTS           AGE
immichprod-redis-master-0                      1/1     Running            0                  5d18h
immichprod-web-dd4b87954-g2n56                 1/1     Running            0                  21d
immichprod-machine-learning-75d84dd96f-rzswr   1/1     Running            0                  5d18h
immichprod-postgresql-0                        1/1     Running            0                  21d
immichprod-proxy-764698ffb4-ncmst              1/1     Running            12 (4d16h ago)     5d18h
immichprod-typesense-55bf4bd8d6-z77zc          0/1     CrashLoopBackOff   1633 (109s ago)    5d18h
immichprod-server-c6698b848-lxs59              0/1     Running            2132 (5m56s ago)   5d18h
immichprod-microservices-5b969c856d-2wnq8      1/1     Running            917 (5m43s ago)    5d18h

I could see two were stuck in a crashloop.

My first move is always to just rotate them and see if they come back up:

$ kubectl delete pod immichprod-typesense-55bf4bd8d6-z77zc -n immich && kubectl delete pod immichprod-server-c6698b848-lxs59 -n immich
pod "immichprod-typesense-55bf4bd8d6-z77zc" deleted
pod "immichprod-server-c6698b848-lxs59" deleted

$ kubectl get pods -n immich
NAME                                           READY   STATUS             RESTARTS          AGE
immichprod-redis-master-0                      1/1     Running            0                 5d18h
immichprod-web-dd4b87954-g2n56                 1/1     Running            0                 21d
immichprod-machine-learning-75d84dd96f-rzswr   1/1     Running            0                 5d18h
immichprod-postgresql-0                        1/1     Running            0                 21d
immichprod-proxy-764698ffb4-ncmst              1/1     Running            12 (4d16h ago)    5d18h
immichprod-microservices-5b969c856d-2wnq8      1/1     Running            917 (7m10s ago)   5d18h
immichprod-server-c6698b848-7nlj4              0/1     Running            0                 46s
immichprod-typesense-55bf4bd8d6-sg49p          0/1     CrashLoopBackOff   2 (15s ago)       48s

The logs showed it was due to a local filesystem DB whose lock file was left behind when a pod crashed (likely during a forced reboot):

$ kubectl logs immichprod-typesense-55bf4bd8d6-sg49p -n immich
I20231121 11:49:56.728648     1 typesense_server_utils.cpp:357] Starting Typesense 0.24.0
I20231121 11:49:56.728678     1 typesense_server_utils.cpp:360] Typesense is using jemalloc.
I20231121 11:49:56.728938     1 typesense_server_utils.cpp:409] Thread pool size: 32
I20231121 11:49:56.732545     1 store.h:63] Initializing DB by opening state dir: /tsdata/db
E20231121 11:49:56.744969     1 store.h:67] Error while initializing store: IO error: While lock file: /tsdata/db/LOCK: Resource temporarily unavailable
E20231121 11:49:56.744992     1 store.h:69] It seems like the data directory /tsdata/db is already being used by another Typesense server.
E20231121 11:49:56.744998     1 store.h:71] If you are SURE that this is not the case, delete the LOCK file in the data db directory and try again.
I20231121 11:49:56.745041     1 store.h:63] Initializing DB by opening state dir: /tsdata/meta
E20231121 11:49:56.768613     1 store.h:67] Error while initializing store: IO error: While lock file: /tsdata/meta/LOCK: Resource temporarily unavailable
E20231121 11:49:56.768626     1 store.h:69] It seems like the data directory /tsdata/meta is already being used by another Typesense server.
E20231121 11:49:56.768631     1 store.h:71] If you are SURE that this is not the case, delete the LOCK file in the data db directory and try again.
E20231121 11:49:56.894925     1 backward.hpp:4199] Stack trace (most recent call last):
E20231121 11:49:56.894938     1 backward.hpp:4199] #5    Object "/opt/typesense-server", at 0x4c8f50, in _start
E20231121 11:49:56.894943     1 backward.hpp:4199] #4    Object "/usr/lib/x86_64-linux-gnu/libc-2.31.so", at 0x7fe50f528082, in __libc_start_main
E20231121 11:49:56.894946     1 backward.hpp:4199] #3    Source "/typesense/src/main/typesense_server.cpp", line 160, in main [0x4999b2]
E20231121 11:49:56.894950     1 backward.hpp:4199] #2    Source "/typesense/src/typesense_server_utils.cpp", line 452, in run_server [0x6f2d8c]
E20231121 11:49:56.894954     1 backward.hpp:4199] #1    Source "/typesense/src/ratelimit_manager.cpp", line 429, in init [0x6cc5a8]
E20231121 11:49:56.894968     1 backward.hpp:4199] #0    Source "/typesense/include/store.h", line 154, in get [0x4ec6ef]
Segmentation fault (Address not mapped to object [(nil)])
E20231121 11:49:57.017424     1 typesense_server.cpp:102] Typesense 0.24.0 is terminating abruptly.

I’ve had this happen before - never fun. But this is solvable.

First, let’s get the YAML for the pod

$ kubectl get pod immichprod-typesense-55bf4bd8d6-sg49p -n immich -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-11-21T11:49:24Z"
  generateName: immichprod-typesense-55bf4bd8d6-
  labels:
    app.kubernetes.io/instance: immichprod
    app.kubernetes.io/name: typesense
    pod-template-hash: 55bf4bd8d6
  name: immichprod-typesense-55bf4bd8d6-sg49p
  namespace: immich
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: immichprod-typesense-55bf4bd8d6
    uid: f3e9941f-61f1-41eb-aafa-d5eff526f4a6
  resourceVersion: "270227029"
  uid: 4e7d3022-468e-4855-800d-75d45863a8bc
spec:
  automountServiceAccountToken: true
  containers:
  - env:
    - name: DB_DATABASE_NAME
      value: immich
    - name: DB_HOSTNAME
      value: immichprod-postgresql
    - name: DB_PASSWORD
      value: notthepassword
    - name: DB_USERNAME
      value: immich
    - name: IMMICH_MACHINE_LEARNING_URL
      value: http://immichprod-machine-learning:3003
    - name: IMMICH_SERVER_URL
      value: http://immichprod-server:3001
    - name: IMMICH_WEB_URL
      value: http://immichprod-web:3000
    - name: REDIS_HOSTNAME
      value: immichprod-redis-master
    - name: TYPESENSE_API_KEY
      value: typesense
    - name: TYPESENSE_DATA_DIR
      value: /tsdata
    - name: TYPESENSE_ENABLED
      value: "true"
    - name: TYPESENSE_HOST
      value: immichprod-typesense
    image: docker.io/typesense/typesense:0.24.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: http
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: immichprod-typesense
    ports:
    - containerPort: 8108
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: http
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tsdata
      name: tsdata
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-896hw
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: builder-hp-elitebook-850-g2
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: tsdata
    persistentVolumeClaim:
      claimName: immichprod-typesense-tsdata
  - name: kube-api-access-896hw
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-11-21T11:49:27Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-11-21T11:49:27Z"
    message: 'containers with unready status: [immichprod-typesense]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-11-21T11:49:27Z"
    message: 'containers with unready status: [immichprod-typesense]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-11-21T11:49:26Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://b9665cc80044ef1efa538949e8cf0277587c7ad999ff3cb72d26e4acabf132a6
    image: docker.io/typesense/typesense:0.24.0
    imageID: docker.io/typesense/typesense@sha256:3cc1251f09ef6c75a5b1f2751c04e7265c770c0f2b69cba1f9a9f20da57cfa28
    lastState:
      terminated:
        containerID: containerd://b9665cc80044ef1efa538949e8cf0277587c7ad999ff3cb72d26e4acabf132a6
        exitCode: 1
        finishedAt: "2023-11-21T11:51:10Z"
        reason: Error
        startedAt: "2023-11-21T11:51:09Z"
    name: immichprod-typesense
    ready: false
    restartCount: 4
    started: false
    state:
      waiting:
        message: back-off 1m20s restarting failed container=immichprod-typesense pod=immichprod-typesense-55bf4bd8d6-sg49p_immich(4e7d3022-468e-4855-800d-75d45863a8bc)
        reason: CrashLoopBackOff
  hostIP: 192.168.1.215
  phase: Running
  podIP: 10.42.3.148
  podIPs:
  - ip: 10.42.3.148
  qosClass: BestEffort
  startTime: "2023-11-21T11:49:27Z"
  

The parts we care about are the volumeMounts in the spec:

    volumeMounts:
    - mountPath: /tsdata
      name: tsdata 

and the source of those volumeMounts, the volumes:

  volumes:
  - name: tsdata
    persistentVolumeClaim:
      claimName: immichprod-typesense-tsdata
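
Before mounting that claim anywhere else, it doesn't hurt to confirm it exists and is Bound (optional; your storage class and size will differ):

$ kubectl get pvc immichprod-typesense-tsdata -n immich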

I’ll create a quick utility pod that we can use to fix this:

$ cat immichpod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-pod
spec:
  containers:
  - name: ubuntu-container
    image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    volumeMounts:
    - name: immichprod-typesense-tsdata
      mountPath: /tsdata
  volumes:
  - name: immichprod-typesense-tsdata
    persistentVolumeClaim:
      claimName: immichprod-typesense-tsdata

$ kubectl apply -f immichpod.yaml -n immich
pod/ubuntu-pod created

Next, I’ll exec into the pod

$ kubectl get pods -n immich
NAME                                           READY   STATUS             RESTARTS         AGE
immichprod-redis-master-0                      1/1     Running            0                5d18h
immichprod-web-dd4b87954-g2n56                 1/1     Running            0                21d
immichprod-machine-learning-75d84dd96f-rzswr   1/1     Running            0                5d19h
immichprod-postgresql-0                        1/1     Running            0                22d
immichprod-proxy-764698ffb4-ncmst              1/1     Running            12 (4d16h ago)   5d19h
immichprod-typesense-55bf4bd8d6-sg49p          0/1     CrashLoopBackOff   7 (68s ago)      12m
ubuntu-pod                                     1/1     Running            0                23s
immichprod-microservices-5b969c856d-2wnq8      0/1     CrashLoopBackOff   918 (21s ago)    5d19h
immichprod-server-c6698b848-7nlj4              0/1     CrashLoopBackOff   7 (6s ago)       12m

$ kubectl exec -it ubuntu-pod -n immich -- /bin/bash
root@ubuntu-pod:/#

I’ll remove the lock files

$ kubectl exec -it ubuntu-pod -n immich -- /bin/bash
root@ubuntu-pod:/# ls /tsdata/db/
000975.log  CURRENT   LOCK  LOG.old.1700567469760409  LOG.old.1700567729831999  MANIFEST-000004  archive
000977.sst  IDENTITY  LOG   LOG.old.1700567558745219  LOG.old.1700568039807357  OPTIONS-000007
root@ubuntu-pod:/# rm -f /tsdata/db/LOCK
root@ubuntu-pod:/# rm -f /tsdata/meta/LOCK
root@ubuntu-pod:/# exit
exit

Then I rotate the crashed pods once more and watch until they come back online.
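
That’s the same delete-and-let-the-ReplicaSet-recreate step as before; a sketch using the pod names from the listing above (your pod suffixes will differ):

$ kubectl delete pod immichprod-typesense-55bf4bd8d6-sg49p immichprod-server-c6698b848-7nlj4 immichprod-microservices-5b969c856d-2wnq8 -n immich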

$ kubectl get pods -n immich
NAME                                           READY   STATUS    RESTARTS         AGE
immichprod-redis-master-0                      1/1     Running   0                5d18h
immichprod-web-dd4b87954-g2n56                 1/1     Running   0                21d
immichprod-machine-learning-75d84dd96f-rzswr   1/1     Running   0                5d19h
immichprod-postgresql-0                        1/1     Running   0                22d
immichprod-proxy-764698ffb4-ncmst              1/1     Running   12 (4d16h ago)   5d19h
ubuntu-pod                                     1/1     Running   0                5m28s
immichprod-microservices-5b969c856d-7mcbn      1/1     Running   0                2m44s
immichprod-typesense-55bf4bd8d6-c67dk          1/1     Running   0                65s
immichprod-server-c6698b848-sh2nr              1/1     Running   2 (32s ago)      2m38s

Now we are back to good!

/content/images/2023/11/immich-02.png
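
With everything healthy, I’d also suggest deleting the temporary utility pod so it isn’t left holding a mount on the PVC:

$ kubectl delete pod ubuntu-pod -n immich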

Ansible Checks with Bash

I didn’t want to just leave it there. I really should keep an eye out for failing systems.

I could write a check in Datadog on pods; that would work.

The other more generic approach might be to use a scheduled job in AWX.

First, I’ll need a script that can watch for 4xx and 5xx responses from a website. I added in some email notifications using resend.dev (see the Tools Roundup post on Resend.dev).

$ cat ./checkWebsiteAndPage.sh
#!/bin/bash

# URL to check
URL=$1

# Resend.dev API token, from address, and to address
RESTKN=$2
FROME=$3
TOE=$4

# Build the start of the Resend email payload; the html field is left open
# so the error branches below can finish it with the right status message
echo '{"from":"' | tr -d '\n' > payload.json
echo $FROME | tr -d '\n' >> payload.json
echo '", "to": "' | tr -d '\n' >> payload.json
echo $TOE | tr -d '\n' >> payload.json
echo '", "subject": "' | tr -d '\n' >> payload.json
echo $URL | tr -d '\n' >> payload.json
echo ' in error", "html": "<h1>' | tr -d '\n' >> payload.json
echo $URL | tr -d '\n' >> payload.json

# Alert on 5xx responses
if curl -k -I $URL 2>&1 | grep -q "HTTP/1.1 5"; then
   echo "$URL Website is returning a 5xx error"

   # Close out the payload and send the email via Resend
   echo ' generated a 5xx code</h1>"' >> payload.json
   echo '}' >> payload.json

   cat payload.json
   curl -X POST -H "Authorization: Bearer $RESTKN" -H 'Content-Type: application/json' -d @payload.json 'https://api.resend.com/emails'
   exit
else
   echo "$URL Website is not returning a 5xx error"
fi

# Alert on 4xx responses
if curl -k -I $URL 2>&1 | grep -q "HTTP/1.1 4"; then
   echo "$URL Website is returning a 4xx error"

   # Close out the payload and send the email via Resend
   echo ' generated a 4xx code</h1>"' >> payload.json
   echo '}' >> payload.json

   cat payload.json
   curl -X POST -H "Authorization: Bearer $RESTKN" -H 'Content-Type: application/json' -d @payload.json 'https://api.resend.com/emails'
   exit
else
   echo "$URL Website is not returning a 4xx error"
fi
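
As an aside, if jq happens to be available wherever the check runs, the same payload could be assembled in one shot instead of the echo/tr dance. This is just a sketch, not what the playbook below actually uses:

# hypothetical alternative: build payload.json with jq
jq -n --arg from "$FROME" --arg to "$TOE" --arg url "$URL" \
  '{from: $from, to: $to, subject: ($url + " in error"), html: ("<h1>" + $url + " generated a 5xx code</h1>")}' > payload.json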

I can do a quick test of a dead service:

$ ./checkWebsiteAndPage.sh https://gbwebui.freshbrewed.science re_xxxxxxxxxxxxxxxxxxxx onboarding@resend.dev isaac.johnson@gmail.com
https://gbwebui.freshbrewed.science Website is returning a 5xx error
{"from":"onboarding@resend.dev", "to": "isaac.johnson@gmail.com", "subject": "https://gbwebui.freshbrewed.science in error", "html": "<h1>https://gbwebui.freshbrewed.science generated a 5xx code</h1>"
}

which fired an email

/content/images/2023/11/immich-03.png

Next, I’ll make a new playbook that can use that script

$ cat checkForBustedInternalSites.yaml
- name: Check for Broken Services
  hosts: all

  tasks:
  - name: Transfer the script
    copy: src=checkWebsiteAndPage.sh dest=/tmp mode=0755

  - name: Check Immich
    ansible.builtin.shell: |
      ./checkWebsiteAndPage.sh  https://photos.freshbrewed.science   
    args:
      chdir: /tmp

  - name: Check GBWebUI
    ansible.builtin.shell: |
      # Known fail - to test
      ./checkWebsiteAndPage.sh  https://gbwebui.freshbrewed.science   
    args:
      chdir: /tmp

I’ll create a template to use it

/content/images/2023/11/immich-04.png

then launch

/content/images/2023/11/immich-05.png

It successfully sent an email on the check

/content/images/2023/11/immich-06.png

Maintaining a large hardcoded list seems fraught with problems. Let’s change this to a loop

- name: Check for Broken Services
  hosts: all

  tasks:
  - name: Transfer the script
    copy: src=checkWebsiteAndPage.sh dest=/tmp mode=0755

  - name: Check Immich
    ansible.builtin.shell: |
      ./checkWebsiteAndPage.sh  {{ item }} {{ RESENDTOKEN }} {{ FROMEMAIL }} {{ TOEMAIL }}
    loop: "{{ hostlist.split(',') }}"
    args:
      chdir: /tmp
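
To sanity-check the loop outside of AWX, the same playbook can also be run locally with the variables passed as extra vars (all values below are placeholders):

$ ansible-playbook -i localhost, -c local checkForBustedInternalSites.yaml \
    -e "hostlist=https://site1.example.com,https://site2.example.com" \
    -e "RESENDTOKEN=re_xxxxxxxxxxxxxxxxxxxx" \
    -e "FROMEMAIL=onboarding@resend.dev" \
    -e "TOEMAIL=me@example.com"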

Then update the template vars

/content/images/2023/11/immich-07.png

Actually, I needed to add the protocol in that variable to make it work:

hostlist: https://photos.freshbrewed.science,https://gbwebui.freshbrewed.science
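
For reference, the full set of template variables ends up looking roughly like this (token redacted, values matching the earlier test):

hostlist: https://photos.freshbrewed.science,https://gbwebui.freshbrewed.science
RESENDTOKEN: re_xxxxxxxxxxxxxxxxxxxx
FROMEMAIL: onboarding@resend.dev
TOEMAIL: isaac.johnson@gmail.com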

Next, I want PagerDuty to alert me if there is an issue, so I need to fetch the integration email address from the PagerDuty service.

/content/images/2023/11/immich-08.png
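
The address PagerDuty generates for the email integration (typically something@your-subdomain.pagerduty.com) then simply becomes the TOEMAIL template variable; a hypothetical example:

TOEMAIL: immich-checks@myaccount.pagerduty.com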

Next, I’ll do one more test with the gbwebui endpoint i know will fail

/content/images/2023/11/immich-09.png

which triggered alerts

/content/images/2023/11/immich-10.png

Now I want to finish by setting just the services I care about monitoring and using my monitoring email.

/content/images/2023/11/immich-11.png

I’ll then add a schedule

/content/images/2023/11/immich-12.png

Set to daily

/content/images/2023/11/immich-13.png

which I can see set once saved

/content/images/2023/11/immich-14.png
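
Under the hood, AWX stores that schedule as an iCal recurrence rule; a daily run looks something like this (the start time and zone here are made up):

DTSTART;TZID=America/Chicago:20231125T080000 RRULE:FREQ=DAILY;INTERVAL=1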

You can get both the script and playbook in my public Ansible library.

I should point out that you do need to verify a domain in Resend to send to addresses other than your own (like the PagerDuty email integration).

/content/images/2023/11/immich-15.png

I personally have been using Gandi for the last 15+ years for cheap domains.

One could also send emails using SendGrid or any other outgoing SMTP service.

Summary

My services took a dive, which happens from time to time. But it made me feel a bit silly calling myself an SRE Architect when I’m not actually doing the Site Reliability Engineering work on my own stuff.

Fixing the problem was a must, and that meant quickly mounting the affected PVC (or local volume) into a temporary pod so I could clear the stale lock files. But beyond just getting the stuck system up and running, I should have had better monitoring in place to catch this in the first place.

To catch it sooner next time, I created the script and playbook to run on a schedule tied to PagerDuty. It’s a low-tech but reliable way to check (provided, of course, AWX doesn’t go down).

