Datadog and PagerDuty: Infrastructure Alerts

Published: Oct 8, 2024 by Isaac Johnson

Early this morning my PagerDuty went off (though I had set the schedule to just text me rather than wake me).

/content/images/2024/10/datadog-01.png

After breakfast, I went to check my host only to find it was doing just dandy:

/content/images/2024/10/datadog-02.jpg

Curious… So I sat at my desk and noted the alert light was on (so the alert definitely came from Datadog)

/content/images/2024/10/datadog-03.jpg

… Yet, logging in was fine

builder@DESKTOP-QADGF36:~/Workspaces/rustpad$ ssh builder@192.168.1.100
Welcome to Ubuntu 22.04.2 LTS (GNU/Linux 6.8.0-40-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

Expanded Security Maintenance for Applications is not enabled.

184 updates can be applied immediately.
To see these additional updates run: apt list --upgradable

17 additional security updates can be applied with ESM Apps.
Learn more about enabling ESM Apps service at https://ubuntu.com/esm


The list of available updates is more than a week old.
To check for new updates run: sudo apt update
*** System restart required ***
Last login: Tue Sep 17 05:57:28 2024 from 192.168.1.160
builder@builder-T100:~$ uptime
 06:00:46 up 15 days, 11:00,  2 users,  load average: 2.92, 2.82, 3.00

I checked PagerDuty at this point

/content/images/2024/10/datadog-05.png

Which showed me a momentary outage (or failure to report from the DD agent) around 2:43a

/content/images/2024/10/datadog-06.png

I was convinced I would need to tweak my rule, so I started writing this writeup on false positives and retry logic… However, when I went to upload an image to Filegator, I was suddenly clued in to the real issue

/content/images/2024/10/datadog-04.jpg

I jumped back to the terminal and checked disk space

builder@builder-T100:~$ df -h
Filesystem                                      Size  Used Avail Use% Mounted on
tmpfs                                           766M  8.1M  758M   2% /run
/dev/sda2                                       234G  222G     0 100% /
tmpfs                                           3.8G   21M  3.8G   1% /dev/shm
tmpfs                                           5.0M  4.0K  5.0M   1% /run/lock
efivarfs                                        192K   98K   90K  53% /sys/firmware/efi/efivars
/dev/sda1                                       511M  6.1M  505M   2% /boot/efi
192.168.1.116:/volume1/docker-filestation       8.8T  1.4T  7.4T  16% /mnt/filestation
192.168.1.116:/volume1/k3sPrimary01             8.8T  1.4T  7.4T  16% /mnt/sirnasilotk3sprimary01
192.168.1.129:/volume1/postgres-prod-dbbackups  5.5T  2.8T  2.8T  51% /mnt/psqlbackups
192.168.1.129:/volume1/linuxbackups             5.5T  2.8T  2.8T  51% /mnt/linuxbackups
tmpfs                                           766M  104K  766M   1% /run/user/1000
shm                                              64M     0   64M   0% /var/snap/microk8s/common/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3101a3ad60188dc1a8321e996843601932b6be5e73854206678e62565a69b948/shm
shm                                              64M     0   64M   0% /var/snap/microk8s/common/run/containerd/io.containerd.grpc.v1.cri/sandboxes/07851b89301029156e652729af347514e00afe2c2d7886b75c1ad528fd0838ac/shm
shm                                              64M     0   64M   0% /var/snap/microk8s/common/run/containerd/io.containerd.grpc.v1.cri/sandboxes/31276e3ae6ca90ab18bd862f2c5a630fc254a2f7c83308454b2cbbc19df1115e/shm
shm                                              64M     0   64M   0% /var/snap/microk8s/common/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8f15ea5a233ddddc4584917e57747628425226ad11c2fa263a06d62fa631914f/shm

Darf! It had run out of space!
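
As an aside, the gap between the 222G used and the 234G size, with 0 available, is mostly the ext4 reserved-blocks allowance that only root can touch. A quick way to check it:

# Show how many blocks ext4 reserves for root (the default is 5% of the filesystem)
sudo tune2fs -l /dev/sda2 | grep -i 'reserved block'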

My usual first step is to run some du commands to see if I can find the bad egg

builder@builder-T100:~$ sudo du -chs /*
0       /bin
275M    /boot
4.0K    /cdrom
16K     /dev
32M     /etc
123G    /home
0       /lib
0       /lib32
0       /lib64
0       /libx32
16K     /lost+found
4.0K    /media
3.3G    /mnt
2.9G    /opt
du: cannot read directory '/proc/26193/task/26193/net': Invalid argument
du: cannot read directory '/proc/26193/net': Invalid argument
du: cannot access '/proc/327122/task/327122/fd/4': No such file or directory
du: cannot access '/proc/327122/task/327122/fdinfo/4': No such file or directory
du: cannot access '/proc/327122/fd/3': No such file or directory
du: cannot access '/proc/327122/fdinfo/3': No such file or directory
du: cannot read directory '/proc/544309/task/544309/net': Invalid argument
du: cannot read directory '/proc/544309/net': Invalid argument
du: cannot read directory '/proc/546520/task/546520/net': Invalid argument
du: cannot read directory '/proc/546520/net': Invalid argument
0       /proc
49M     /root
8.2M    /run
0       /sbin
9.3G    /snap
4.0K    /srv
2.1G    /swapfile
0       /sys
11M     /tmp
8.0G    /usr
du: fts_read failed: /var/lib/docker/overlay2/7c82c2381db28d47332041e2d899e82b83b45dafa94e17b1c2501c8b5b166160/merged: No such file or directory
176G    total
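
Incidentally, a du variant that stays on the root filesystem and sorts the results can make the offender jump out faster; a minimal sketch, assuming GNU du and sort:

# -x keeps du on the root filesystem (skipping the NFS mounts), one level deep,
# then sort largest-first and show the top entries
sudo du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -n 15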

At the same time, I checked /var/log, which is often a culprit and low-hanging fruit to clean (but the only large file there was the journal)

builder@builder-T100:/var/log$ du -chs ./* 2>&1 | tail -n1
4.4G    total
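
Had the journal been the real offender, journalctl can trim it in place; a quick sketch:

# See how much space the journal is using
journalctl --disk-usage
# Shrink it to roughly 500MB (or use --vacuum-time=7d to keep only a week)
sudo journalctl --vacuum-size=500M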

I just kept diving in, looking for the source of the waste

builder@builder-T100:/home$ sudo du -chs ./*
121G    ./builder
1.7G    ./linuxbrew
8.0K    ./node
123G    total
builder@builder-T100:/home$ cd builder
builder@builder-T100:~$ sudo du -chs ./*

That’s when it jumped out

/content/images/2024/10/datadog-07.png

What was I thinking? A once per hour backup with no cleanup?!?!

/content/images/2024/10/datadog-08.png

I immediately checked GCP - yes, I was keeping all of them

/content/images/2024/10/datadog-09.png

Thankfully, though, I had a 30-day delete rule, so there was an upper limit in place

/content/images/2024/10/datadog-10.png
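
For reference, the same sort of 30-day delete rule can be applied from the CLI; a sketch, with the bucket name as a placeholder:

# lifecycle.json - delete objects older than 30 days
cat <<'EOF' > lifecycle.json
{"rule": [{"action": {"type": "Delete"}, "condition": {"age": 30}}]}
EOF
gsutil lifecycle set lifecycle.json gs://my-backup-bucket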

While I had an initial spike around Sept 23rd, I really haven’t been paying much extra since. That is, no GCP Budget spikes (which was my first worry)

/content/images/2024/10/datadog-11.png

Fixing

First, let’s confirm our Data Transfer service job

/content/images/2024/10/datadog-12.png

Indeed, it is set to weekly, starting Sept 16th. And key to note: it is not set to delete from the destination if I clean up the source:

/content/images/2024/10/datadog-13.png

There are two ways we could solve this. First, we could have DTS/STS delete the files after transfer:

/content/images/2024/10/datadog-14.png

However, if there is any reason DTS does not run, we’ll fill up again.

The real issue here, though, was the hourly backups

builder@builder-T100:~/sirnasilot-backups$ ls -l | wc -l
197

builder@builder-T100:~/sirnasilot-backups$ crontab -l
@reboot /sbin/swapoff -a
45 3 * * * tar -zcvf /mnt/filestation/vaultbackups.$(date '+\%Y-\%m-\%d_\%Hh\%Mm').tgz /home/builder/vaultwarden/data
16 6 * * * tar -zcvf /mnt/filestation/filegator.$(date '+\%Y-\%m-\%d_\%Hh\%Mm').tgz /home/builder/filegator/repository2
33 * * * * tar -czf /home/builder/sirnasilot-backups/sirnasilot_backup_$(date '+\%Y-\%m-\%d_\%Hh\%Mm').tgz /mnt/sirnasilotk3sprimary01

That never should have been 33 * * * *, as that schedule runs at 33 minutes past every hour, thus 24 files a day.
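
For anyone skimming, the first two cron fields are minute and hour, which is exactly where this went wrong:

# minute hour day-of-month month day-of-week
33 * * * *   # minute 33 of every hour  -> 24 runs a day
33 3 * * *   # 03:33 once a day         -> 1 run a day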

Knowing that I already had the backups in GCP buckets, I first remediated the immediate problem and wiped the existing tgz files

/content/images/2024/10/datadog-15.png

That remediated the first issue. I then set the backup cron to a more reasonable daily schedule

builder@builder-T100:~/sirnasilot-backups$ crontab -l
@reboot /sbin/swapoff -a
45 3 * * * tar -zcvf /mnt/filestation/vaultbackups.$(date '+\%Y-\%m-\%d_\%Hh\%Mm').tgz /home/builder/vaultwarden/data
16 6 * * * tar -zcvf /mnt/filestation/filegator.$(date '+\%Y-\%m-\%d_\%Hh\%Mm').tgz /home/builder/filegator/repository2
33 3 * * * tar -czf /home/builder/sirnasilot-backups/sirnasilot_backup_$(date '+\%Y-\%m-\%d_\%Hh\%Mm').tgz /mnt/sirnasilotk3sprimary01
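
Since the archives get copied up to GCS anyway, a small local retention job would be a nice belt-and-suspenders addition; a sketch, where the 7-day window and the 4:15am slot are my own picks rather than anything in the crontab above:

# Prune local backup archives older than 7 days (GCS keeps the longer history)
15 4 * * * find /home/builder/sirnasilot-backups -name 'sirnasilot_backup_*.tgz' -mtime +7 -delete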

Lastly, I decided to set the STS job to clean up the source when done.

/content/images/2024/10/datadog-16.png

Which we can now see

/content/images/2024/10/datadog-17.png
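
If you prefer the CLI over the console, the same option should be settable on the transfer job; a sketch, assuming the gcloud transfer CLI's --delete-from flag and a placeholder job name:

# Have the job remove source files once they have transferred successfully
gcloud transfer jobs update JOB_NAME --delete-from=source-after-transfer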

I did a quick check for any crashed containers on the Docker host, but everything that should be up was up, and everything that was down was down intentionally.

builder@builder-T100:~/sirnasilot-backups$ docker ps -a
CONTAINER ID   IMAGE                                                            COMMAND                   CREATED         STATUS                      PORTS                                                                                                                                                      NAMES
829f99ee3609   gcr.io/cloud-ingest/tsop-agent:latest                            "python3 ./autoupdat…"    2 weeks ago     Up 2 weeks                                                                                                                                                                             adoring_solomon
0c60f77715b2   mbixtech/observium:23.9                                          "/usr/bin/supervisor…"    2 weeks ago     Up 2 weeks                  0.0.0.0:8078->80/tcp, :::8078->80/tcp                                                                                                                      observium-app-1
43404004ce8c   mariadb:10.6.4                                                   "docker-entrypoint.s…"    2 weeks ago     Up 2 weeks                  3306/tcp                                                                                                                                                   observium-db-1
7bac56d1c497   filegator/filegator                                              "docker-php-entrypoi…"    3 weeks ago     Up 2 weeks                  80/tcp, 0.0.0.0:7098->8080/tcp, :::7098->8080/tcp                                                                                                          funny_banzai
1f0337971b9f   otel/opentelemetry-collector-contrib                             "/otelcol-contrib --…"    6 weeks ago     Exited (255) 4 weeks ago    55678-55679/tcp, 0.0.0.0:55681->55681/tcp, :::55681->55681/tcp, 0.0.0.0:54317->4317/tcp, :::54317->4317/tcp, 0.0.0.0:54318->4318/tcp, :::54318->4318/tcp   opentelemetry-collector
810a2c05f761   docmost/docmost:latest                                           "docker-entrypoint.s…"    2 months ago    Exited (1) 2 months ago                                                                                                                                                                docmost-docmost-1
66fc44ec76fa   postgres:16-alpine                                               "docker-entrypoint.s…"    2 months ago    Exited (0) 2 months ago                                                                                                                                                                docmost-db-1
44345e8c762f   redis:7.2-alpine                                                 "docker-entrypoint.s…"    2 months ago    Exited (0) 2 months ago

... snip ...

My last bit of cleanup was to make a note in PagerDuty as to why this happened, just for my own memory (the SRE term is an RCA, or root cause analysis)

/content/images/2024/10/datadog-18.png

And then turn off my desk alert

builder@builder-T100:~/sirnasilot-backups$ curl -X GET "https://kasarest.freshbrewed.science/off?devip=192.168.1.24&type=plug&apikey=$APIKEY"
<h1>Turning Off 192.168.1.24: Param plug
Param 192.168.1.24
Param off
Turning off OfficeLamp
Run from script
</h1>

(Sorry, there is no video… but I did get a satisfying little click as the Kasa plug switched off and my modified boat trailer light went dark.)

Datadog disk check

I should have been alerted to this much sooner. Seems like a really simple miss.

I can see Datadog knows about my volumes

/content/images/2024/10/datadog-19.png

If we open it in the Host dashboard, we can clearly see it filled up (and is now a bit better)

/content/images/2024/10/datadog-20.png

I’ll go to monitors and make a new Metric monitor

/content/images/2024/10/datadog-21.png

Note that I’m looking at a particular drive, with the limiter on the hostname (first green arrow), and that the alert and warning thresholds are in bytes (second arrow), so you have to use rather large integers to express thresholds in GiB

/content/images/2024/10/datadog-22.png

I’m going to set it to page me with PagerDuty and also email me. I decided to set a re-alert in case I miss it the first time

/content/images/2024/10/datadog-23.png
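
For the record, the same monitor could be created through the Datadog API rather than the UI; a rough sketch, assuming the system.disk.free metric (in bytes), my host and device tags, and placeholder keys, thresholds, and notification handles:

curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d '{
    "name": "Dockerhost root volume low on space",
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.disk.free{host:builder-t100,device:/dev/sda2} < 10000000000",
    "message": "Root volume is nearly full on {{host.name}} @pagerduty @my-email@example.com",
    "options": {
      "thresholds": {"critical": 10000000000, "warning": 25000000000},
      "renotify_interval": 60
    }
  }'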

Normally I test notifications before saving, but I know (from this morning) that the Datadog and PagerDuty connection is working just dandy. So I saved it and can now see the active Dockerhost monitor is set up.

/content/images/2024/10/datadog-24.png

Summary

This was a quick post. We were alerted to a short outage at night via PagerDuty. Following that up, we quickly determined the host was live, but upon further investigation we saw that it had a completely full root volume.

We found and fixed the space issue and corrected the underlying backup system, both modifying the crontab and changing the delete setting for the GCP Transfer Job.

Lastly, we created a new metric-based host disk space alert in Datadog so that if it ever gets close again, we get alerted.

Datadog Pagerduty GCP Alerts Docker

Have something to add? Feedback? You can use the feedback form
