Cast.ai for K8s Cost Optimization

Published: Apr 11, 2023 by Isaac Johnson

Cast.ai came up in a search for open-source cost management tools I have yet to try. It isn't open-source, though at least as of a couple of years ago the founder promised to make it free for OSS projects. I decided to give it a spin since they do have a rather usable free tier.

I’ll start off by saying this, at present, is rather married to the big three: AWS, Azure, and Google. I’ll cover some rigging to apply it to your on-prem K3s, but that is not what it is designed to manage.

Cast.ai is far more than just a cost-watching tool (like Ternary). It focuses solely on Kubernetes, but also includes a security scanning component as well as a custom autoscaler. They bill themselves as “the leading all-in-one platform for Kubernetes automation, optimization, security, and cost management.”

Let’s sign up and see what we can do!

Signup

Let’s sign up for Cast for free from the login page

/content/images/2023/04/cast-01.png

It then has a wizard to ask me some basic info

/content/images/2023/04/cast-02.png

The constant ads and testimonials make me a bit sus…

/content/images/2023/04/why-are-you-trying-so-hard-why.gif

Right off the bat they added a (fake?) demo cluster so I could see their product in action

/content/images/2023/04/cast-03.png

I’ll walk through the “Connect your cluster” wizard, but I was surprised that they only offer three cloud options. Granted, it is the big three. But what about on-prem or other providers like CIVO, Akamai (Linode), or IBM?

/content/images/2023/04/cast-04.png

It gives a script to run

/content/images/2023/04/cast-05.png

$ curl -H "Authorization: Token asdfasdfasdfsadfasdfasdfasdfasdfasdfasdfasdfasdfasdfasdfasdf" "https://api.cast.ai/v1/agent.yaml?provider=aks" | kubectl apply -f -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11276    0 11276    0     0  19644      0 --:--:-- --:--:-- --:--:-- 19644
namespace/castai-agent created
resourcequota/castai-agent-critical-pods created
serviceaccount/castai-agent created
secret/castai-agent created
configmap/castai-agent-autoscaler created
clusterrole.rbac.authorization.k8s.io/castai-agent created
clusterrolebinding.rbac.authorization.k8s.io/castai-agent created
role.rbac.authorization.k8s.io/castai-agent created
rolebinding.rbac.authorization.k8s.io/castai-agent created
deployment.apps/castai-agent-cpvpa created
deployment.apps/castai-agent created

It will then try to connect

/content/images/2023/04/cast-06.png

Because I lied and actually ran this on K3s, there was a consequence: the agent pod crashes because of missing environment variables

$ kubectl logs -n castai-agent  castai-agent-6b854bcd6c-nbjs8
time="2023-04-08T20:27:11Z" level=info msg="running agent version: GitCommit=\"2df460e1e2167de789384209344f7d81ad917b1f\" GitRef=\"refs/tags/v0.42.3\" Version=\"v0.42.3\"" version=v0.42.3
time="2023-04-08T20:27:11Z" level=info msg="platform URL: https://api.cast.ai" version=v0.42.3
time="2023-04-08T20:27:11Z" level=info msg="starting healthz on port: 9876" version=v0.42.3
time="2023-04-08T20:27:11Z" level=info msg="using provider \"aks\"" provider=aks version=v0.42.3
time="2023-04-08T20:27:11Z" level=info msg="starting with leader election" own_identity=b3cbb0a3-23fb-4d2f-9d7d-8fe6a92c4098 provider=aks version=v0.42.3
I0408 20:27:11.750194       1 leaderelection.go:248] attempting to acquire leader lease castai-agent/agent-leader-election-lock...
I0408 20:27:11.771887       1 leaderelection.go:258] successfully acquired lease castai-agent/agent-leader-election-lock
time="2023-04-08T20:27:11Z" level=info msg="started leading" own_identity=b3cbb0a3-23fb-4d2f-9d7d-8fe6a92c4098 provider=aks version=v0.42.3
time="2023-04-08T20:27:13Z" level=error msg="failed to retrieve instance metadata: Get \"http://169.254.169.254/metadata/instance?api-version=2021-05-01&format=json\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" provider=aks version=v0.42.3
time="2023-04-08T20:27:13Z" level=error msg="agent stopped with an error: registering cluster: autodiscovering cluster metadata: failed to get location metadata: provide required AKS_LOCATION environment variable" provider=aks version=v0.42.3
time="2023-04-08T20:27:13Z" level=info msg="stopped leading" own_identity=b3cbb0a3-23fb-4d2f-9d7d-8fe6a92c4098 provider=aks version=v0.42.3
time="2023-04-08T20:27:13Z" level=info msg="agent shutdown" provider=aks version=v0.42.3

Let’s check deployments

$ kubectl get deployments -n castai-agent
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
castai-agent-cpvpa   1/1     1            1           4m40s
castai-agent         0/1     1            0           4m40s

and fix that

builder@DESKTOP-QADGF36:~/Workspaces/testDir$ kubectl get deployment castai-agent -n castai-agent -o yaml > castai-agent.dep.yaml
builder@DESKTOP-QADGF36:~/Workspaces/testDir$ kubectl get deployment castai-agent -n castai-agent -o yaml > castai-agent.dep.yaml.bak
builder@DESKTOP-QADGF36:~/Workspaces/testDir$ vi castai-agent.dep.yaml
builder@DESKTOP-QADGF36:~/Workspaces/testDir$ diff -C7 castai-agent.dep.yaml.bak castai-agent.dep.yaml
*** castai-agent.dep.yaml.bak   2023-04-08 15:30:39.115036005 -0500
--- castai-agent.dep.yaml       2023-04-08 15:32:14.358851045 -0500
***************
*** 69,82 ****
--- 69,84 ----
        - env:
          - name: API_URL
            value: api.cast.ai
          - name: PROVIDER
            value: aks
          - name: PPROF_PORT
            value: "6060"
+         - name: AKS_LOCATION
+           value: centralus
+         - name: AKS_SUBSCRIPTION_ID
+           value: d955asdf-adfs-asdf-asf-asdf74cbb22d
+         - name: AKS_NODE_RESOURCE_GROUP
+           value: MyResourceGroup
          envFrom:
          - secretRef:
              name: castai-agent
          image: us-docker.pkg.dev/castai-hub/library/agent:v0.42.3
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
$ kubectl apply -f castai-agent.dep.yaml
deployment.apps/castai-agent configured
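
If you would rather not hand-edit the deployment YAML, a kubectl set env one-liner should accomplish the same thing. This is just a sketch, reusing the same placeholder values shown in the diff above:

# Patch the agent deployment in place with the AKS metadata it cannot autodiscover on K3s
$ kubectl set env deployment/castai-agent -n castai-agent \
    AKS_LOCATION=centralus \
    AKS_SUBSCRIPTION_ID=d955asdf-adfs-asdf-asf-asdf74cbb22d \
    AKS_NODE_RESOURCE_GROUP=MyResourceGroup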

Now they are running

$ kubectl get pods -n castai-agent
NAME                                  READY   STATUS    RESTARTS   AGE
castai-agent-cpvpa-84d84596b6-skm7g   1/1     Running   0          14m
castai-agent-5666b456c6-h9hbf         1/1     Running   0          56s

Initially, at least, I don’t get much other than a whole lot of server error warnings

/content/images/2023/04/cast-07.png

To be fair, I did jam this in as a fake AKS cluster.

There are things we can get, however.

Security Report

Cast.ai does show me by-container security reports which could be useful for finding problematic or risky workloads

/content/images/2023/04/cast-41.png

I was able to break down a container’s security vulnerabilities by app and OS and see the underlying CVEs referenced

/content/images/2023/04/cast-42.png

Let’s try updating just one. The oldest, azure-vote-front:v3, does have an updated v4 image.

The Helm chart doesn’t expose the image, so I’ll have to tweak the deployment manually

$ kubectl get deployments vote-front-azure-vote-1678278477 -o yaml > avf.dep.yaml
builder@DESKTOP-QADGF36:~/Workspaces/testDir$ kubectl get deployments vote-front-azure-vote-1678278477 -o yaml > avf.dep.yaml.bak
$ vi avf.dep.yaml
$ diff -c5 avf.dep.yaml avf.dep.yaml.bak
*** avf.dep.yaml        2023-04-09 11:17:42.448045215 -0500
--- avf.dep.yaml.bak    2023-04-09 11:17:34.198045153 -0500
***************
*** 40,50 ****
            value: Azure Vote App
          - name: VOTE1VALUE
            value: Cats
          - name: VOTE2VALUE
            value: Dogs
!         image: neilpeterson/azure-vote-front:v4
          imagePullPolicy: Always
          name: azure-vote-front
          ports:
          - containerPort: 80
            protocol: TCP
--- 40,50 ----
            value: Azure Vote App
          - name: VOTE1VALUE
            value: Cats
          - name: VOTE2VALUE
            value: Dogs
!         image: neilpeterson/azure-vote-front:v3
          imagePullPolicy: Always
          name: azure-vote-front
          ports:
          - containerPort: 80
            protocol: TCP
$ kubectl apply -f avf.dep.yaml
Warning: resource deployments/vote-front-azure-vote-1678278477 is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
deployment.apps/vote-front-azure-vote-1678278477 configured
$ kubectl get pods | grep vote
vote-back-azure-vote-1678278477-7dc4cf9668-gfxvw       1/1     Running   0               8d
vote-front-azure-vote-1678278477-67bb8bb55c-k9cxj      1/1     Running   0               78s
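
As an aside, since this was a one-off image bump, a kubectl set image sketch would have avoided the YAML round-trip entirely (the deployment and container names come from the diff above):

# Bump just the frontend container image on the existing deployment
$ kubectl set image deployment/vote-front-azure-vote-1678278477 \
    azure-vote-front=neilpeterson/azure-vote-front:v4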

Within a few hours this synced over to the Vulnerabilities page (sadly, it did not improve things)

/content/images/2023/04/cast-53.png

I can also see some “Best Practices”. The first was marked high “4.2.2 - Minimize the admission of containers wishing to share the host process ID namespace”.

/content/images/2023/04/cast-46.png

What was nice is that I could click the “Resources with this issue (1)” item to see more details.

/content/images/2023/04/cast-47.png

In this case, it’s my Datadog agent, which I want doing low-level monitoring, so it’s okay

The other super handy item, which puts Cast.ai ahead of tools like Whitesource Bolt, is the detail that shows not just how to identify the issue but, more importantly, how to remediate it

/content/images/2023/04/cast-48.png
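
If you want to cross-check that finding from the CLI, a quick jq filter over pod specs will list anything sharing the host PID namespace. This is a generic sketch, not something Cast.ai provides:

# List any pods that set hostPID, which is what check 4.2.2 flags
$ kubectl get pods -A -o json | jq -r '.items[] | select(.spec.hostPID == true) | "\(.metadata.namespace)/\(.metadata.name)"'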

Node list

It pointed out one of my nodes was having an issue

/content/images/2023/04/cast-49.png

Which a local check confirmed

$ kubectl get nodes
NAME                  STATUS     ROLES                  AGE   VERSION
isaac-macbookpro      NotReady   <none>                 32d   v1.23.10+k3s1
anna-macbookair       Ready      control-plane,master   32d   v1.23.10+k3s1
builder-macbookpro2   Ready      <none>                 32d   v1.23.10+k3s1

Indeed, something had knocked the power cord out of my old MacBook Pro node. I plugged it back in and powered it up

/content/images/2023/04/20230409_113714.jpg

Which, once booted, shows ready

/content/images/2023/04/cast-50.png

AKS

Let’s create a real AKS cluster just to see it in action

I’ll make a basic AKS Dev/Test cluster

/content/images/2023/04/cast-08.png

I’ll enable burstable nodes, which means I must use Azure CNI networking

/content/images/2023/04/cast-09.png
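
For reference, a rough az CLI equivalent of those portal clicks might look like the following. The resource names and B4ms size come from this post, but the virtual-node addon flags are an assumption and in practice also want an existing VNet/subnet wired in:

# Sketch only - AKS cluster on Azure CNI with the ACI virtual-node addon
$ az aks create \
    --resource-group MyVSEDemoRG \
    --name MyDevTestCluster \
    --node-count 2 \
    --node-vm-size Standard_B4ms \
    --network-plugin azure \
    --enable-addons virtual-node \
    --aci-subnet-name virtual-node-subnet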

We can log in to the cluster

$ az aks get-credentials --name MyDevTestCluster -g MyVSEDemoRG --admin
Merged "MyDevTestCluster-admin" as current context in /home/builder/.kube/config
$ kubectx
MyDevTestCluster-admin

Then add it

$ curl -H "Authorization: Token 96sdasdfasdfasdfasdfasdfasdfsadfasdfsadfasdfasdf35" "https://api.cast.ai/v1/agent.yaml?provider=aks" | kubectl apply -f -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11276    0 11276    0     0  49893      0 --:--:-- --:--:-- --:--:-- 49893
namespace/castai-agent created
resourcequota/castai-agent-critical-pods created
serviceaccount/castai-agent created
secret/castai-agent created
configmap/castai-agent-autoscaler created
clusterrole.rbac.authorization.k8s.io/castai-agent created
clusterrolebinding.rbac.authorization.k8s.io/castai-agent created
role.rbac.authorization.k8s.io/castai-agent created
rolebinding.rbac.authorization.k8s.io/castai-agent created
deployment.apps/castai-agent-cpvpa created
deployment.apps/castai-agent created

/content/images/2023/04/cast-10.png

This time we can see results that look fine

/content/images/2023/04/cast-11.png

Right off the bat we can see some suggestions

/content/images/2023/04/cast-12.png

I’ll “Enable CAST AI”

/content/images/2023/04/cast-13.png

$ CASTAI_API_TOKEN=59530b23be91aa81fe237e53db2fbe4bda92fa5c489e09cd7c7b39cdb75f79bd CASTAI_CLUSTER_ID=56e5bdb3-0612-4e8f-95c0-f3bff1442c66 INSTALL_KVISOR=true NODE_GROUP=MC_MyVSEDemoRG_MyDevTestCluster_eastus SUBSCRIPTION_ID=8defc61d-657a-453d-a6ff-cb9f91289a61 /bin/bash -c "$(curl -fsSL https://api.cast.ai/v1/scripts/aks/onboarding.sh)"
Setting active subscription: Visual Studio Enterprise Subscription
Fetching cluster information
Creating custom role: 'CastAKSRole-56e5bdb3'
Creating app registration: 'CAST.AI MyDevTestCluster-56e5bdb3'
Creating app secret: 'MyDevTestCluster-castai'
Creating service principal
Assigning role to 'CAST.AI MyDevTestCluster-56e5bdb3' app
Role 'CastAKSRole-56e5bdb3' doesn't exist.
Still executing...
Installing castai-cluster-controller.
"castai-helm" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "myharbor" chart repository
...Successfully got an update from the "freshbrewed" chart repository
...Successfully got an update from the "castai-helm" chart repository
...Successfully got an update from the "uptime-kuma" chart repository
...Successfully got an update from the "confluentinc" chart repository
...Successfully got an update from the "adwerx" chart repository
...Successfully got an update from the "kuma" chart repository
...Successfully got an update from the "actions-runner-controller" chart repository
...Successfully got an update from the "ngrok" chart repository
...Successfully got an update from the "dapr" chart repository
...Successfully got an update from the "azure-samples" chart repository
...Successfully got an update from the "rhcharts" chart repository
...Successfully got an update from the "sonarqube" chart repository
...Successfully got an update from the "hashicorp" chart repository
...Successfully got an update from the "novum-rgi-helm" chart repository
...Successfully got an update from the "epsagon" chart repository
...Successfully got an update from the "longhorn" chart repository
...Successfully got an update from the "sumologic" chart repository
...Successfully got an update from the "open-telemetry" chart repository
...Successfully got an update from the "nginx-stable" chart repository
...Successfully got an update from the "lifen-charts" chart repository
...Successfully got an update from the "kubecost" chart repository
...Successfully got an update from the "rook-release" chart repository
...Successfully got an update from the "elastic" chart repository
...Successfully got an update from the "datadog" chart repository
...Successfully got an update from the "harbor" chart repository
...Successfully got an update from the "argo-cd" chart repository
...Successfully got an update from the "incubator" chart repository
...Successfully got an update from the "rancher-latest" chart repository
...Successfully got an update from the "crossplane-stable" chart repository
...Successfully got an update from the "newrelic" chart repository
...Successfully got an update from the "gitlab" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Release "cluster-controller" does not exist. Installing it now.
NAME: cluster-controller
LAST DEPLOYED: Sat Apr  8 16:01:47 2023
NAMESPACE: castai-agent
STATUS: deployed
REVISION: 1
TEST SUITE: None
Finished installing castai-cluster-controller.
Installing castai-spot-handler.
Release "castai-spot-handler" does not exist. Installing it now.
NAME: castai-spot-handler
LAST DEPLOYED: Sat Apr  8 16:01:50 2023
NAMESPACE: castai-agent
STATUS: deployed
REVISION: 1
TEST SUITE: None
Finished installing castai-azure-spot-handler.
Installing castai-evictor.
Release "castai-evictor" does not exist. Installing it now.
NAME: castai-evictor
LAST DEPLOYED: Sat Apr  8 16:01:52 2023
NAMESPACE: castai-agent
STATUS: deployed
REVISION: 1
TEST SUITE: None
Finished installing castai-evictor.
Release "castai-kvisor" does not exist. Installing it now.
NAME: castai-kvisor
LAST DEPLOYED: Sat Apr  8 16:01:54 2023
NAMESPACE: castai-agent
STATUS: deployed
REVISION: 1
TEST SUITE: None
Finished installing castai-kvisor.
--------------------------------------------------------------------------------
Your generated credentials:
{ "subscriptionId": "8defc61d-657a-453d-a6ff-cb9f91289a61", "tenantId": "15d19784-ad58-4a57-a66f-ad1c0f826a45", "clientId": "97c90266-f530-4f22-8964-5e207dbc46e5", "clientSecret": "glO8Q~j-U6OZ43RPQ4Y.4eIhCTz-1QP.yvfILcYO" }
Sending credentials to CAST AI console...
Request failed with error: {"message":"Bad Request", "fieldViolations":[{"field":"credentials", "description":"invalid credentials: graphrbac.ApplicationsClient#GetServicePrincipalsIDByAppID: Failure responding to request: StatusCode=401 -- Original Error: autorest/azure: Service returned an error. Status=401 Code=\"Unknown\" Message=\"Unknown service error\" Details=[{\"odata.error\":{\"code\":\"Authorization_IdentityNotFound\",\"date\":\"2023-04-08T21:01:57\",\"message\":{\"lang\":\"en\",\"value\":\"The identity of the calling application could not be established.\"},\"requestId\":\"e8fb9964-1ab3-4bc8-978f-eaf0ee627f76\"}}]"}]}
Will retry after 10 seconds ...
Still executing...
Successfully sent.

We can see it came back with a Security Insights indicator

/content/images/2023/04/cast-14.png

I was a bit surprised to see it actually created a new App Reg/SP ID during the process

/content/images/2023/04/cast-15.png

Though it didn’t add any excessive perms

/content/images/2023/04/cast-16.png

We can set up Cast AI to be the autoscaler

/content/images/2023/04/cast-17.png

We can set some policies to scale and use spot instances

/content/images/2023/04/cast-18.png

We can see current costs

/content/images/2023/04/cast-19.png

This is actually pretty handy, as even after 12h Azure had yet to put together cost reporting, but I could see some specifics in Cast

/content/images/2023/04/cast-20.png

The question remains about accuracy. For instance, the report above suggests I’ve spent US$11.16 this month on the cluster. However, I chose a Dev/Test class cluster where the control plane is free (boo, Microsoft; it used to always be free). So I just need to look at the MC (managed cluster) resource group to see the current spend:

This shows it is more like $0.61 thus far. I’ll give it a few days and check back to see if they are just time-shifted, but that is a pretty big difference
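
If you prefer the CLI to clicking around Cost Analysis, something like az consumption usage list can pull the same raw numbers. Treat this as a sketch; the exact field names vary by subscription/offer type:

# Sum the pre-tax cost of everything in the managed cluster (MC_) resource group for one day
$ az consumption usage list --start-date 2023-04-08 --end-date 2023-04-09 -o json \
    | jq '[.[] | select((.instanceId // "") | test("MC_MyVSEDemoRG_MyDevTestCluster_eastus"; "i")) | (.pretaxCost | tonumber)] | add'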

I came back the next day and saw that Cast predicted EOD spend to be US$19.13

/content/images/2023/04/cast-55.png

Azure showed the billed amount for yesterday was $10.71. I suspect the numbers will fall more in line over time.

/content/images/2023/04/cast-54.png

In fact, if I add the initial $1.37 to the current daily $9.34 and project out to the end of April (20*9.34 + 1.37), that would total $188.17, which is pretty darn close

/content/images/2023/04/cast-56.png
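
The back-of-the-napkin math, for anyone following along:

# 20 remaining days at the current daily rate, plus the partial first day
$ echo "scale=2; 20*9.34 + 1.37" | bc
188.17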

Cost by Workload

This assumes we can trust the total cost; assuming we can, we see a breakdown of relative cost by namespace

/content/images/2023/04/cast-22.png

Which is either an aggregation or subset of what Azure calls workloads

/content/images/2023/04/cast-23.png

The results in Azure match deployments

$ kubectl get deployments --all-namespaces
NAMESPACE           NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
castai-agent        castai-agent                                   1/1     1            1           18h
castai-agent        castai-agent-cpvpa                             1/1     1            1           18h
castai-agent        castai-cluster-controller                      2/2     2            2           18h
castai-agent        castai-evictor                                 0/0     0            0           18h
castai-agent        castai-kvisor                                  1/1     1            1           18h
gatekeeper-system   gatekeeper-audit                               1/1     1            1           18h
gatekeeper-system   gatekeeper-controller                          2/2     2            2           18h
kube-system         aci-connector-linux                            1/1     1            1           18h
kube-system         azure-policy                                   1/1     1            1           18h
kube-system         azure-policy-webhook                           1/1     1            1           18h
kube-system         coredns                                        5/5     5            5           18h
kube-system         coredns-autoscaler                             1/1     1            1           18h
kube-system         konnectivity-agent                             2/2     2            2           18h
kube-system         metrics-server                                 2/2     2            2           18h
kube-system         microsoft-defender-collector-misc              1/1     1            1           17h
kube-system         microsoft-defender-collector-virtual-kubelet   1/1     1            1           17h

I thought it might perhaps map to Helm deployments, but I only see Cast.ai there

$ helm list --all-namespaces
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                                   APP VERSION
castai-evictor          castai-agent    1               2023-04-08 16:01:52.386588429 -0500 CDT deployed        castai-evictor-0.21.23                  7b8ac7fa265461f86ffb5c255e7891e7b1e5e353
castai-kvisor           castai-agent    1               2023-04-08 16:01:54.618311791 -0500 CDT deployed        castai-kvisor-0.16.24                   v0.27.1
castai-spot-handler     castai-agent    1               2023-04-08 16:01:50.351671727 -0500 CDT deployed        castai-spot-handler-0.18.1              v0.11.0
cluster-controller      castai-agent    1               2023-04-08 16:01:47.560913705 -0500 CDT deployed        castai-cluster-controller-0.49.1        v0.34.2

The number of nodes I’ve added thus far is 3, with one being a spot instance

$ kubectl get nodes --all-namespaces
NAME                                STATUS   ROLES   AGE   VERSION
aks-agentpool-12803503-vmss000000   Ready    agent   18h   v1.24.10
aks-agentpool-12803503-vmss000003   Ready    agent   18h   v1.24.10
virtual-node-aci-linux              Ready    agent   18h   v1.19.10-vk-azure-aci-1.4.8

I believe that 3rd virtual node is there for when it’s needed, but doesn’t actually exist as provisioned infrastructure until used

/content/images/2023/04/cast-24.png
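
That matches how AKS virtual nodes behave: the virtual-kubelet node advertises capacity but is tainted so nothing lands on it by accident. A quick way to confirm (the taint key here is the standard virtual-kubelet one, not anything specific to this cluster):

# The ACI virtual node carries a virtual-kubelet.io/provider taint that normal pods do not tolerate
$ kubectl describe node virtual-node-aci-linux | grep -i taints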

If Cast.ai didn’t really account for the virtual node, it might see 3 nodes and know the class is B4ms. Totaling the 18 hours, that would get close to the total we saw

/content/images/2023/04/cast-25.png

One thing I noticed is that after I had set up Cast.ai to optimize the cluster, it created some Cast Node Pools

/content/images/2023/04/cast-26.png

I see they have taints to keep workloads from scheduling onto them. But both pools were set to not use Azure Spot Instances.

/content/images/2023/04/cast-27.png

This is where I have a touch of a struggle. My cluster was set up to use Azure Virtual Nodes, which essentially scale out to ACI for per-second pricing on pods. If there is sustained usage, then a VM is scheduled.
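
Worth noting for what follows: pods only burst onto the virtual node if they explicitly tolerate its taint and select it. A minimal sketch, using the standard virtual-kubelet selectors and tolerations from the AKS docs rather than anything generated by Cast.ai:

# Sketch of a pod that is allowed to schedule onto the ACI virtual node
$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: aci-burst-test
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
  nodeSelector:
    kubernetes.io/role: agent
    type: virtual-kubelet
  tolerations:
  - key: virtual-kubelet.io/provider
    operator: Exists
  - key: azure.com/aci
    effect: NoSchedule
EOF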

Slam the Cluster

Let’s slam the cluster with 100 busyboxes. I used a JFelten chart (which I modified here)

builder@DESKTOP-QADGF36:~/Workspaces/ijohnson-helm-charts$ helm install hundredbusyboxes ./charts/busybox/ --set replicaCount=100
NAME: hundredbusyboxes
LAST DEPLOYED: Mon Apr 10 06:17:29 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Congratulations your kubernetes cluster is even busier!

To get busy with this pod:
  export POD_NAME=$(kubectl get pods --namespace default -l "app=hundredbusyboxes-busybox" -o jsonpath="{.items[0].metadata.name}")
  kubectl exec -it  $POD_NAME sh
builder@DESKTOP-QADGF36:~/Workspaces/ijohnson-helm-charts$ kubectl get pods
NAME                                       READY   STATUS              RESTARTS   AGE
hundredbusyboxes-busybox-d95874867-27fgx   0/1     Running             0          27s
hundredbusyboxes-busybox-d95874867-2q72x   0/1     Running             0          25s
hundredbusyboxes-busybox-d95874867-2rk42   1/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-44ghm   0/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-488xx   0/1     Running             0          25s
hundredbusyboxes-busybox-d95874867-4fnx2   0/1     Running             0          27s
hundredbusyboxes-busybox-d95874867-4hn77   0/1     Running             0          24s
hundredbusyboxes-busybox-d95874867-4jgwk   0/1     Pending             0          26s
hundredbusyboxes-busybox-d95874867-4p9nn   0/1     Running             0          27s
hundredbusyboxes-busybox-d95874867-4rtld   0/1     Running             0          27s
hundredbusyboxes-busybox-d95874867-5c8fg   0/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-5n926   0/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-5px9l   0/1     ContainerCreating   0          28s
hundredbusyboxes-busybox-d95874867-6cnvt   0/1     Running             0          27s
hundredbusyboxes-busybox-d95874867-6hxdx   0/1     Running             0          27s
hundredbusyboxes-busybox-d95874867-6w4zg   0/1     Running             0          25s
hundredbusyboxes-busybox-d95874867-728xh   0/1     Running             0          27s
hundredbusyboxes-busybox-d95874867-76chl   1/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-777gw   0/1     Pending             0          28s
hundredbusyboxes-busybox-d95874867-79kxx   0/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-7dcbp   0/1     Running             0          27s
hundredbusyboxes-busybox-d95874867-7dpnn   0/1     Running             0          25s
hundredbusyboxes-busybox-d95874867-7fsdr   0/1     Running             0          26s
hundredbusyboxes-busybox-d95874867-8wwgb   0/1     Running             0          26s
hundredbusyboxes-busybox-d95874867-9462c   0/1     Pending             0          25s
hundredbusyboxes-busybox-d95874867-b84qw   0/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-bdf4q   0/1     ContainerCreating   0          25s
hundredbusyboxes-busybox-d95874867-bgp9t   0/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-cbfnv   0/1     Pending             0          28s
hundredbusyboxes-busybox-d95874867-cblhp   0/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-cxssm   0/1     Pending             0          26s
hundredbusyboxes-busybox-d95874867-d2ppc   0/1     Pending             0          26s
hundredbusyboxes-busybox-d95874867-d9n72   0/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-dwbv5   0/1     ContainerCreating   0          28s
hundredbusyboxes-busybox-d95874867-f2cqb   0/1     Running             0          26s
hundredbusyboxes-busybox-d95874867-f6992   0/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-f6lv8   0/1     Running             0          26s
hundredbusyboxes-busybox-d95874867-f8d9d   1/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-fgpr9   0/1     ContainerCreating   0          28s
hundredbusyboxes-busybox-d95874867-fjkrj   0/1     Running             0          28s
hundredbusyboxes-busybox-d95874867-fw7sm   0/1     Running             0          27s
hundredbusyboxes-busybox-d95874867-fwfdh   0/1     Running             0          27s
hundredbusyboxes-busybox-d95874867-fzvcw   0/1     Pending             0          25s
hundredbusyboxes-busybox-d95874867-g28cb   0/1     Running             0          25s
hundredbusyboxes-busybox-d95874867-g92gf   0/1     ContainerCreating   0          25s
hundredbusyboxes-busybox-d95874867-gkwnb   0/1     Running             0          26s
... snip ...

Within a minute or so they were all running

$ kubectl get pods
NAME                                       READY   STATUS    RESTARTS   AGE
hundredbusyboxes-busybox-d95874867-27fgx   1/1     Running   0          6m15s
hundredbusyboxes-busybox-d95874867-2q72x   1/1     Running   0          6m13s
hundredbusyboxes-busybox-d95874867-2rk42   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-44ghm   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-488xx   1/1     Running   0          6m13s
hundredbusyboxes-busybox-d95874867-4fnx2   1/1     Running   0          6m15s
hundredbusyboxes-busybox-d95874867-4hn77   1/1     Running   0          6m12s
hundredbusyboxes-busybox-d95874867-4jgwk   1/1     Running   0          6m14s
hundredbusyboxes-busybox-d95874867-4p9nn   1/1     Running   0          6m15s
hundredbusyboxes-busybox-d95874867-4rtld   1/1     Running   0          6m15s
hundredbusyboxes-busybox-d95874867-5c8fg   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-5n926   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-5px9l   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-6cnvt   1/1     Running   0          6m15s
hundredbusyboxes-busybox-d95874867-6hxdx   1/1     Running   0          6m15s
hundredbusyboxes-busybox-d95874867-6w4zg   1/1     Running   0          6m13s
hundredbusyboxes-busybox-d95874867-728xh   1/1     Running   0          6m15s
hundredbusyboxes-busybox-d95874867-76chl   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-777gw   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-79kxx   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-7dcbp   1/1     Running   0          6m15s
hundredbusyboxes-busybox-d95874867-7dpnn   1/1     Running   0          6m13s
hundredbusyboxes-busybox-d95874867-7fsdr   1/1     Running   0          6m14s
hundredbusyboxes-busybox-d95874867-8wwgb   1/1     Running   0          6m14s
hundredbusyboxes-busybox-d95874867-9462c   1/1     Running   0          6m13s
hundredbusyboxes-busybox-d95874867-b84qw   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-bdf4q   1/1     Running   0          6m13s
hundredbusyboxes-busybox-d95874867-bgp9t   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-cbfnv   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-cblhp   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-cxssm   1/1     Running   0          6m14s
hundredbusyboxes-busybox-d95874867-d2ppc   1/1     Running   0          6m14s
hundredbusyboxes-busybox-d95874867-d9n72   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-dwbv5   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-f2cqb   1/1     Running   0          6m14s
hundredbusyboxes-busybox-d95874867-f6992   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-f6lv8   1/1     Running   0          6m14s
hundredbusyboxes-busybox-d95874867-f8d9d   1/1     Running   0          6m16s
hundredbusyboxes-busybox-d95874867-fgpr9   1/1     Running   0          6m16s
...

Yet I had no new nodes

$ kubectl get nodes
NAME                                STATUS   ROLES   AGE   VERSION
aks-agentpool-12803503-vmss000000   Ready    agent   38h   v1.24.10
aks-agentpool-12803503-vmss000003   Ready    agent   38h   v1.24.10
virtual-node-aci-linux              Ready    agent   38h   v1.19.10-vk-azure-aci-1.4.8

Now I’m rather stumped on how to push this cluster over the edge.

/content/images/2023/04/cast-57.png

As they all seem to fit on one node

/content/images/2023/04/cast-58.png
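
One way to sanity-check that is to look at each node's allocatable pod and CPU capacity; 100 idle busyboxes request very little, so they can pack onto a single worker:

# Show how many pods and how much CPU each node can actually take
$ kubectl get nodes -o custom-columns='NAME:.metadata.name,PODS:.status.allocatable.pods,CPU:.status.allocatable.cpu'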

What about 1000?

$ helm upgrade hundredbusyboxes ./charts/busybox/ --set replicaCount=1000
Release "hundredbusyboxes" has been upgraded. Happy Helming!
NAME: hundredbusyboxes
LAST DEPLOYED: Mon Apr 10 06:27:08 2023
NAMESPACE: default
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
Congratulations your kubernetes cluster is even busier!

To get busy with this pod:
  export POD_NAME=$(kubectl get pods --namespace default -l "app=hundredbusyboxes-busybox" -o jsonpath="{.items[0].metadata.name}")
  kubectl exec -it  $POD_NAME sh

/content/images/2023/04/cast-59.png

That seemed to add some nodes

$ kubectl get nodes
NAME                                STATUS     ROLES    AGE   VERSION
aks-agentpool-12803503-vmss000000   Ready      agent    38h   v1.24.10
aks-agentpool-12803503-vmss000003   Ready      agent    38h   v1.24.10
aks-agentpool-12803503-vmss000004   NotReady   <none>   1s    v1.24.10
aks-agentpool-12803503-vmss000005   NotReady   <none>   3s    v1.24.10
aks-agentpool-12803503-vmss000006   NotReady   <none>   4s    v1.24.10
virtual-node-aci-linux              Ready      agent    38h   v1.19.10-vk-azure-aci-1.4.8

$ kubectl get nodes
NAME                                STATUS   ROLES   AGE   VERSION
aks-agentpool-12803503-vmss000000   Ready    agent   38h   v1.24.10
aks-agentpool-12803503-vmss000003   Ready    agent   38h   v1.24.10
aks-agentpool-12803503-vmss000004   Ready    agent   28s   v1.24.10
aks-agentpool-12803503-vmss000005   Ready    agent   30s   v1.24.10
aks-agentpool-12803503-vmss000006   Ready    agent   31s   v1.24.10
virtual-node-aci-linux              Ready    agent   38h   v1.19.10-vk-azure-aci-1.4.8

/content/images/2023/04/cast-60.png

However, as it scales out, nothing is failing just yet

$ kubectl get pods -o json | jq -r '.items[] | .status.phase' | sort -u
Pending
Running

/content/images/2023/04/cast-61.png

$ kubectl get pods -o json | jq -r '.items[] | .status.phase' | sort -u
Pending
Running

/content/images/2023/04/cast-62.png

I gave this a full six hours to see if it would scale out to a Cast.ai-provisioned node; however, 6 hours in, we have less than half of our pods scheduled:

/content/images/2023/04/cast-65.png

Yet none are marked as failed

$ kubectl get pods -o json | jq -r '.items[] | .status.phase' | sort -u
Pending
Running
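
A slightly more useful variant counts the pods in each phase instead of just listing the distinct values:

# Count pods per phase rather than just listing unique phases
$ kubectl get pods -o json | jq -r '.items[] | .status.phase' | sort | uniq -c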

And no Cast-based worker nodes were scheduled

/content/images/2023/04/cast-66.png

I even got an automated alert on pegged CPU, yet no new workers

/content/images/2023/04/cast-67.png

I can also see a growing spike in costs

/content/images/2023/04/cast-68.png

GKE

Let’s try Google as well.

I’ll create a new cluster

/content/images/2023/04/cast-28.png

Let’s try Autopilot first. I think this might confuse Cast.ai, but it’s worth seeing the results

/content/images/2023/04/cast-29.png

I’ll give it a name and region

/content/images/2023/04/cast-30.png

and accept the defaults for the rest, then click “Create Cluster”

/content/images/2023/04/cast-31.png
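
The gcloud equivalent of that wizard is a one-liner, using the same name, region, and project that show up in the get-credentials step below:

# Create the Autopilot cluster from the CLI instead of the console
$ gcloud container clusters create-auto my-castai-autopilot \
    --region us-central1 \
    --project myanthosproject2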

Once created, I can “connect”

/content/images/2023/04/cast-32.png

which reminds me of the handy one-liner to get the kubeconfig

/content/images/2023/04/cast-33.png

$ gcloud container clusters get-credentials my-castai-autopilot --region us-central1 --project myanthosproject2
Fetching cluster endpoint and auth data.
kubeconfig entry generated for my-castai-autopilot.


To take a quick anonymous survey, run:
  $ gcloud survey

$ kubectl get nodes
NAME                                                 STATUS   ROLES    AGE     VERSION
gk3-my-castai-autopilot-default-pool-55392129-29f2   Ready    <none>   8m41s   v1.24.10-gke.2300
gk3-my-castai-autopilot-default-pool-957044a2-lgk3   Ready    <none>   8m40s   v1.24.10-gke.2300

Back in Cast, I’ll get the GKE command to run

/content/images/2023/04/cast-34.png

$ curl -H "Authorization: Token d1asdfasdfasdfasdfasdfasdfasdfasdfasdfasdfasdf26" "https://api.cast.ai/v1/agent.yaml?provider=gke" | kubectl apply -f -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11276    0 11276    0     0  18762      0 --:--:-- --:--:-- --:--:-- 18730
namespace/castai-agent created
resourcequota/castai-agent-critical-pods created
serviceaccount/castai-agent created
secret/castai-agent created
configmap/castai-agent-autoscaler created
clusterrole.rbac.authorization.k8s.io/castai-agent created
clusterrolebinding.rbac.authorization.k8s.io/castai-agent created
role.rbac.authorization.k8s.io/castai-agent created
rolebinding.rbac.authorization.k8s.io/castai-agent created
Warning: Autopilot set default resource requests for Deployment castai-agent/castai-agent-cpvpa, as resource requests were not specified. See http://g.co/gke/autopilot-defaults
deployment.apps/castai-agent-cpvpa created
Warning: Autopilot increased resource requests for Deployment castai-agent/castai-agent to meet requirements. See http://g.co/gke/autopilot-resources
deployment.apps/castai-agent created

Which then notes it was added successfully

/content/images/2023/04/cast-35.png

Almost immediately I see we are at US$3.65 for the cluster. I have to assume this is just an end-of-day estimate (since it matches “Daily cost”)

/content/images/2023/04/cast-36.png

The addition of Cast.ai caused the cluster to add a node immediately

$ kubectl get nodes
NAME                                                 STATUS   ROLES    AGE    VERSION
gk3-my-castai-autopilot-default-pool-55392129-29f2   Ready    <none>   14m    v1.24.10-gke.2300
gk3-my-castai-autopilot-default-pool-957044a2-79ww   Ready    <none>   3m6s   v1.24.10-gke.2300
gk3-my-castai-autopilot-default-pool-957044a2-lgk3   Ready    <none>   14m    v1.24.10-gke.2300

I noticed I couldn’t get details in the GCP Console. The alert let me know an API hadn’t been enabled

/content/images/2023/04/cast-37.png

So I went and enabled it

/content/images/2023/04/cast-38.png

Since Google won’t show me the instance type in any window I can find, I can pull it from the node details.

$ kubectl describe node | grep node.kubernetes.io/instance-type
                    node.kubernetes.io/instance-type=e2-medium
                    node.kubernetes.io/instance-type=e2-medium
                    node.kubernetes.io/instance-type=e2-medium

However, we can also get this data from the “Node list” area of Cast.ai

/content/images/2023/04/cast-43.png

It’s a rough guess, but for the region and class we are looking at US$73.38/mo, or about US$2.36/day, not counting storage or the control plane, for a 3-node cluster of e2-mediums.

/content/images/2023/04/cast-39.png

There does not appear to be a Cast.ai autoscaler for GKE yet based on the UI

/content/images/2023/04/cast-40.png

When I delete a cluster

/content/images/2023/04/cast-51.png

It soon shows as disconnected. I can then leave it, reconnect or delete the cluster from the list using the triple dot menu.

/content/images/2023/04/cast-52.png

Terraform

One can add Cast.ai through Terraform.

There are examples here.

For instance, for AKS

data "azurerm_subscription" "current" {}

provider "castai" {
  api_token = var.castai_api_token
}

provider "helm" {
  kubernetes {
    host                   = azurerm_kubernetes_cluster.this.kube_config.0.host
    client_certificate     = base64decode(azurerm_kubernetes_cluster.this.kube_config.0.client_certificate)
    client_key             = base64decode(azurerm_kubernetes_cluster.this.kube_config.0.client_key)
    cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.this.kube_config.0.cluster_ca_certificate)
  }
}

# Configure AKS cluster connection to CAST AI using CAST AI aks-cluster module.
module "castai-aks-cluster" {
  source = "castai/aks/castai"

  aks_cluster_name    = var.cluster_name
  aks_cluster_region  = var.cluster_region
  node_resource_group = azurerm_kubernetes_cluster.this.node_resource_group
  resource_group      = azurerm_kubernetes_cluster.this.resource_group_name

  delete_nodes_on_disconnect = var.delete_nodes_on_disconnect

  subscription_id = data.azurerm_subscription.current.subscription_id
  tenant_id       = data.azurerm_subscription.current.tenant_id

  default_node_configuration = module.castai-aks-cluster.castai_node_configurations["default"]

  node_configurations = {
    default = {
      disk_cpu_ratio = 25
      subnets        = [azurerm_subnet.internal.id]
      tags           = var.tags
    }
  }
}

and GKE

module "gke" {
  source                     = "terraform-google-modules/kubernetes-engine/google"
  version                    = "24.1.0"
  project_id                 = var.project_id
  name                       = var.cluster_name
  region                     = var.cluster_region
  zones                      = var.cluster_zones
  network                    = module.vpc.network_name
  subnetwork                 = module.vpc.subnets_names[0]
  ip_range_pods              = local.ip_range_pods
  ip_range_services          = local.ip_range_services
  http_load_balancing        = false
  network_policy             = false
  horizontal_pod_autoscaling = true
  filestore_csi_driver       = false

  node_pools = [
    {
      name               = "default-node-pool"
      machine_type       = "e2-medium"
      min_count          = 2
      max_count          = 10
      local_ssd_count    = 0
      disk_size_gb       = 100
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      preemptible        = false
      initial_node_count = 2
    },
  ]
}
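
In either case the flow is the usual Terraform one. A hypothetical run, assuming the remaining variables (cluster name, region, and so on) are supplied via a tfvars file:

# Initialize providers and modules, then review and apply
$ terraform init
$ terraform plan -var "castai_api_token=${CASTAI_API_TOKEN}"
$ terraform apply -var "castai_api_token=${CASTAI_API_TOKEN}"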

Company Profile

Cast.ai is based out of Miami, FL and has around 74 employees (or 50, depending on the source). Its last financing deal was a 2023 Series B/Series A2 round in March of this year for $20.14M with Cota Capital and Creandum. It’s privately held and VC-backed. It was founded in Oct 2019 by Laurent Gil (Chief Product Officer), Leon Kuperman (CTO), Vilius Zukauskas (Director of Operations), and Yuri Frayman (CEO).

The four of them had sold Zenedge to Oracle in 2018 and came away thinking about how large their cloud bills were. They bill themselves as a “cloud optimization platform that reduces cloud costs, optimizes DevOps, and automates disaster recovery.” While its HQ is in Florida, the Director of Engineering, Augustinas Stirbis, is in Lithuania. I checked their job board and saw listings for Europe, the US, and India, all fully remote.

Just a fun fact: I checked ZenEdge in the Wayback Machine, and it was clearly updated through 2018

/content/images/2023/04/cast-63.png

But Oracle evidently didn’t renew the domain, so now it’s a placeholder for an Indonesian casino site

/content/images/2023/04/cast-64.png

Costs

Cast.ai has a few plans. I really like all that is bundled thus far with the free plan.

/content/images/2023/04/cast-45.png

However, that odd desperation kicked in again… In the time it took to write just the first sentence of this section while looking at the pricing page, it popped up yet another pitch for a POC… the page background blurring is all them. Guys, I’m on the pricing page; wouldn’t that mean I’m somewhat interested? Why blur it and put another pitch in my face?

/content/images/2023/04/cast-44.png

Summary

Cast.ai makes a pretty solid suite that offers a lot of value for a relatively low cost. I could easily see myself using it if they expanded to self-managed clusters. That said, if one self-manages, I’m not sure how the autoscaler would really help unless I opened up a path to provisioning new hardware.

Even so, I see a value play in just scanning my cluster for security issues and suggested workload optimizations. Personally, I trust my cloud provider the most to scale properly; I do not really want to add a dependency on a third party that would need to upgrade alongside the cluster to keep working accurately.

One area I would love to see them improve upon is reporting and alerting. Their UI is fabulous; however, I shy away from systems that require me to constantly watch a dashboard. I would much prefer some form of alerting and escalation paths. Here, I’m thinking of things like AWS Budgets that ping me when my spend exceeds some preset limit. An alert that someone deployed a high-vulnerability workload, or that a node scale-out exceeded norms: these kinds of things could be very useful.

Castai Kubernetes CostManagement

Have something to add? Feedback? You can use the feedback form

Isaac Johnson

Cloud Solutions Architect

Isaac is a CSA and DevOps engineer who focuses on cloud migrations and devops processes. He also is a dad to three wonderful daughters (hence the references to Princess King sprinkled throughout the blog).
