Upgrading AKS - A Quick Primer

Published: Jun 23, 2023 by Isaac Johnson

This topic of AKS upgrades has come up a few times for me lately: questions around how to upgrade, what is affected, and how to check and rotate Service Principal credentials. I figured it might be nice to just write it up for everyone.

We will cover AKS via the CLI, Portal and Terraform (Using TF Cloud). We’ll run some live experiments and cover many different upgrade scenarios.

Creating an AKS cluster

Let’s just create a cluster. To keep it simple, we’ll use the AZ CLI

Resource Group

First, I create a Resource Group

$ az login

$ az account set --subscription 2f0469bd-d00c-4411-85e6-369915da6e54

$ az group create --name showAKSUpgrade --location centralus

Service Principal

Next, I’ll want to create a Service Principal (SP), also known as an App Registration, that I can use for the cluster

$ az ad sp create-for-rbac -n idjaksupg01sp --skip-assignment --output json

Option '--skip-assignment' has been deprecated and will be removed in a future release.

The output includes credentials that you must protect. Be sure that you do not include these credentials in your code or check the credentials into your source control. For more information, see https://aka.ms/azadsp-cli

{

  "appId": "12c57d86-8cc1-4da4-b0ff-b38dc97ab02d",

  "displayName": "idjaksupg01sp",

  "password": "dGhpcyBpcyBhIHBhc3N3b3JkIGJ1dCBub3QgdGhlIG9uZSB0aGF0IGlzIHJlYWwK",

  "tenant": "da4e9527-0d2b-4842-a1f2-f47cb15531de"

}

$ export SP_PASS="dGhpcyBpcyBhIHBhc3N3b3JkIGJ1dCBub3QgdGhlIG9uZSB0aGF0IGlzIHJlYWwK"
$ export SP_ID="12c57d86-8cc1-4da4-b0ff-b38dc97ab02d"

Generally, I save the SP creds into a file and use that for env vars

$ az ad sp create-for-rbac -n idjaksupg01sp --skip-assignment --output json > my_sp.json
$ cat my_sp.json | jq -r .appId
12c57d86-8cc1-4da4-b0ff-b38dc97ab02d

I can then set the ID and PASS on the fly

$ export SP_PASS=`cat my_sp.json | jq -r .password`
$ export SP_ID=`cat my_sp.json | jq -r .appId`

Whether you pipe to a JSON file or set the values manually, the next steps assume you have SP_ID and SP_PASS set.
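
If you want a quick sanity check that those variables point at a real Service Principal (an optional step, not part of the original flow), az ad sp show should echo back the display name we created above:

$ az ad sp show --id "$SP_ID" --query displayName -o tsv
idjaksupg01sp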

Create Cluster

Next, I’ll create an autoscaling cluster

$ az aks create --resource-group showAKSUpgrade --name idjaksup01 --location centralus --node-count 3 --enable-cluster-autoscaler --min-count 2 --max-count 4 --generate-ssh-keys --network-plugin azure --network-policy azure --service-principal $SP_ID --client-secret $SP_PASS

Once created, I can list it

$ az aks list -o table

Name        Location    ResourceGroup    KubernetesVersion    CurrentKubernetesVersion    ProvisioningState    Fqdn

----------  ----------  ---------------  -------------------  --------------------------  -------------------  -----------------------------------------------------------------

idjaksup01  centralus   showAKSUpgrade   1.25.6               1.25.6                      Succeeded            idjaksup01-showaksupgrade-8defc6-2940zsw6.hcp.centralus.azmk8s.io

Finding your SP ID

Assuming you are coming back to a cluster, how might you find the Service Principal ID it uses?

One easy way is to use the AZ CLI

$ az aks show -n idjaksup01 -g showAKSUpgrade --query servicePrincipalProfile.clientId -o tsv

12c57d86-8cc1-4da4-b0ff-b38dc97ab02d

Rotating a Cred

Assuming we want to expire and rotate the Service Principal client secret, we start by capturing the existing client ID so we can reset its credential

$ export SP_ID=`az aks show --resource-group showAKSUpgrade --name idjaksup01 --query servicePrincipalProfile.clientId -o tsv`

Alternatively, you can go to AAD and find your App Registration

/content/images/2023/06/aksupgrade-01.png

Then create a new secret under “Certificates & secrets” to use (you can also see the expiry on former secrets)

/content/images/2023/06/aksupgrade-02.png

Applying to the Cluster

Let’s now apply the new SP Secret to the cluster

I’ll rotate and apply as two steps

$ export SP_NEW_SECRET=`az ad app credential reset --id "$SP_ID" --query password -o tsv`

$ az aks update-credentials -g showAKSUpgrade -n idjaksup01 --reset-service-principal --service-principal "$SP_ID" --client-secret "$SP_NEW_SECRET"

This step will take a while so be patient.
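
If you want to keep an eye on it from another terminal, one option (a suggestion, not captured from this run) is to poll the cluster’s provisioning state, which returns to Succeeded once the credential reset has rolled through:

$ az aks show -g showAKSUpgrade -n idjaksup01 --query provisioningState -o tsv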

Verification

We can check our secrets by fetching AAD details from a CSI pod.

We first want to login to the cluster

$ az aks get-credentials -g  showAKSUpgrade -n idjaksup01 --admin
Merged "idjaksup01-admin" as current context in /home/builder/.kube/config

Then let’s find one of the CSI pods

$ kubectl get pods -l app=csi-azuredisk-node -n kube-system
NAME                       READY   STATUS    RESTARTS   AGE
csi-azuredisk-node-88sxc   3/3     Running   0          3d13h
csi-azuredisk-node-hw95j   3/3     Running   0          3d13h

We can now fetch and check the SP ID and PASS

$ kubectl exec -it --container azuredisk -n kube-system csi-azuredisk-node-88sxc -- cat /etc/kubernetes/azure.json
{
    "cloud": "AzurePublicCloud",
    "tenantId": "da4e9527-0d2b-4842-a1f2-f47cb15531de",
    "subscriptionId": "2f0469bd-d00c-4411-85e6-369915da6e54",
    "aadClientId": "12c57d86-8cc1-4da4-b0ff-b38dc97ab02d",
    "aadClientSecret": "bm90IG15IHNlY3JldCwgYnV0IHRoYW5rcyBmb3IgY2hlY2tpbmcK",
    "resourceGroup": "MC_showAKSUpgrade_idjaksup01_centralus",
    "location": "centralus",
    "vmType": "vmss",
    "subnetName": "aks-subnet",
    "securityGroupName": "aks-agentpool-25858899-nsg",
    "vnetName": "aks-vnet-25858899",
    "vnetResourceGroup": "",
    "routeTableName": "aks-agentpool-25858899-routetable",
    "primaryAvailabilitySetName": "",
    "primaryScaleSetName": "aks-nodepool1-28899529-vmss",
    "cloudProviderBackoffMode": "v2",
    "cloudProviderBackoff": true,
    "cloudProviderBackoffRetries": 6,
    "cloudProviderBackoffDuration": 5,
    "cloudProviderRateLimit": true,
    "cloudProviderRateLimitQPS": 10,
    "cloudProviderRateLimitBucket": 100,
    "cloudProviderRateLimitQPSWrite": 10,
    "cloudProviderRateLimitBucketWrite": 100,
    "useManagedIdentityExtension": false,
    "userAssignedIdentityID": "",
    "useInstanceMetadata": true,
    "loadBalancerSku": "Standard",
    "disableOutboundSNAT": false,
    "excludeMasterFromStandardLB": true,
    "providerVaultName": "",
    "maximumLoadBalancerRuleCount": 250,
    "providerKeyName": "k8s",
    "providerKeyVersion": ""
}

The values we care about in the output above are:

    "aadClientId": "12c57d86-8cc1-4da4-b0ff-b38dc97ab02d",
    "aadClientSecret": "bm90IG15IHNlY3JldCwgYnV0IHRoYW5rcyBmb3IgY2hlY2tpbmcK",

AKS Upgrades

We can upgrade a cluster two ways.

The first is using the Azure Portal. We’ll look it up under “Cluster configuration” in “Settings”

/content/images/2023/06/aksupgrade-03.png

If you have multiple node pools, or manually managed ones, you can choose to upgrade the Control Plane (master nodes) first.

/content/images/2023/06/aksupgrade-04.png

Clicking the dropdown shows all the versions we can move to right now (not every version that exists); AKS can only upgrade one minor version at a time

/content/images/2023/06/aksupgrade-05.png

Azure will tell us what to watch out for; namely that a few APIs are being deprecated

/content/images/2023/06/aksupgrade-06.png

Kubernetes has removed objects from ApiGroups between version 1.25.6 and 1.26.3. If you have any resources calling the ApiGroups below, migrate them to new ApiGroups ahead of time to avoid impact.
        
FlowSchema - flowcontrol.apiserver.k8s.io/v1beta1
HorizontalPodAutoscaler - autoscaling/v2beta2
PriorityLevelConfiguration - flowcontrol.apiserver.k8s.io/v1beta1
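
The portal can’t tell you which of your manifests or clients were written against the old versions (and, as we’ll see shortly, the API server converts objects between served versions anyway), but a rough probe is to ask for the old group versions explicitly and confirm whether they are still served:

$ kubectl get horizontalpodautoscalers.v2beta2.autoscaling -A
$ kubectl get flowschemas.v1beta1.flowcontrol.apiserver.k8s.io
$ kubectl get prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io

On a 1.25 cluster these should still resolve; after the 1.26 upgrade the removed versions would simply error out, which is the point of migrating ahead of time.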

As I want to see some effects, let’s pause for a moment and install a sample app

Azure Vote App

Note: this ultimately did not create an app using deprecated APIs, but it is a nice app we can use to watch our upgrades

I’ll add the Sample Repo and update, in case I haven’t yet

$ helm repo add azure-samples https://azure-samples.github.io/helm-charts/
"azure-samples" already exists with the same configuration, skipping
$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "kube-state-metrics" chart repository
...Successfully got an update from the "myharbor" chart repository
...Successfully got an update from the "freshbrewed" chart repository
...Successfully got an update from the "longhorn" chart repository
...Successfully got an update from the "actions-runner-controller" chart repository
...Successfully got an update from the "adwerx" chart repository
...Successfully got an update from the "confluentinc" chart repository
...Successfully got an update from the "kuma" chart repository
...Successfully got an update from the "hashicorp" chart repository
...Successfully got an update from the "dapr" chart repository
...Successfully got an update from the "akomljen-charts" chart repository
...Successfully got an update from the "kubecost" chart repository
...Successfully got an update from the "opencost" chart repository
...Successfully got an update from the "sumologic" chart repository
...Successfully got an update from the "zabbix-community" chart repository
...Successfully got an update from the "sonarqube" chart repository
...Successfully got an update from the "castai-helm" chart repository
...Successfully got an update from the "jfelten" chart repository
...Successfully got an update from the "azure-samples" chart repository
...Successfully got an update from the "nginx-stable" chart repository
...Successfully got an update from the "novum-rgi-helm" chart repository
...Successfully got an update from the "kiwigrid" chart repository
...Successfully got an update from the "rhcharts" chart repository
...Successfully got an update from the "epsagon" chart repository
...Successfully got an update from the "datadog" chart repository
...Successfully got an update from the "lifen-charts" chart repository
...Successfully got an update from the "rook-release" chart repository
...Successfully got an update from the "elastic" chart repository
...Successfully got an update from the "argo-cd" chart repository
...Successfully got an update from the "harbor" chart repository
...Successfully got an update from the "signoz" chart repository
...Successfully got an update from the "crossplane-stable" chart repository
...Successfully got an update from the "newrelic" chart repository
...Successfully got an update from the "incubator" chart repository
...Successfully got an update from the "grafana" chart repository
...Successfully got an update from the "uptime-kuma" chart repository
...Successfully got an update from the "gitlab" chart repository
...Successfully got an update from the "bitnami" chart repository
...Successfully got an update from the "ngrok" chart repository
...Successfully got an update from the "open-telemetry" chart repository
...Successfully got an update from the "rancher-latest" chart repository
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈

Next I’ll add the vote app

$ helm install azure-samples/azure-vote --generate-name
NAME: azure-vote-1687260329
LAST DEPLOYED: Tue Jun 20 06:25:30 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Azure Vote application has been started on your Kubernetes cluster.

Title: Azure Vote App
Vote 1 value: Cats
Vote 2 value: Dogs

The externally accessible IP address can take a minute or so to provision. Run the following command to monitor the provisioning status. Once an External IP address has been provisioned, brows to this IP address to access the Azure Vote application.

kubectl get service -l name=azure-vote-front -w

I can now get the LB IP and view the app

$ kubectl get service -l name=azure-vote-front
NAME               TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)        AGE
azure-vote-front   LoadBalancer   10.0.217.17   52.182.220.160   80:30425/TCP   24s

/content/images/2023/06/aksupgrade-07.png

Let’s now add an HPA, starting with the back end

$ kubectl get deployment
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
vote-back-azure-vote-1687260329    1/1     1            1           76s
vote-front-azure-vote-1687260329   1/1     1            1           76s

$ kubectl autoscale deployment vote-back-azure-vote-1687260329  --cpu-percent=50 --min=3 --max=10
horizontalpodautoscaler.autoscaling/vote-back-azure-vote-1687260329 autoscaled

I was hoping for a good demo of deprecated APIs, but it seems AKS 1.25 already prefers the newer API versions

builder@DESKTOP-QADGF36:~$ kubectl api-resources | grep autosc
horizontalpodautoscalers           hpa                 autoscaling/v2                         true         HorizontalPodAutoscaler
builder@DESKTOP-QADGF36:~$ kubectl api-resources | grep flow
flowschemas                                            flowcontrol.apiserver.k8s.io/v1beta2   false        FlowSchema
prioritylevelconfigurations                            flowcontrol.apiserver.k8s.io/v1beta2   false        PriorityLevelConfiguration
builder@DESKTOP-QADGF36:~$ kubectl api-resources | grep priority
prioritylevelconfigurations                            flowcontrol.apiserver.k8s.io/v1beta2   false        PriorityLevelConfiguration
priorityclasses                    pc                  scheduling.k8s.io/v1                   false        PriorityClass
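
kubectl api-resources only lists the preferred version for each group; to see every version the API server is currently serving (a complementary check), api-versions works too:

$ kubectl api-versions | grep -E 'autoscaling|flowcontrol'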

To ensure we have a deprecated API involved, I’ll manually create the HPA for the front end using YAML that targets autoscaling/v2beta2

$ cat avf.hpa.yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: vote-front-azure-vote-1687260329
  namespace: default
spec:
  maxReplicas: 10
  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 50
        type: Utilization
    type: Resource
  minReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vote-front-azure-vote-1687260329

$ kubectl apply -f avf.hpa.yaml
Warning: autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler
horizontalpodautoscaler.autoscaling/vote-front-azure-vote-1687260329 created 

However, the API server converted it on the fly and stored it under the newer API version!

$ kubectl get hpa vote-front-azure-vote-1687260329
NAME                               REFERENCE                                     TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
vote-front-azure-vote-1687260329   Deployment/vote-front-azure-vote-1687260329   0%/50%    3         10        3          47s

$ kubectl get hpa vote-front-azure-vote-1687260329 -o yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
  ... snip ...

What about the other APIs?

FlowSchema is fine (served at v1beta2, not the deprecated v1beta1)

$ kubectl get flowschema --all-namespaces
NAME                           PRIORITYLEVEL     MATCHINGPRECEDENCE   DISTINGUISHERMETHOD   AGE     MISSINGPL
exempt                         exempt            1                    <none>                3d14h   False
probes                         exempt            2                    <none>                3d14h   False
system-leader-election         leader-election   100                  ByUser                3d14h   False
endpoint-controller            workload-high     150                  ByUser                3d14h   False
workload-leader-election       leader-election   200                  ByUser                3d14h   False
system-node-high               node-high         400                  ByUser                3d14h   False
system-nodes                   system            500                  ByUser                3d14h   False
kube-controller-manager        workload-high     800                  ByNamespace           3d14h   False
kube-scheduler                 workload-high     800                  ByNamespace           3d14h   False
kube-system-service-accounts   workload-high     900                  ByNamespace           3d14h   False
service-accounts               workload-low      9000                 ByUser                3d14h   False
global-default                 global-default    9900                 ByUser                3d14h   False
catch-all                      catch-all         10000                ByUser                3d14h   False

$ kubectl get flowschema system-leader-election -o yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
metadata:
  annotations:

As is PriorityLevelConfiguration

$ kubectl get PriorityLevelConfiguration
NAME              TYPE      ASSUREDCONCURRENCYSHARES   QUEUES   HANDSIZE   QUEUELENGTHLIMIT   AGE
catch-all         Limited   5                          <none>   <none>     <none>             3d14h
exempt            Exempt    <none>                     <none>   <none>     <none>             3d14h
global-default    Limited   20                         128      6          50                 3d14h
leader-election   Limited   10                         16       4          50                 3d14h
node-high         Limited   40                         64       6          50                 3d14h
system            Limited   30                         64       6          50                 3d14h
workload-high     Limited   40                         128      6          50                 3d14h
workload-low      Limited   100                        128      6          50                 3d14h

$ kubectl get PriorityLevelConfiguration global-default -o yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: PriorityLevelConfiguration
metadata:
  annotations:
...

Upgrading

Even though I didn’t find a good way to create objects on deprecated APIs, let’s complete the upgrade

I’ll go to 1.26.0 (not straight to 1.26.3) so there is still an upgrade left to demonstrate with the CLI next

/content/images/2023/06/aksupgrade-08.png

Saving kicks off the upgrade

/content/images/2023/06/aksupgrade-09.png

We can then see it start to rotate in new nodes and move workloads.

/content/images/2023/06/aksupgrade-10.png

Never during the process does our app actually go down

/content/images/2023/06/aksupgrade-11.gif
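
If you want harder evidence than eyeballing a browser tab, a simple loop against the front-end LoadBalancer IP from earlier (a suggestion, not something I captured during this run) should print a steady stream of 200s while the nodes roll; Ctrl-C to stop:

$ while true; do curl -s -o /dev/null -w "%{http_code} " http://52.182.220.160/; sleep 2; done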

I panicked when Chrome started showing an internal server error, but I couldn’t find errors in the Kubernetes logs, and firing up an alternate browser showed things were fine

/content/images/2023/06/aksupgrade-12.png

And we can see it is done both in the Azure Portal

Control Plane:

/content/images/2023/06/aksupgrade-13.png

Node Pools:

/content/images/2023/06/aksupgrade-14.png

As well as on the command line

$ az aks list -o table
Name        Location    ResourceGroup    KubernetesVersion    CurrentKubernetesVersion    ProvisioningState    Fqdn
----------  ----------  ---------------  -------------------  --------------------------  -------------------  -----------------------------------------------------------------
idjaksup01  centralus   showAKSUpgrade   1.26.0               1.26.0                      Succeeded            idjaksup01-showaksupgrade-8defc6-2940zsw6.hcp.centralus.azmk8s.io

$ kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-28899529-vmss000000   Ready    agent   9m14s   v1.26.0
aks-nodepool1-28899529-vmss000001   Ready    agent   5m35s   v1.26.0

Using CLI

Let’s use the CLI this time.

We can ask which upgrades are available

$ az aks get-upgrades -g showAKSUpgrade -n idjaksup01 -o table
Name     ResourceGroup    MasterVersion    Upgrades
-------  ---------------  ---------------  ----------
default  showAKSUpgrade   1.26.0           1.26.3

It warns that Kubernetes may be briefly unavailable and asks me to confirm upgrading both the control plane and the node pools…

$ az aks upgrade -g showAKSUpgrade -n idjaksup01 --kubernetes-version 1.26.3
Kubernetes may be unavailable during cluster upgrades.
 Are you sure you want to perform this operation? (y/N): y
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version 1.26.3. Continue? (y/N): y
 \ Running ..

While that is running, we can use kubectl get events -w to watch events

15m         Warning   FailedGetResourceMetric        horizontalpodautoscaler/vote-front-azure-vote-1687260329   failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
15m         Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/vote-front-azure-vote-1687260329   invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
5m11s       Warning   FailedGetResourceMetric        horizontalpodautoscaler/vote-front-azure-vote-1687260329   failed to get cpu utilization: did not receive metrics for any ready pods
5m11s       Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/vote-front-azure-vote-1687260329   invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for any ready pods
0s          Normal    Starting                       node/aks-nodepool1-28899529-vmss000001
0s          Warning   FailedGetResourceMetric        horizontalpodautoscaler/vote-front-azure-vote-1687260329   failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
0s          Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/vote-front-azure-vote-1687260329   invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
0s          Normal    Starting                       node/aks-nodepool1-28899529-vmss000000
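
Another useful watch while the CLI upgrade runs (a suggestion, not captured here) is node status; you can see the surge node join and each existing node get cordoned, drained, and replaced at the new version:

$ kubectl get nodes -o wide -w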

You can watch it happen here if you want to see the process:

No Upgrades

If you are at the latest, it’s worth noting the “upgrades” field will be null

$ az aks get-upgrades -g showAKSUpgrade -n idjaksup01 -o table
Table output unavailable. Use the --query option to specify an appropriate query. Use --debug for more info.

$ az aks get-upgrades -g showAKSUpgrade -n idjaksup01
{
  "agentPoolProfiles": null,
  "controlPlaneProfile": {
    "kubernetesVersion": "1.26.3",
    "name": null,
    "osType": "Linux",
    "upgrades": null
  },
  ... snip ...
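
The table error above hints at the workaround: scoping a --query to the field we care about (based on the JSON shape shown) returns cleanly, and comes back empty when there is nothing left to upgrade:

$ az aks get-upgrades -g showAKSUpgrade -n idjaksup01 --query controlPlaneProfile.upgrades -o json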

And the portal will say the same

/content/images/2023/06/aksupgrade-15.png

Terraform

First, we’ll need to know what versions we can use in a region. I’ll be using “East US”, so let’s check what is supported there presently

$ az aks get-versions --location eastus --output table
KubernetesVersion    Upgrades
-------------------  -----------------------
1.27.1(preview)      None available
1.26.3               1.27.1(preview)
1.26.0               1.26.3, 1.27.1(preview)
1.25.6               1.26.0, 1.26.3
1.25.5               1.25.6, 1.26.0, 1.26.3
1.24.10              1.25.5, 1.25.6
1.24.9               1.24.10, 1.25.5, 1.25.6

Since I want to test upgrades, I’ll choose the lowest version (1.24.9).

I’ll set that in a variable

variable "kubernetes_version" {
  type        = string
  default     = "1.24.9 "
  description = "The specific version of Kubernetes to use."
}

Which I can use in my Kubernetes TF

resource "azurerm_kubernetes_cluster" "k8s" {
  location            = azurerm_resource_group.rg.location
  name                = random_pet.azurerm_kubernetes_cluster_name.id
  resource_group_name = azurerm_resource_group.rg.name
  dns_prefix          = random_pet.azurerm_kubernetes_cluster_dns_prefix.id
  kubernetes_version  = var.kubernetes_version

  identity {
    type = "SystemAssigned"
  }

  default_node_pool {
    name       = "agentpool"
    vm_size    = "Standard_D2_v2"
    node_count = var.node_count
  }
  linux_profile {
    admin_username = "ubuntu"

    ssh_key {
      key_data = jsondecode(azapi_resource_action.ssh_public_key_gen.output).publicKey
    }
  }
  network_profile {
    network_plugin    = "kubenet"
    load_balancer_sku = "standard"
  }
}

I wanted to quickly sanity-check the sizes in the region in case Standard_D2_v2 is no longer offered

$ az vm list-sizes --location eastus -o table | grep Standard_D2_v2
8                   7168          Standard_D2_v2             2                1047552           102400
8                   7168          Standard_D2_v2_Promo       2                1047552           102400

I now have a TF Cloud plan to run

/content/images/2023/06/aksupgrade-16.png

The first time through I made a typo on the version (“1.24.9 ” with a trailing space). If you use an invalid version string, you’ll get an error such as

Failure sending request: StatusCode=0 -- Original Error: Code="BadRequest" Message="Client Error: error parsing version(1.24.9 ) as semver: Invalid characters in version"

While it is still creating in TF Cloud, I can see it exists in Azure already

$ az aks list -o table
Name                 Location    ResourceGroup    KubernetesVersion    CurrentKubernetesVersion    ProvisioningState    Fqdn
-------------------  ----------  ---------------  -------------------  --------------------------  -------------------  --------------------------------------------------
cluster-driven-bear  eastus      idj-east-rg      1.24.9               1.24.9                      Creating             dns-verified-bluejay-q17kaixm.hcp.eastus.azmk8s.io

And when the run is done, I see that in TF Cloud as well

/content/images/2023/06/aksupgrade-17.png

I’ll now add an app so we can see some data

$ az aks get-credentials -n cluster-driven-bear -g idj-east-rg --admin
Merged "cluster-driven-bear-admin" as current context in /home/builder/.kube/config

$ helm install azure-samples/azure-vote --generate-name

NAME: azure-vote-1687520242
LAST DEPLOYED: Fri Jun 23 06:37:23 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Azure Vote application has been started on your Kubernetes cluster.

Title: Azure Vote App
Vote 1 value: Cats
Vote 2 value: Dogs

The externally accessible IP address can take a minute or so to provision. Run the following command to monitor the provisioning status. Once an External IP address has been provisioned, brows to this IP address to access the Azure Vote application.

kubectl get service -l name=azure-vote-front -w

$ kubectl get service -l name=azure-vote-front
NAME               TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
azure-vote-front   LoadBalancer   10.0.133.63   52.226.243.78   80:30747/TCP   33s

/content/images/2023/06/aksupgrade-18.png

Upgrade Scenario 1: Upgrading with TF

Let’s say we want to upgrade using Terraform.

I’ll first try something that should fail: jumping more than one minor version at once

$ git diff
diff --git a/variables.tf b/variables.tf
index 2e05a73..5524567 100644
--- a/variables.tf
+++ b/variables.tf
@@ -29,7 +29,7 @@ variable "msi_id" {
 
 variable "kubernetes_version" {
   type        = string
-  default     = "1.24.9"
+  default     = "1.26.0"
   description = "The specific version of Kubernetes to use."
 }

The plan thinks this will work

/content/images/2023/06/aksupgrade-19.png

However, the apply catches it and errors, as I would expect

/content/images/2023/06/aksupgrade-20.png

Applying with the proper version works, but because I hadn’t called out the node pool version, only the Control Plane was updated and I had to upgrade the node pools via the Portal.
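
You can see that control-plane/node-pool split from the CLI as well; a quick check (field names as they appear in az aks show output, worth verifying on your CLI version) compares the two versions:

$ az aks show -n cluster-driven-bear -g idj-east-rg --query "{controlPlane: kubernetesVersion, nodePool: agentPoolProfiles[0].orchestratorVersion}" -o table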

We can do the node pools at the same time using orchestrator_version

$ git diff
diff --git a/main.tf b/main.tf
index f3a3c5e..4de9093 100644
--- a/main.tf
+++ b/main.tf
@@ -222,9 +222,10 @@ resource "azurerm_kubernetes_cluster" "k8s" {
   }
 
   default_node_pool {
-    name       = "agentpool"
-    vm_size    = "Standard_D2_v2"
-    node_count = var.node_count
+    name                 = "agentpool"
+    vm_size              = "Standard_D2_v2"
+    node_count           = var.node_count
+    orchestrator_version = var.kubernetes_version
   }
   linux_profile {
     admin_username = "ubuntu"

In fact, because the node pools were already at that version, setting orchestrator_version now didn’t show anything to change in the plan

/content/images/2023/06/aksupgrade-21.png

Upgrade Scenario 2: Upgrading outside of TF

The other way we can go about this is to do our upgrades in the Portal, or otherwise outside of Terraform, and then update the Terraform to match (which gets our state updated for us). It would effectively be a no-op plan and apply.

The first check is to upgrade via the Portal and see what happens when we don’t adjust the Terraform to match.

Here I’ll update a setting on a VM, not touching anything with the AKS cluster

$ git diff
diff --git a/main.tf b/main.tf
index 4de9093..0f95986 100644
--- a/main.tf
+++ b/main.tf
@@ -165,7 +165,7 @@ resource "azurerm_linux_virtual_machine" "my_terraform_vm" {
 
   computer_name                   = "myvm"
   admin_username                  = "azureuser"
-  disable_password_authentication = true
+  disable_password_authentication = false
 
   admin_ssh_key {
     username   = "azureuser"

$ git add main.tf 
$ git commit -m "non AKS Change"
[clone_with_github d73c343] non AKS Change
 1 file changed, 1 insertion(+), 1 deletion(-)

$ git push
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 16 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 303 bytes | 303.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/idjohnson/tfAnsibleAzure.git
   a5eab90..d73c343  clone_with_github -> clone_with_github

Sadly, it appears it wants to downgrade (but this should be blocked on apply)

/content/images/2023/06/aksupgrade-22.png

Which it did, though with a somewhat odd error

/content/images/2023/06/aksupgrade-24.png

I think the strange error was because the plan kicked off while the cluster was not done upgrading. Doing it a second time gave me a more reasonable error message: “Downgrading Kubernetes version 1.26.0 to 1.25.5 is not allowed. Available upgrades: 1.26.3,1.27.1. Please see https://aka.ms/aks-supported-k8s-ver for version details.”

/content/images/2023/06/aksupgrade-25.png

That said, I can update Terraform to match the new portal versions and it’s happy again.

Cleanup

Assuming you don’t want to keep the cluster from our CLI steps, you can delete the cluster to save money

$ az aks delete -g showAKSUpgrade -n idjaksup01
Are you sure you want to perform this operation? (y/n): y
 - Running ..

As well as the group

$ az group delete --resource-group showAKSUpgrade
Are you sure you want to perform this operation? (y/n): y
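
If you’d rather skip the prompt and not wait on the delete, the usual convenience flags apply here too:

$ az group delete --resource-group showAKSUpgrade --yes --no-wait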

In Terraform, we comment the resources out to remove them

/content/images/2023/06/aksupgrade-27.png

which should remove it via TF Cloud

/content/images/2023/06/aksupgrade-28.png

And we can see it get removed via the Azure Portal as well

/content/images/2023/06/aksupgrade-29.png

Summary

We did two AKS upgrades, showed how to find the Service Principal and rotate its secret, and attempted to load some deprecated APIs. While we weren’t able to really show deprecation handling (the API server converted the deprecated versions on the fly, I might add - we saw that with the HPA), we did see the full upgrade flow.

We also tested how this works with Terraform using Terraform Cloud, covering a few scenarios including trying to upgrade too far, not updating the versions after a Portal upgrade, and more.

Hopefully this guide assuages any fears one might have about Azure Kubernetes Service and its upgrade process.
