A managed control plane does not mean a managed security posture. When you provision an Azure Kubernetes Service cluster with the portal defaults or a quickstart command, you get a working cluster in minutes, and almost every control that would keep an attacker out is either off, wide open, or pointed at a public endpoint. To secure an AKS cluster you have to understand that the defaults optimize for “it runs,” not “it is safe,” and the gap between those two states is where most real incidents live. This guide walks the full attack surface of a cluster and the concrete control that closes each opening, in the order an engineer should apply them.
The danger is not theoretical. A cluster created the easy way exposes its Kubernetes API server on a public IP, runs every pod in a flat network where any compromised container can talk to any other, ships with a Kubernetes authorization model that grants more than people assume, stores application credentials as base64 strings that anyone with namespace read access can decode, and admits any container image from any registry without a scan or a policy gate. Each of those is a separate door, and locking one while leaving the rest open buys far less protection than people expect.

This article centers on one organizing idea, the rule that everything else hangs from. Call it the two-RBAC-and-isolation rule: securing AKS means handling both Azure RBAC for the control plane and Kubernetes RBAC inside the cluster, and turning on pod isolation through network policy, because the control plane and the data plane each carry their own access surface that the defaults leave open. Most teams secure one of those two access surfaces, declare victory, and ship a cluster that an auditor or an attacker can walk straight into through the other. Holding both surfaces and the network between them in your head at once is the difference between a cluster that looks secured and one that is.
What you secure in an AKS cluster: the attack surface
Before any single control makes sense, you need an accurate picture of what a cluster is made of and where the boundaries sit. AKS is a managed Kubernetes offering, and the word managed has a precise meaning that determines who is responsible for what. Microsoft runs the control plane, which is the set of components that make scheduling and orchestration decisions: the API server, the scheduler, the controller manager, and the etcd datastore that holds cluster state. You do not patch those, you do not size them directly on the free tier, and you do not get shell access to them. That is the managed part.
Everything else is yours. The node pools, which are the virtual machine scale sets that run your workloads, belong to you in the sense that the operating system, the container runtime configuration, the kubelet posture, and the pods scheduled onto them are your responsibility to keep current and constrained. The networking model, whether you chose kubenet or one of the Azure CNI variants, is a decision you made and now own. The identity and access configuration on both the Azure side and the Kubernetes side is yours to set. The secrets your applications consume are yours to protect. The images your pods pull are yours to vet. This split is the shared-responsibility line for AKS, and reading it correctly is the first security act, because every control in this guide attaches to the part you own.
For a fuller treatment of where that managed boundary sits and how node pools and the network plugin fit together, the model is laid out in Azure Kubernetes Service (AKS) Explained, and this guide assumes that model rather than rebuilding it. The focus here is the security overlay on top of it.
What is the real attack surface of an AKS cluster?
The attack surface is five distinct surfaces, not one. The API server endpoint is the front door to the control plane. The pod network is the lateral movement plane. The two authorization systems decide who can do what. The secrets store decides what a compromised pod can read. The image and admission path decides what code runs at all. A real hardening effort addresses each on its own terms.
These five surfaces do not collapse into a single setting, and that is the trap. There is no master “secure this cluster” toggle. A team that flips on Azure RBAC integration and believes the cluster is now locked down has secured exactly one of the five and left the other four at their permissive defaults. The same goes for a team that builds a beautiful network policy regime while leaving the API server on a public IP and Kubernetes RBAC granting cluster-admin to a service account that a web pod mounts. The surfaces are independent, and the security of the whole is the security of the weakest one.
It helps to think of the surfaces in terms of what an attacker gains by crossing each. Crossing the API server surface gives an attacker the ability to issue commands to the orchestrator: create pods, read secrets, exec into containers, and, in the worst case, schedule a privileged pod on a node and break out to the host. Crossing the pod network surface gives an attacker who has already landed in one container the ability to reach every other container and every internal service, which is how a single vulnerable web frontend becomes a path to a database that should never have been reachable from that frontend. Crossing the authorization surface, whether Azure RBAC or Kubernetes RBAC, gives an attacker more verbs than the breach should have allowed. Crossing the secrets surface turns a read of one namespace into the credentials for everything that namespace talks to. Crossing the image and admission surface lets unreviewed or malicious code into the cluster in the first place. Mapping each surface to the concrete control that narrows it is the entire job, and the rest of this guide does exactly that, surface by surface.
The API server: from public endpoint to private cluster
The Kubernetes API server is the single most important component to lock down, because it is the control plane’s entry point and the thing every kubectl command, every controller, and every attacker who wants orchestration-level access talks to. In a default AKS cluster the API server is reachable over the public internet. It is protected by authentication, so it is not an open door in the sense that anyone can run commands, but it is a publicly resolvable, publicly reachable endpoint, and that means it is constantly subject to credential attacks, it is enumerable, and any flaw in the authentication or authorization path is exploitable from anywhere.
Should I use a private cluster for the API server?
Yes, for any cluster handling sensitive workloads or sitting in a regulated environment. A private cluster removes the public API server endpoint entirely and exposes the API server only through a private endpoint inside your virtual network, so the control plane is reachable from your network and nowhere else. That single change eliminates internet-facing exposure of the most sensitive surface.
The mechanism behind a private cluster is worth understanding rather than treating as a checkbox. When you create a cluster with the private-cluster option, AKS provisions a private endpoint in the node resource group and maps the API server to a private IP address inside your virtual network. A private DNS zone resolves the cluster’s FQDN to that private address. Depending on the DNS configuration you choose, the public FQDN is either disabled or resolves only to that private address, which is unreachable from outside your network. The practical consequence is that kubectl, your CI runners, and your administrators must now reach the API server from inside the virtual network or from a network peered or connected to it, typically through a jump host, a VPN, ExpressRoute, or a self-hosted CI agent that lives in the VNet. This is more operational work, and it is the correct trade for the workloads that warrant it.
If a fully private cluster is too heavy for your situation, the intermediate control is authorized IP ranges. With authorized IP ranges configured, the API server keeps its public endpoint but accepts connections only from the CIDR blocks you list: your office egress, your VPN concentrator, your CI provider’s documented ranges. This does not give you the network isolation of a private cluster, and IP allowlisting is a weaker control than removing the endpoint, but it shrinks the public exposure from the entire internet to a handful of known sources, and it is far better than the default of accepting connections from anywhere. The honest framing is that authorized IP ranges are a mitigation and a private cluster is the real fix; choose the private cluster where you can and use authorized IP ranges where a private cluster genuinely does not fit the operating model.
A point engineers miss: making the cluster private protects the API server endpoint, not the workloads. A private cluster with a public load balancer in front of an ingress controller still exposes your applications to the internet, which is usually what you want for a public-facing app. The private-cluster decision is about the control plane, and the application exposure is a separate decision made at the ingress and load balancer layer. Keeping those two decisions distinct prevents the common confusion where someone believes a private cluster has somehow taken their public website offline, or conversely believes a public cluster means their internal admin API is exposed when it is the ingress configuration that decides that. The ingress side, including TLS termination and the public path into the cluster, is covered in Configure AKS Ingress with NGINX and TLS, and the two layers should be reasoned about separately.
Pod-to-pod isolation: network policy is not on by default
Here is a fact that surprises people every time: in a default AKS cluster, every pod can reach every other pod, across every namespace, with no restriction. Kubernetes networking gives each pod an IP and a flat, fully routable network among them. There is no implicit segmentation, no default deny, nothing that stops a compromised frontend pod from opening a connection to a database pod, a metrics endpoint, or the internal admin service of another team’s application. Network policies are the control that fixes this, and they are not enabled unless you turn them on, deliberately, at cluster creation or by adopting an engine that enforces them.
This is the second leg of the two-RBAC-and-isolation rule, and it is the one teams most often skip because the cluster works fine without it. Applications talk to each other, traffic flows, nothing breaks. The absence of isolation is invisible until the day a single vulnerable container becomes the launch point for lateral movement across the whole cluster, and at that point the flat network is the attacker’s best friend.
How do network policies isolate pods?
A network policy is a Kubernetes object that selects a set of pods by label and declares which ingress and egress traffic is allowed to and from them. Once any policy selects a pod, that pod switches to default-deny for the direction the policy covers, so only the explicitly allowed traffic flows. Policies are additive, and the cluster must run a policy engine to enforce them.
The crucial mechanic is the switch from allow-all to default-deny, and it happens per pod and per direction the moment a policy selects that pod. If you write a single ingress policy that selects your database pods and allows traffic only from your API pods, then from that moment your database pods accept ingress from the API pods and from nothing else, because selecting them for ingress flipped them to deny everything not named. Pods that no policy selects remain fully open, which is why a partial rollout leaves gaps. The disciplined pattern is to start each namespace with an explicit default-deny policy for both ingress and egress, then add allow policies for the connections the application genuinely needs. That inverts the cluster’s posture from “everything is permitted unless forbidden” to “nothing is permitted unless allowed,” which is the posture a security review expects.
AKS gives you a choice of engine to enforce these objects. When you create a cluster you select a network policy provider, and the practical options are Azure Network Policy Manager, Calico, or the Cilium-based dataplane that Azure CNI powered by Cilium provides. The choice matters for what you can express. The basic Azure manager and Calico both enforce the standard Kubernetes NetworkPolicy object, which selects by pod label and namespace. Cilium extends the model with richer policy and identity-aware enforcement, and the Azure CNI powered by Cilium option folds that enforcement into the dataplane so policy is applied efficiently at scale. The non-negotiable part is that you pick one at creation time, because retrofitting a policy engine onto a running cluster is disruptive and sometimes requires a rebuild. Decide before you provision.
A default-deny baseline in a namespace looks like this, and it is the single most valuable object you will apply on the network surface.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
That empty podSelector selects every pod in the namespace, and naming both policy types with no allow rules means every pod in the namespace now denies all ingress and all egress. Nothing communicates until you add allow policies on top. The next object opens exactly the path the application needs, and nothing more.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-db
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: payments-api
      ports:
        - protocol: TCP
          port: 5432
Read it as a sentence: pods labeled payments-db accept ingress on TCP 5432 only from pods labeled payments-api, and because a default-deny policy already covers the namespace, everything else stays blocked. Because that baseline also denies egress, the payments-api pods need a matching egress allow toward payments-db before the connection actually succeeds. Build the application’s real communication graph this way, edge by edge, and the network stops being a free lateral-movement plane and becomes a set of named, reviewable connections. This is tedious the first time and routine after that, and it is exactly the kind of work a hands-on lab makes faster to internalize. The VaultBook Azure labs include AKS exercises where you apply a default-deny baseline and then verify, from inside a pod, that blocked connections actually fail and allowed ones succeed. That verification is what turns network policy from a YAML file you hope works into a control you have watched enforce.
What about egress and the control plane traffic?
Egress policy is the half teams forget. Locking down ingress stops unwanted connections into your pods, but a compromised pod’s most useful move is often outbound: exfiltrating data, calling a command-and-control host, or reaching an internal service it should never touch. An egress default-deny with explicit allows for the destinations a pod legitimately needs (its database, a specific API, the cluster DNS) closes that path. The one piece you must remember to allow is DNS resolution to the cluster’s CoreDNS service, because a pod that cannot resolve names will fail in confusing ways that look like application bugs rather than policy. A correct egress baseline denies everything, then explicitly allows UDP and TCP 53 to the kube-system DNS pods and the specific application destinations, and nothing else.
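A minimal sketch of that egress baseline for the payments namespace might look like the following. It assumes a default-deny policy for both directions is already applied in the namespace, that the cluster DNS pods in kube-system carry the label k8s-app: kube-dns (the label CoreDNS commonly uses; check with kubectl get pods -n kube-system --show-labels), and that namespaces carry the automatic kubernetes.io/metadata.name label. The policy names are illustrative, not taken from any existing cluster.

```yaml
# Allow every pod in the namespace to resolve names through cluster DNS.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress            # illustrative name
  namespace: payments
spec:
  podSelector: {}                   # selects all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        # namespaceSelector and podSelector in ONE list item are ANDed:
        # only DNS pods inside kube-system match.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns     # assumed CoreDNS label, verify first
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Allow the API pods to open connections to the database and nowhere else.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-egress-to-db      # illustrative name
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-api
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: payments-db
      ports:
        - protocol: TCP
          port: 5432
```

Because policies are additive, the payments-api pods end up with exactly two permitted outbound paths: DNS and the database. Any destination outside the cluster that a pod legitimately needs would be added the same way, as its own reviewable allow rule.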
There is a layer below network policy worth naming, because people conflate the two. Network security groups operate at the subnet and network-interface level in Azure, and they govern traffic at the VNet layer, including traffic into and out of the node subnet. Network policies operate inside the cluster at the pod level. They are complementary, not redundant. A network security group can keep the node subnet from talking to the wider network, and a network policy can keep pods within the cluster from talking to each other. A complete posture uses both: the NSG as the coarse VNet-level boundary and network policy as the fine pod-level segmentation. Treating one as a substitute for the other leaves a gap at whichever layer you skipped.
The two RBAC layers: where most clusters are quietly wide open
This is the heart of the two-RBAC-and-isolation rule and the place where the most dangerous misunderstanding lives. An AKS cluster has two completely separate authorization systems, and securing one does nothing for the other. The first is Azure RBAC, which governs Azure-level operations on the cluster resource: who can read the cluster, scale node pools, retrieve credentials, and manage the AKS resource through Azure Resource Manager. The second is Kubernetes RBAC, which governs operations inside the cluster: who can list pods, read secrets, create deployments, and exec into containers through the Kubernetes API. These are different planes with different identities, different role definitions, and different blast radii, and a cluster is only as secure as the weaker of the two.
How do Azure RBAC and Kubernetes RBAC secure access?
Azure RBAC controls Azure-plane actions on the cluster resource, such as retrieving cluster credentials and scaling node pools, through Azure role assignments. Kubernetes RBAC controls in-cluster actions, such as reading secrets and creating pods, through Roles and RoleBindings inside Kubernetes. They are independent systems, and a tight assignment on one plane grants nothing and protects nothing on the other.
Consider how an administrator actually reaches the cluster, because the two systems hand off in a specific place. When someone runs the command to fetch cluster credentials, that action is an Azure-plane operation gated by Azure RBAC: they need a role that permits listing the cluster’s user or admin credentials. Once they have a kubeconfig and start issuing kubectl commands, every one of those commands is an in-cluster operation gated by Kubernetes RBAC. So the question “can this person read my production secrets” has two halves. Can they get a working kubeconfig at all, which Azure RBAC decides, and once they have one, what does the Kubernetes authorization model let them do, which Kubernetes RBAC decides. A common and serious mistake is to lock down the Azure side carefully, restricting who can fetch credentials, while leaving the in-cluster side at a permissive default where any authenticated identity maps to broad permissions. The Azure side then looks audited and tidy, and the in-cluster side is where the real authority lives, unconstrained.
The cleanest configuration available today is AKS-managed Microsoft Entra integration with Azure RBAC for Kubernetes authorization. In that mode, the identities that authenticate to the cluster are Entra identities, the same users and groups and workload identities you already govern, and you assign their in-cluster permissions through Azure role assignments scoped to the cluster or a namespace, using the built-in Azure Kubernetes Service RBAC Reader, RBAC Writer, RBAC Admin, and RBAC Cluster Admin roles. The separately named Cluster User role is not one of these: it is an Azure-plane role that only permits fetching a user kubeconfig. This unifies the identity source so you are not maintaining a separate set of Kubernetes-only identities, and it lets you manage in-cluster authority with the same Azure role assignments and the same review process you use for everything else in the subscription. It does not erase the two-plane distinction, because the Azure-resource actions and the in-cluster actions are still separate, but it gives you one identity provider and one assignment surface for both, which is far easier to reason about and audit than running local Kubernetes accounts alongside Entra.
When you instead use Entra integration with native Kubernetes RBAC, authentication comes from Entra but authorization comes from Kubernetes Role, ClusterRole, RoleBinding, and ClusterRoleBinding objects that you write and apply. This is more granular and is the right choice when you need fine-grained, namespace-level rules that the Azure built-in roles do not express, and it is more work to keep correct because the role objects live in the cluster and must be version-controlled and reviewed like any other manifest. A Role grants verbs on resources within a single namespace; a ClusterRole grants them cluster-wide; the bindings attach a role to a subject, which is an Entra user, group, or service account. The most consequential rule to internalize is the difference between a RoleBinding and a ClusterRoleBinding: a ClusterRoleBinding to a high-privilege ClusterRole grants that authority in every namespace, and binding a broad ClusterRole like cluster-admin to a group “just to unblock the team” is how clusters end up with a dozen people holding full control they were never meant to have.
How should I scope Kubernetes RBAC for least privilege?
Grant the narrowest verbs on the narrowest resources in the narrowest scope that lets the work happen. Use namespaced Roles and RoleBindings by default and reserve ClusterRoles for genuinely cluster-wide needs. Never bind cluster-admin to a person or a workload as a shortcut, and never let an application service account hold permissions beyond the exact resources its code touches.
Least privilege on the Kubernetes plane is concrete, not aspirational. A developer who deploys to the payments namespace needs a Role in that namespace permitting get, list, watch, create, update, and delete on deployments, pods, services, and configmaps, bound with a RoleBinding to their Entra group, and they need nothing in other namespaces and nothing at the cluster scope. A read-only on-call viewer needs get, list, and watch on pods and logs in the namespaces they support, and not the ability to exec into a container or read a secret. A CI deployer needs to apply manifests to its target namespace and nothing more. Writing these roles narrowly is more effort than binding everyone to a broad ClusterRole, and that effort is the security. The test for any role is simple: if the identity it binds to were fully compromised tomorrow, what could the attacker do, and is that the smallest blast radius you can arrange while letting the legitimate work proceed.
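As a rough sketch of the developer and on-call roles just described, assuming Entra integration with native Kubernetes RBAC, the manifests could look like this. The role names are hypothetical and the all-zero GUIDs are placeholders for the object IDs of your own Entra groups, which is how group subjects are referenced in this mode.

```yaml
# Developer role: full lifecycle on a small set of resources, one namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-developer                 # hypothetical name
  namespace: payments
rules:
  - apiGroups: ["apps"]                    # deployments live in the apps group
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
  - apiGroups: [""]                        # "" is the core API group
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-developer-binding         # hypothetical name
  namespace: payments
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: "00000000-0000-0000-0000-000000000000"   # placeholder: Entra group object ID
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: payments-developer
---
# On-call viewer: read pods and their logs; no exec, no secrets.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-oncall-viewer             # hypothetical name
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
```

Neither role mentions secrets or the pods/exec subresource, so neither identity can read credentials or open a shell in a container. If the team deploys with kubectl apply, the developer role will likely also need the patch verb, since apply updates existing objects with PATCH requests. Running kubectl auth can-i against each identity is a quick way to confirm that the effective permissions match the intent.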
Service accounts deserve special attention because they are the identities your workloads run as, and they are the identities an attacker inherits the instant they compromise a pod. Every pod runs as a service account, and if you do not specify one it runs as the namespace default. By default a pod gets that service account’s token mounted into the container filesystem, which means a compromised pod can present that token to the API server and do whatever the service account is permitted to do. Two disciplines follow. First, set automountServiceAccountToken to false on any pod that does not need to call the Kubernetes API, which is most application pods, so a compromise yields no usable API token. Second, for the pods that do need API access, give each its own dedicated service account bound to a Role that grants only the verbs that pod’s code actually uses, so the token a compromise yields is nearly worthless. The default service account in every namespace should have no role bindings at all, so that inheriting it grants nothing.
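The first discipline can be sketched as follows, assuming a web workload that never calls the Kubernetes API. The service account name, deployment name, and image reference are placeholders for illustration.

```yaml
# Dedicated identity for a workload that has no reason to talk to the API server.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-web                       # hypothetical name
  namespace: payments
automountServiceAccountToken: false        # no token for pods using this account
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-web
  namespace: payments
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-web
  template:
    metadata:
      labels:
        app: payments-web
    spec:
      serviceAccountName: payments-web
      # Pod-level setting; it takes precedence over the service account's value,
      # so stating it here keeps the intent visible in the workload manifest.
      automountServiceAccountToken: false
      containers:
        - name: web
          image: example.azurecr.io/payments-web:1.0.0   # placeholder image
```

With both settings in place, the usual token directory under /var/run/secrets/kubernetes.io/serviceaccount is not mounted, so a compromised container has no API credential to present. The same automountServiceAccountToken: false field can be set on each namespace's default service account as a backstop for pods that forget to name their own.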
Secrets and workload identity: stop storing credentials in the cluster
A Kubernetes Secret feels like a secure place to put a database password or an API key. It is not. A Kubernetes Secret is base64-encoded data stored in etcd, and base64 is an encoding, not encryption; anyone who can read the secret object can decode it instantly. While AKS does encrypt etcd at rest on the platform, that protects the disk, not the access path. The access path is Kubernetes RBAC, and if any identity can get or list secrets in a namespace, that identity can read every credential stored there in plaintext. Storing your real production credentials as plain Kubernetes secrets means their security is entirely the security of your secrets RBAC, which on most clusters is far looser than people assume because commonly bound built-in roles such as edit and admin include secret read access.
How do I handle secrets and workload identity securely?
Keep credentials in Azure Key Vault rather than in Kubernetes Secrets, and give pods access to them through workload identity rather than a stored credential. Workload identity federates a Kubernetes service account with a Microsoft Entra identity, so a pod authenticates to Key Vault using a short-lived token tied to its service account, with no secret stored in the cluster at all.
Workload identity is the modern answer and it changes the shape of the problem. Instead of putting a credential somewhere and then protecting that location, you remove the stored credential entirely. The mechanism is federated identity: you create an Entra workload identity, you establish a federated credential that trusts tokens issued by the cluster’s OIDC issuer for a specific Kubernetes service account in a specific namespace, and you grant that Entra identity access to the Key Vault secrets it needs. At runtime, a pod using that service account receives a projected, short-lived service account token, exchanges it for an Entra token through the federation trust, and uses that Entra token to read the secret from Key Vault. No password lives in a manifest, no key sits in a Kubernetes Secret, and the token the pod holds expires quickly and is scoped to exactly that identity. A compromised pod can only reach what its federated identity was granted, and only for as long as its token is valid. The full setup path, including the OIDC issuer, the federated credential, and the service account annotations, is walked through in Set Up Workload Identity in AKS, and it is the path every new cluster should adopt from the start.
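On the cluster side, the wiring described above comes down to one annotation and one label. The sketch below assumes the OIDC issuer and workload identity features are already enabled on the cluster and that a federated credential exists whose subject is system:serviceaccount:payments:payments-api. The client ID is a placeholder for your own identity's client ID, and the image reference is illustrative.

```yaml
# Service account that the federated credential trusts.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api
  namespace: payments
  annotations:
    # Placeholder: client ID of the Entra identity granted access to Key Vault.
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
  namespace: payments
  labels:
    app: payments-api
    # Opts the pod in to the workload identity webhook, which injects the
    # projected token volume and the related environment variables.
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: payments-api
  containers:
    - name: api
      image: example.azurecr.io/payments-api:1.0.0   # placeholder image
```

Nothing in either manifest is a secret: the client ID identifies the identity but cannot authenticate as it. Authentication depends on the short-lived projected token, which only a pod running as this service account in this namespace receives.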
The bridge that makes Key Vault secrets appear to a pod as files, when you need that ergonomics, is the Secrets Store CSI driver with the Azure Key Vault provider. It mounts secrets from Key Vault into the pod as a volume, fetching them at pod start using the pod’s workload identity, so the secret material comes from Key Vault on demand rather than living in a Kubernetes Secret. You get the developer experience of a mounted secret without the credential ever resting in etcd. The caveat is environment variables: exposing a value that way relies on the driver’s optional sync to a Kubernetes Secret, which does place a copy in etcd, so prefer file mounts where the application allows it. If you enable secret rotation on the driver, the mounted values refresh on an interval, so a rotated Key Vault secret reaches the pods without a redeploy. This combination, workload identity plus the CSI driver, is the pattern that retires plain Kubernetes secrets for good while keeping applications simple.
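A hedged sketch of the file-mount pattern follows. It assumes the Secrets Store CSI driver and the Azure Key Vault provider are installed on the cluster and that the payments-api service account is already federated with an Entra identity that can read the vault. The vault name, secret name, IDs, and mount path are all placeholders.

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: payments-kv                        # hypothetical name
  namespace: payments
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: "00000000-0000-0000-0000-000000000000"   # placeholder: workload identity client ID
    keyvaultName: "example-kv"                         # placeholder vault name
    tenantId: "00000000-0000-0000-0000-000000000000"   # placeholder tenant ID
    objects: |
      array:
        - |
          objectName: db-password          # placeholder secret name in Key Vault
          objectType: secret
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
  namespace: payments
spec:
  serviceAccountName: payments-api         # the federated service account
  containers:
    - name: api
      image: example.azurecr.io/payments-api:1.0.0     # placeholder image
      volumeMounts:
        - name: kv-secrets
          mountPath: /mnt/secrets          # secret appears as /mnt/secrets/db-password
          readOnly: true
  volumes:
    - name: kv-secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: payments-kv
```

The application reads the credential from a file at startup, and no Secret object is created in the namespace, so there is nothing for a get or list on secrets to return.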
Centralizing secrets in Key Vault also gives you the controls a secrets platform should have and a Kubernetes Secret never will: access policies or Azure RBAC on the vault, full audit logging of every secret read, soft-delete and purge protection so a deleted secret is recoverable, versioning, and managed rotation. The principle is to treat the cluster as a consumer of secrets, never a store of them. The store is Key Vault, the identity is workload identity, and the cluster holds nothing it would hurt to lose. The deeper treatment of vault access models, network exposure, and rotation lives in the Key Vault material in the series, and the takeaway for AKS specifically is that the moment you find yourself typing a real credential into a Kubernetes manifest, you have chosen the wrong pattern.
Image and admission policy: deciding what is allowed to run
Every other control assumes the code already running in your pods is code you meant to run. The image and admission surface is where you enforce that. By default an AKS cluster will pull and run any image from any registry your nodes can reach, with no check on where it came from, what is inside it, or whether it carries known vulnerabilities. That default makes the supply chain an open question, and it is the surface that turns a compromised base image or a typosquatted dependency into running code inside your cluster.
How do I enforce image scanning and admission policy?
Scan images for vulnerabilities before and after they reach the registry, and use an admission controller such as Azure Policy for AKS to reject pods that violate your rules, for example images from untrusted registries, containers running as root, or privileged pods. Scanning finds the known weaknesses, and admission policy enforces the rules that stop a non-compliant pod from ever scheduling.
Scanning and admission are two separate jobs and you want both. Image scanning, through Microsoft Defender for Containers and the vulnerability assessment it provides, inspects images in your Azure Container Registry and the images actually running in the cluster, and reports the known vulnerabilities it finds. That is detection: it tells you that the base image in your payments service carries a critical CVE so you can rebuild it. Admission policy is prevention: it sits in the path of every pod creation request and rejects the ones that break your rules before they run. The two work together because scanning without enforcement produces a report nobody acts on, and enforcement without scanning blocks on the rules you wrote but tells you nothing about the vulnerabilities inside images that pass those rules.
Azure Policy for AKS is the managed admission-control path, and it works by deploying the Open Policy Agent Gatekeeper admission controller into the cluster and translating Azure Policy definitions into Gatekeeper constraints. You assign policies at the subscription or resource-group level, and they enforce in the cluster. The built-in policy set covers the rules most teams want first: disallow privileged containers, disallow host namespace sharing, require that containers run as a non-root user, restrict the registries images may come from, enforce resource limits, and block the dangerous capabilities and host-path mounts that make a container breakout easy. You can run a policy in audit mode first, which reports violations without blocking, then switch it to deny once you have fixed the existing workloads that would fail. That audit-then-deny rollout is the safe way to introduce admission control onto a running cluster without breaking deployments on day one.
The single most valuable admission rule for most clusters is the registry allowlist, because it cuts off the entire category of arbitrary-image risk. If your policy only admits images from your own Azure Container Registry, then an attacker who somehow gets the ability to create a pod still cannot point it at a malicious public image, and a developer cannot accidentally ship a random image from an untrusted source. Pair that with the non-root and no-privileged rules and you have closed the most common paths from “ran a container” to “controlled the node.” VaultBook’s AKS labs include exercises that walk through assigning an Azure Policy initiative to a cluster, watching a non-compliant pod get rejected at admission, and then adjusting the workload to pass, which is the loop that makes admission policy concrete rather than abstract.
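For orientation, here is a sketch of a pod spec shaped to pass the rules named above: registry allowlist, non-root, no privilege, and resource limits. It assumes the allowlist admits only example.azurecr.io, which stands in for your own registry. The user ID and resource figures are illustrative values, not recommendations.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
  namespace: payments
spec:
  containers:
    - name: api
      # Must match the allowed-registry rule; a public image here would be rejected.
      image: example.azurecr.io/payments-api:1.0.0     # placeholder registry and tag
      securityContext:
        runAsNonRoot: true                 # kubelet refuses to start the container as UID 0
        runAsUser: 10001                   # illustrative non-root UID
        privileged: false
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]                    # start from no Linux capabilities
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 256Mi
```

Writing workloads in this shape before switching a policy from audit to deny keeps the enforcement day uneventful: the audit results show which existing pods lack these fields, and each one can be fixed with the same handful of lines.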
Controlling outbound traffic to the internet
Network policy governs traffic among pods, but a separate question is what the cluster as a whole may reach on the public internet, and the default answer is everything. By default the nodes have unrestricted outbound access, which means a compromised workload can call out to any endpoint to exfiltrate data or fetch a second-stage payload. Egress control closes that, and it is a surface teams often discover only after a security review flags it.
The mechanism is to route the node subnet’s outbound traffic through a controlled choke point, typically Azure Firewall, using a user-defined route that sends internet-bound traffic to the firewall rather than straight out. The firewall then applies allow rules, including fully qualified domain name filtering, so the cluster can reach the specific destinations it needs (the Azure platform endpoints, your container registry, the package sources your build process uses) and is denied everything else. AKS documents the exact set of platform FQDNs and ports the cluster requires to function, and the disciplined approach is to allow that required set plus your application’s known destinations, then default-deny the rest. The result is that even a fully compromised pod cannot reach an arbitrary external command-and-control host, because the path off the node passes through a firewall that does not permit it.
This pairs with the network policy egress rules rather than replacing them. The egress network policy controls which destinations a pod may attempt at the pod level inside the cluster, and the firewall controls what actually leaves the node subnet at the VNet level, so a connection has to satisfy both. As with the node subnet network security group, the layering is the point: pod-level egress policy and VNet-level egress firewalling catch different things, and a thorough posture uses both. The trade-off is operational, because FQDN allowlists need maintenance as the application’s legitimate destinations change, and a too-tight rule will break a deployment when a build suddenly needs a source the firewall does not permit. The honest framing is that egress control is higher-effort than the inbound controls and is most warranted where data exfiltration is a serious concern, such as clusters handling regulated or sensitive data, where the cost of the maintenance is small against the value of cutting off the exfiltration path.
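The allow-then-default-deny behavior of an FQDN filter can be sketched as follows. This models only the matching decision, not Azure Firewall itself. The list is illustrative and deliberately incomplete: take the authoritative set from the AKS documentation, and treat `contoso.azurecr.io` as a hypothetical registry.

```python
from fnmatch import fnmatchcase

# Illustrative, incomplete allowlist. Anything not matched is denied.
ALLOWED_FQDNS = [
    "mcr.microsoft.com",
    "*.data.mcr.microsoft.com",
    "management.azure.com",
    "login.microsoftonline.com",
    "contoso.azurecr.io",  # hypothetical application registry
]


def egress_allowed(host: str) -> bool:
    """Default-deny: a host passes only if it matches an allowlisted pattern."""
    host = host.lower().rstrip(".")
    return any(fnmatchcase(host, pattern) for pattern in ALLOWED_FQDNS)


if __name__ == "__main__":
    for host in (
        "mcr.microsoft.com",
        "eastus.data.mcr.microsoft.com",
        "c2.attacker.example",
        "mcr.microsoft.com.attacker.example",
    ):
        print(host, "->", "allow" if egress_allowed(host) else "deny")
```

Because every pattern is anchored to the full host name, a destination that only embeds an allowed name as a prefix is denied along with everything else unlisted.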
Node and upgrade hygiene: the surface that decays on its own
The controls above are configuration you set once and review. Node and upgrade hygiene is different, because it decays whether or not you touch the cluster. A node image that was current at provisioning accumulates unpatched kernel and OS vulnerabilities over time, a Kubernetes version drifts out of support, and a cluster that was hardened a year ago and never upgraded is now running known-vulnerable components. Hygiene is the maintenance discipline that keeps the rest of the posture from rotting.
Three mechanisms carry most of the load. Node image upgrades replace the OS image on your nodes with the latest patched version, and AKS can apply these automatically on a channel you choose, so security patches reach the nodes without a manual cycle. Kubernetes version upgrades move the cluster’s control plane and node pools to a newer, supported Kubernetes release, and staying within the window of supported versions matters because out-of-support versions stop receiving security fixes. Automatic upgrade channels let you opt into having AKS manage these on a cadence, with the node OS upgrade channel applying node image and security patches, the patch channel keeping the cluster on the latest patch release of its Kubernetes minor version, and the stable or rapid channels handling Kubernetes minor versions, balanced against maintenance windows you define so upgrades land when you expect them rather than mid-incident.
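A drift check against the supported window is easy to automate. The sketch below assumes you supply the list of currently supported minor versions yourself, for example from the output of `az aks get-versions` for your region. The version numbers shown are placeholders, not a statement of what AKS supports today.

```python
def minor_of(version: str) -> tuple:
    """Reduce '1.29.4' or 'v1.29' to the (major, minor) pair (1, 29)."""
    major, minor, *_ = version.lstrip("v").split(".")
    return int(major), int(minor)


def is_supported(cluster_version: str, supported_minors: list) -> bool:
    """True when the cluster's minor version is in the supported set."""
    return minor_of(cluster_version) in {minor_of(v) for v in supported_minors}


if __name__ == "__main__":
    supported = ["1.28", "1.29", "1.30"]  # placeholder values
    for version in ("1.29.4", "1.26.10"):
        state = "supported" if is_supported(version, supported) else "OUT OF SUPPORT"
        print(version, "->", state)
```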
The hardening choice within node hygiene is to minimize what runs on the node in the first place. Using a hardened, minimal node OS reduces the installed package surface an attacker can exploit after a breakout. Disabling SSH access to nodes, which AKS supports, removes a direct management path that is rarely needed in a properly automated cluster and is a frequent target. Constraining what pods can do at the node level through the admission policies above, no privileged containers, no host namespace sharing, no arbitrary host-path mounts, means that even a fully compromised container has a hard time reaching the node it runs on. The node is the boundary between a contained pod compromise and a host compromise, and these measures keep that boundary intact.
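The node-level constraints named above map onto a handful of pod spec fields, so a workload can be checked before it ever meets the admission controller. This is a minimal sketch of such a pre-check over a pod spec parsed into a dict. The function name is hypothetical, and it inspects only `containers`, not init or ephemeral containers.

```python
def pod_violations(pod_spec: dict) -> list:
    """List the node-boundary rules a pod spec would break."""
    problems = []
    # Sharing host namespaces gives the pod a view into the node itself.
    for flag in ("hostNetwork", "hostPID", "hostIPC"):
        if pod_spec.get(flag):
            problems.append(f"{flag} is enabled")
    # hostPath volumes expose the node filesystem to the container.
    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            problems.append(f"volume {volume.get('name')} mounts a hostPath")
    pod_sc = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        sc = container.get("securityContext", {})
        name = container.get("name")
        if sc.get("privileged"):
            problems.append(f"container {name} is privileged")
        # The container-level setting overrides the pod-level one when present.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            problems.append(f"container {name} does not set runAsNonRoot")
    return problems


if __name__ == "__main__":
    risky = {
        "hostNetwork": True,
        "volumes": [{"name": "host", "hostPath": {"path": "/"}}],
        "containers": [{"name": "web", "securityContext": {"privileged": True}}],
    }
    for problem in pod_violations(risky):
        print(problem)
```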
There is an interaction with the rest of the posture worth naming. Upgrades and node image refreshes will recreate nodes and reschedule pods, and a cluster that depends on local node state, manually applied node changes, or pods that cannot tolerate being moved will fight the very maintenance that keeps it secure. A cluster designed so that any node can be replaced at any time, with workloads that survive rescheduling and configuration applied declaratively, is a cluster you can keep patched without drama. Security hygiene and operational hygiene are the same discipline here: the cluster you can upgrade safely is the cluster you can keep secure.
Least privilege applied across the whole cluster
Least privilege is not one setting, it is a posture you apply on every surface, and it is worth gathering the specific applications in one place because they reinforce each other. On the Azure plane, the identities that can fetch cluster credentials or manage the cluster resource should be scoped to the people and pipelines that genuinely need them, using Azure role assignments at the cluster scope rather than broad subscription-level grants. On the Kubernetes plane, every human and every workload holds a Role scoped to the namespace and verbs it needs and nothing wider, with cluster-admin reserved for break-glass and bound to no one by default. On the network, every pod sits behind a default-deny baseline with explicit allows only for its real connections. On secrets, every workload’s federated identity grants access to only the specific Key Vault secrets its code reads. On the node, every container runs as non-root, unprivileged, and without host access unless a reviewed exception says otherwise.
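The Kubernetes-plane rule about cluster-admin is one of the easier ones to audit mechanically. The sketch below assumes its input is the `items` list from `kubectl get clusterrolebindings -o json`; the function name is hypothetical.

```python
def cluster_admin_subjects(bindings: list) -> list:
    """Return every subject bound to the cluster-admin ClusterRole."""
    found = []
    for binding in bindings:
        if binding.get("kind") != "ClusterRoleBinding":
            continue
        role_ref = binding.get("roleRef", {})
        if role_ref.get("kind") == "ClusterRole" and role_ref.get("name") == "cluster-admin":
            # A binding may legitimately have no subjects at all.
            for subject in binding.get("subjects") or []:
                found.append(f"{subject.get('kind')}/{subject.get('name')}")
    return found


if __name__ == "__main__":
    sample = [
        {
            "kind": "ClusterRoleBinding",
            "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
            "subjects": [{"kind": "User", "name": "dev@example.com"}],
        },
        {
            "kind": "ClusterRoleBinding",
            "roleRef": {"kind": "ClusterRole", "name": "view"},
            "subjects": [{"kind": "Group", "name": "readers"}],
        },
    ]
    print(cluster_admin_subjects(sample))
```

On a real cluster, expect the built-in binding to the `system:masters` group to appear in the result; the entries worth reviewing are the ones added on top of it.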
The reason to think of these together is that least privilege compounds. A single control applied in isolation limits one thing. All of them applied together mean that a compromise at any one point yields almost nothing useful, because the next surface the attacker reaches is also constrained. A compromised web pod that runs as non-root, holds no mounted service account token, sits behind an egress default-deny, and whose namespace secrets live in Key Vault behind a federated identity it was not granted, is a compromise that goes nowhere. That is the design goal: not to make any single breach impossible, which no control can promise, but to make every breach small. This is also the principle that connects AKS hardening to the broader strategy in Azure Zero Trust Architecture Explained, where the same logic of explicit verification and least privilege per request governs the whole estate, and an AKS cluster is one well-segmented citizen of it.
The common misconfigurations and the breaches they enable
It helps to see the failure patterns as named cases, because engineers report the same ones repeatedly and each maps to a specific hardening step. Naming them turns a vague worry into a checklist you can run against your own cluster.
The first pattern is the public API server that should have been a private cluster. The cluster works, kubectl is convenient from anywhere, and the control plane’s most sensitive endpoint sits on the public internet absorbing credential-stuffing attempts and waiting for an authentication flaw. The fix is the private cluster, or at minimum authorized IP ranges, applied at creation time.
The second pattern is the flat network with no policy, where every pod can reach every other pod. The breach it enables is lateral movement: one vulnerable container becomes a path to every service in the cluster. The fix is a default-deny network policy baseline per namespace with explicit allows for the real communication graph, enforced by a policy engine chosen at creation.
The third pattern is the tight Azure RBAC with wide-open Kubernetes RBAC. The Azure side is audited and restrictive, and inside the cluster an identity maps to broad permissions, often through a ClusterRoleBinding to a high-privilege role that was added to unblock someone. The breach it enables is that anyone who gets a kubeconfig, which the Azure side was supposed to be the only gate on, can do far more than intended once inside. The fix is to treat the two RBAC planes as separate and apply least privilege to both, ideally through Entra integration with Azure RBAC for Kubernetes authorization so the two share one identity source.
The fourth pattern is plain Kubernetes secrets where workload identity belongs. Real credentials sit base64-encoded in etcd, readable by anyone with secret access in the namespace, with no audit trail on reads and no rotation. The breach it enables is that a single namespace read yields every downstream credential. The fix is Key Vault plus workload identity, with the cluster holding no durable secrets at all.
The fifth pattern is unscanned images admitted without policy, where any image from anywhere runs without a check. The breach it enables is a compromised or vulnerable image becoming running code, the supply-chain entry point. The fix is Defender for Containers scanning plus Azure Policy admission rules, starting with a registry allowlist and the non-root and no-privileged constraints.
The sixth pattern is neglected node and version upgrades, where the cluster was hardened once and left to drift into running known-vulnerable components. The breach it enables is the exploitation of a patched-elsewhere vulnerability that this cluster never received the fix for. The fix is automatic upgrade channels for both node OS patches and Kubernetes versions, with maintenance windows, plus a minimal hardened node OS and disabled node SSH.
Read together, the six patterns are simply the five surfaces plus the decay problem, each described as the mistake teams actually make. If your cluster does not match any of the six, you have done the work. If it matches one, you have found your next task.
A worked example: the same breach in two clusters
Abstractions land better against a concrete scenario, so trace one realistic incident through an unhardened cluster and then through a hardened one. The starting point is the same: a public-facing web application running in a pod has a remote-code-execution vulnerability in a dependency, and an attacker exploits it to get shell access inside that one container. This is a common entry point and it is not preventable by cluster configuration alone, because the flaw is in the application. What differs entirely is what happens next, and the difference is the whole argument for hardening.
In the unhardened cluster, the attacker lands in the web pod and immediately finds a service account token mounted in the container, because automounting was left at its default. The token belongs to the namespace default service account, which a previous engineer bound to a broad role to unblock a deployment, so the attacker presents the token to the API server and discovers they can list and read secrets across the namespace. Those secrets are plain Kubernetes Secrets holding the production database password and a third-party API key, both decoded in seconds. The network is flat, so the attacker opens a direct connection from the web pod to the database pod and to an internal admin service that was never meant to be reachable from the frontend. The API server is public, so the attacker can also use the stolen token from their own machine outside the cluster. Within minutes, a single application bug has become full namespace compromise and a foothold for lateral movement, and nothing in the cluster’s configuration slowed any step of it.
In the hardened cluster, the attacker lands in the same web pod and finds no service account token, because the pod was set to not automount one it does not need. There is nothing to present to the API server. The attacker tries to reach the database pod directly and the connection is refused, because the namespace runs a default-deny network policy and the only allowed path to the database is from the API pod on a specific port, not from the web pod. The attacker looks for stored credentials and finds none, because the application reads its database connection from Key Vault at startup through workload identity, and the short-lived token that fetched it has already expired and was scoped only to the specific secrets the workload’s federated identity was granted, not to the vault as a whole. Meanwhile the exec into the container and the unusual outbound probing have generated runtime alerts in Defender for Containers, and the security team is already looking at the pod. The same application bug has become a contained incident in a single pod with no path forward, detected while it was small. The vulnerability was identical; the blast radius was the entire difference, and the blast radius is exactly what the controls in this guide govern.
Where teams get the rollout order wrong
Knowing the controls is not the same as sequencing them, and a wrong order causes self-inflicted outages that sour a team on security work. The two surfaces that must be decided at creation, because they cannot be retrofitted without a rebuild, are the network policy engine and the cluster’s private-versus-public mode along with Entra integration. Picking those after the cluster is running means provisioning a new cluster and migrating, which is why they belong at the very front of the plan, before a single workload deploys. Teams that skip this and try to add a policy engine later discover the cost the hard way.
The controls that can be added to a running cluster should be rolled out in a sequence that avoids breaking the workloads already there. Admission policy is the clearest case: switching a deny policy on against a cluster full of non-compliant pods will block the next deployment of every one of them, so the correct order is to assign the policy in audit mode, read the report of what would fail, fix those workloads, and only then switch to deny. Network policy has the same trap inverted, where applying a default-deny baseline without first having mapped and allowed the real communication graph, including DNS egress, will sever connections the application depends on and produce failures that look like application bugs. The order there is to observe the traffic, write the allow rules, apply them, and add the default-deny last, so the deny lands on a namespace whose legitimate traffic is already permitted.
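The network policy half of that ordering can be sketched as two manifests emitted in the sequence described: the allow first, the default-deny last. The manifests are built as Python dicts and printed as JSON, which `kubectl apply -f` accepts. The namespace `shop`, the policy names, and the DNS pod label `k8s-app: kube-dns` are assumptions to check against your own cluster.

```python
import json


def default_deny(namespace: str) -> dict:
    """Select every pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_dns_egress(namespace: str) -> dict:
    """Let every pod in the namespace resolve names through cluster DNS."""
    dns_pods = {
        # Assumed labels: verify them on the DNS pods in kube-system.
        "namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": "kube-system"}},
        "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
    }
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [{
                "to": [dns_pods],
                "ports": [{"protocol": "UDP", "port": 53}, {"protocol": "TCP", "port": 53}],
            }],
        },
    }


def rollout_order(namespace: str) -> list:
    """Allows first, deny last, so the deny lands on already-permitted traffic."""
    return [allow_dns_egress(namespace), default_deny(namespace)]


if __name__ == "__main__":
    for manifest in rollout_order("shop"):  # "shop" is a hypothetical namespace
        print(json.dumps(manifest, indent=2))
```

A real namespace would carry further allow policies for its own communication graph between the DNS rule and the deny; only the two universal pieces are shown here.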
RBAC tightening has its own ordering risk. Pulling broad permissions away from identities that quietly depended on them breaks pipelines and on-call workflows, so the move is to inventory what each identity actually uses, write the narrow roles, grant them alongside the broad ones, confirm everything still works, and then remove the broad grants. Secrets migration follows the same shape: stand up the Key Vault path and workload identity, point the application at it, confirm it reads correctly, and only then delete the old Kubernetes Secret, never the reverse. The principle across all of them is to add the new constraint in a non-breaking mode, prove the legitimate work survives it, and then remove the old permissive path, which turns a hardening rollout from a series of outages into a series of quiet, reviewable changes.
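The inventory step for RBAC can start from the Kubernetes audit log. The sketch below is one possible way to turn observed API calls into a candidate narrow Role. It relies on the `verb`, `user.username`, and `objectRef` fields of audit events, ignores subresources for brevity, and uses a hypothetical role name. Treat its output as a draft for review, not a finished grant.

```python
from collections import defaultdict


def role_from_usage(events: list, username: str, namespace: str) -> dict:
    """Build a namespaced Role covering only what one identity was seen doing."""
    used = defaultdict(set)
    for event in events:
        ref = event.get("objectRef") or {}
        verb = event.get("verb")
        if event.get("user", {}).get("username") != username or not verb:
            continue
        if ref.get("namespace") != namespace or "resource" not in ref:
            continue
        # The core API group is recorded as an empty or absent apiGroup.
        used[(ref.get("apiGroup", ""), ref["resource"])].add(verb)
    rules = [
        {"apiGroups": [group], "resources": [resource], "verbs": sorted(used[(group, resource)])}
        for group, resource in sorted(used)
    ]
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "observed-usage", "namespace": namespace},  # hypothetical name
        "rules": rules,
    }


if __name__ == "__main__":
    deployments = {"resource": "deployments", "apiGroup": "apps", "namespace": "shop"}
    events = [
        {"verb": "get", "user": {"username": "ci"}, "objectRef": deployments},
        {"verb": "patch", "user": {"username": "ci"}, "objectRef": deployments},
        {"verb": "list", "user": {"username": "ci"},
         "objectRef": {"resource": "pods", "namespace": "shop"}},
    ]
    print(role_from_usage(events, "ci", "shop")["rules"])
```

Granting a Role drafted this way alongside the existing broad one, and watching for denials before removing the broad grant, follows the add-then-remove sequence described above.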
How to verify the posture: proving the controls actually hold
A control you have configured is not a control you have verified. The gap between “I applied the setting” and “the setting does what I think” is where false confidence lives, and security verification is the practice of closing it by testing each surface from the attacker’s side rather than reading your own configuration back to yourself.
How do I verify my AKS cluster is actually hardened?
Test each surface as an attacker would. Confirm the API server rejects connections from outside the allowed network, exec into a pod and confirm blocked egress fails while allowed egress succeeds, attempt an action your RBAC should deny and confirm it is denied, check that pods hold no usable service account token, and try to schedule a non-compliant pod and confirm admission rejects it.
Take the surfaces one at a time. For the API server, the verification is a connection attempt from outside your authorized network or, for a private cluster, from outside the VNet, and the correct result is that it cannot reach the endpoint at all. If you can run kubectl from your laptop on a coffee-shop network and it works against a cluster you believe is private or IP-restricted, the control is not in force. For network policy, the verification runs from inside the cluster: exec into a pod that a policy should isolate, attempt a connection to a destination the policy forbids, and confirm it times out or is refused, then attempt an allowed connection and confirm it succeeds. A policy that blocks nothing because no policy selects the pod, or because the engine was never enabled, fails this test silently otherwise. For RBAC, the verification is to assume an identity, through impersonation or by acquiring its actual token, and attempt an action it should not be allowed, such as reading a secret as a viewer or deploying to a namespace a developer has no role in, and confirm the API server returns a forbidden response. The kubectl auth can-i command answers these questions directly and should be part of any review: asking can-i for a sensitive verb as a given subject tells you immediately whether the role you wrote grants what you intended.
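The RBAC check lends itself to a scripted review. A minimal sketch, assuming `kubectl` is on the path and the caller is allowed to impersonate the subject being tested; the helper names and the example service account are hypothetical.

```python
import subprocess


def can_i_command(verb: str, resource: str, subject: str, namespace: str) -> list:
    """Build the kubectl auth can-i invocation for an impersonated subject."""
    return [
        "kubectl", "auth", "can-i", verb, resource,
        "--as", subject,
        "--namespace", namespace,
    ]


def is_denied(verb: str, resource: str, subject: str, namespace: str) -> bool:
    """Run the check; kubectl prints 'no' when the action is not allowed."""
    result = subprocess.run(
        can_i_command(verb, resource, subject, namespace),
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() == "no"


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch works offline.
    print(" ".join(can_i_command("get", "secrets", "system:serviceaccount:shop:web", "shop")))
```

Running `is_denied` for each sensitive verb and subject pair in a scheduled job turns the review into a regression test: a role change that widens access shows up as a check that stops returning a denial.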
For secrets, the verification is to confirm that the pods running your workloads do not have credentials sitting in their environment or in mounted Kubernetes secrets, and that reads against Key Vault appear in the vault’s audit log tied to the workload identity, which proves the federated path is the one in use rather than a stray stored key. For admission policy, the verification is to attempt to create a pod that violates a rule, a privileged pod, a root container, an image from a disallowed registry, and confirm the API server rejects it at admission with a message naming the policy. For node hygiene, the verification is to check the node image and Kubernetes versions against the current supported versions and confirm your upgrade channels are applying patches on the cadence you set. Each of these is a small test, and run together on a schedule they convert your security posture from a set of intentions into a set of demonstrated facts. The VaultBook labs are built around exactly this kind of verification loop, where you apply a control and then prove it from the side that matters, which is the habit that separates a cluster that is secured from one that is merely configured.
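Part of the secrets check can be run against the pod specs themselves. This sketch flags every place a pod spec pulls from a Kubernetes Secret; the function name is hypothetical. It will not notice a credential pasted in as a literal environment value, and Secrets deliberately synced by the Secrets Store CSI driver will also appear, so read the output as a list of things to explain rather than a list of faults.

```python
def secret_references(pod_spec: dict) -> list:
    """List every Kubernetes Secret a pod spec mounts or injects."""
    findings = []
    for volume in pod_spec.get("volumes", []):
        if "secret" in volume:
            findings.append(f"volume {volume.get('name')} mounts Secret "
                            f"{volume['secret'].get('secretName')}")
    for container in pod_spec.get("containers", []):
        name = container.get("name")
        for env in container.get("env", []):
            ref = (env.get("valueFrom") or {}).get("secretKeyRef")
            if ref:
                findings.append(f"container {name} env {env.get('name')} reads Secret "
                                f"{ref.get('name')}")
        for source in container.get("envFrom", []):
            if "secretRef" in source:
                findings.append(f"container {name} imports all keys of Secret "
                                f"{source['secretRef'].get('name')}")
    return findings


if __name__ == "__main__":
    spec = {
        "volumes": [{"name": "creds", "secret": {"secretName": "db-creds"}}],
        "containers": [{
            "name": "web",
            "env": [{"name": "API_KEY",
                     "valueFrom": {"secretKeyRef": {"name": "partner", "key": "key"}}}],
        }],
    }
    for finding in secret_references(spec):
        print(finding)
```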
Making it auditable and repeatable: hardening as code
A cluster hardened by hand is hardened until someone changes it, and nobody can prove what state it is in without logging into it. The final discipline is to make the entire posture declarative and auditable, so the secure configuration is defined as code, applied by automation, and reported on continuously. This is what turns a one-time hardening effort into a posture that survives team changes, cluster rebuilds, and the slow entropy of manual edits.
Express the cluster’s security configuration as infrastructure as code. The cluster itself, with private-cluster mode, the network policy engine, Entra integration with Azure RBAC, the OIDC issuer and workload identity enabled, and the node configuration, should be defined in a Bicep or Terraform definition that is version-controlled and reviewed, so the secure settings are not flags someone remembered to set but properties of the committed definition. The Kubernetes RBAC roles and bindings, the network policies, and the service account configurations should live as manifests in a repository and reach the cluster through a pipeline or a GitOps controller, so the in-cluster authorization and isolation state is reviewable in version control and any drift from it is visible and reversible. The admission policies should be Azure Policy assignments at the subscription or resource-group scope, so they apply to every cluster in scope by default and a new cluster inherits the rules rather than starting permissive.
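Once the desired settings live in a committed definition, drift becomes a comparison. The sketch below compares a flat map of desired settings against a nested description of the live cluster, such as the JSON from `az aks show`. The dotted key paths are assumptions chosen for illustration and should be confirmed against the real output before use.

```python
def lookup(config: dict, dotted_path: str):
    """Follow 'a.b.c' through nested dicts; return None if any step is missing."""
    node = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def drift(desired: dict, actual: dict) -> dict:
    """Return every desired setting whose live value differs or is absent."""
    return {
        path: {"want": want, "have": lookup(actual, path)}
        for path, want in desired.items()
        if lookup(actual, path) != want
    }


if __name__ == "__main__":
    # Assumed key paths and values, for illustration only.
    desired = {
        "apiServerAccessProfile.enablePrivateCluster": True,
        "networkProfile.networkPolicy": "cilium",
        "disableLocalAccounts": True,
    }
    actual = {
        "apiServerAccessProfile": {"enablePrivateCluster": False},
        "networkProfile": {"networkPolicy": "cilium"},
    }
    for path, values in drift(desired, actual).items():
        print(path, values)
```

A setting missing from the live description is reported as drift rather than skipped, which is the safer failure mode for a security baseline.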
Layer continuous reporting on top. Microsoft Defender for Containers provides ongoing assessment of the cluster’s security posture, surfacing misconfigurations, vulnerable images, and risky settings against a baseline, and Azure Policy in audit mode reports every workload that violates a rule even where you have not yet switched it to deny. Send the cluster’s audit logs, the Kubernetes API audit log and the Azure activity log for the cluster resource, to a Log Analytics workspace so that who did what, on both planes, is queryable and retained. The goal is that at any moment you can answer three questions from evidence rather than memory: what is the cluster’s configuration, who has what access on both planes, and what has anyone done recently. A posture you can answer those questions about is a posture you can defend in an audit and recover after an incident, and a posture you cannot is one you are only hoping is intact.
The repeatability payoff is concrete. When the secure configuration is code, standing up a second cluster, recovering from a region failure, or onboarding a new team’s namespace is a matter of applying the same reviewed definitions, not re-deriving the hardening from memory and hoping nothing was missed. The cluster you can rebuild identically is the cluster whose security you actually control.
The InsightCrunch AKS hardening checklist
This is the reference artifact for this guide: the InsightCrunch AKS hardening checklist, organized by surface, with the control and the reason at each step. Work it top to bottom on a new cluster, and run it as a review against an existing one. The reason column matters as much as the control, because a control you apply without understanding is a control you will disable the first time it inconveniences someone.
| Surface | Control | Why it matters |
|---|---|---|
| API server | Provision as a private cluster, or apply authorized IP ranges where private is not feasible | Removes or shrinks public exposure of the control plane’s entry point, the most sensitive endpoint |
| API server | Reach the API server through a VNet path (jump host, VPN, self-hosted CI agent) | Keeps control-plane access on a known, auditable network path rather than the open internet |
| Network | Choose a network policy engine at creation (Azure NPM, Calico, or Cilium dataplane) | The engine cannot be retrofitted without disruption, so the decision must precede provisioning |
| Network | Apply a default-deny ingress and egress baseline per namespace | Inverts the cluster from allow-all to deny-all, so only named connections flow |
| Network | Add explicit allow policies for the real communication graph, including DNS egress | Restores exactly the traffic the application needs and nothing more, including resolution |
| Network | Use network security groups at the node subnet as the coarse VNet boundary | Complements pod-level policy with subnet-level isolation; the two are not substitutes |
| Azure RBAC | Scope cluster-credential and management roles to the people and pipelines that need them | Controls who can obtain a kubeconfig at all, the Azure-plane gate on cluster access |
| Kubernetes RBAC | Enable Entra integration with Azure RBAC for Kubernetes authorization | Unifies identity to Entra and manages in-cluster authority through reviewable Azure assignments |
| Kubernetes RBAC | Use namespaced Roles by default; reserve ClusterRoles; bind cluster-admin to no one | Keeps the blast radius of any single identity to the narrowest scope that works |
| Kubernetes RBAC | Set automountServiceAccountToken to false for pods that do not call the API | Denies a compromised pod a usable API token, the credential an attacker inherits first |
| Kubernetes RBAC | Give API-calling pods dedicated service accounts with minimal Roles | Makes the token a pod compromise yields nearly worthless |
| Secrets | Store credentials in Azure Key Vault, not in Kubernetes Secrets | A Kubernetes Secret is base64, not encrypted at the access layer; Key Vault adds audit and rotation |
| Secrets | Grant pods access through workload identity federated with Entra | Removes stored credentials entirely; pods authenticate with short-lived, scoped tokens |
| Secrets | Mount Key Vault secrets through the Secrets Store CSI driver where files are needed | Gives developers a mounted-secret experience without resting credentials in etcd |
| Images | Scan registry and running images with Defender for Containers | Detects known vulnerabilities in the images you actually run |
| Images | Enforce admission rules with Azure Policy: registry allowlist, non-root, no privileged | Stops non-compliant or untrusted images from scheduling, closing the supply-chain entry |
| Images | Roll admission policy out in audit mode, then switch to deny | Introduces enforcement without breaking existing workloads on the first day |
| Nodes | Enable automatic upgrade channels for node OS patches and Kubernetes versions | Keeps the cluster on supported, patched components as it ages |
| Nodes | Use a minimal hardened node OS and disable node SSH | Shrinks the post-breakout surface and removes a rarely needed management path |
| Posture | Define the cluster and its RBAC, network, and admission config as version-controlled code | Makes the secure state reproducible, reviewable, and recoverable rather than manual |
| Posture | Send API and activity audit logs to Log Analytics; run Defender posture assessment | Lets you answer what the config is, who has access, and what was done, from evidence |
The checklist is deliberately ordered so that the surfaces an attacker reaches first, the API server and the network, come before the ones a compromise reaches later, and the posture rows at the end are what keep the whole thing from drifting. A cluster that clears every row is not invulnerable, because nothing is, but it is a cluster where every breach is small and every control is something you can prove rather than something you hope is true.
When are the defaults acceptable?
It would be dishonest to imply that every cluster must adopt every control on day one regardless of context, so it is worth stating where a lighter posture is a defensible engineering choice rather than negligence. A throwaway development cluster with no real data, no production credentials, and a short life can reasonably run with the public API server behind authorized IP ranges rather than a full private cluster, because the operational cost of private networking is real and the data at risk is nil. A single-team cluster running only that team’s own workloads, with no cross-tenant isolation requirement, has less need for aggressive namespace network segmentation than a shared platform cluster hosting many teams, though even there an egress default-deny is cheap insurance against exfiltration.
The honest line is that the controls scale with the stakes, and the mistake is not choosing a lighter posture deliberately for a low-stakes cluster, it is carrying a low-stakes posture into a high-stakes cluster by inertia. The cluster that started as a prototype and quietly became the thing running customer data is the dangerous case, because the posture never got revisited when the stakes changed. The discipline that prevents it is to tie the posture to a classification of what the cluster holds, and to re-run the hardening checklist whenever that classification rises. A control you skipped knowingly for a prototype is fine; a control you skipped and forgot when the prototype went to production is the incident waiting to happen.
There is also a counter-reading worth engaging directly, because it sounds reasonable and is wrong. The argument goes: the control plane is managed by Microsoft, etcd is encrypted, the platform is hardened, so the heavy in-cluster work is redundant. The flaw is that platform hardening protects the parts Microsoft owns, and every surface in this guide is a part you own. The managed control plane does not write your network policies, scope your Kubernetes RBAC, choose where your secrets live, or decide which images are admitted. Trusting the platform for the platform’s responsibilities is correct; extending that trust to your own responsibilities is the exact mistake the shared-responsibility model exists to prevent.
Runtime detection: catching what prevention misses
Every control to this point is preventive, and prevention has a ceiling. Admission policy stops the pod you defined a rule against; it does nothing about the legitimate pod that gets compromised at runtime through an application vulnerability. Network policy constrains the connections you forbade; a clever attacker works within the allowed graph. The posture is incomplete without detection, which assumes a breach will eventually happen and aims to notice it while it is small.
Microsoft Defender for Containers provides the runtime threat detection layer for AKS, and it watches the cluster’s behavior for the signatures of an attack in progress: a shell spawned inside a container that should never run one, a process reaching for the service account token, a connection to a known malicious endpoint, a privilege escalation attempt, a crypto-mining pattern, an exec into a pod from an unusual source. These alerts fire on activity, not configuration, so they catch the compromise that slipped past every preventive gate. The value is in the timeline: an attacker who lands in a pod and starts probing produces detectable behavior before they reach anything important, and an alert at that moment is the difference between a contained incident and a breach report.
Detection only matters if someone sees the alert and can act, which loops back to the auditability discipline. Route Defender alerts and the Kubernetes audit log to the same place your security team already watches, tune out the noise so the real signals are not buried, and rehearse the response so that an alert about a suspicious exec into a payments pod triggers a known sequence rather than a scramble. The audit log is also your forensic record: when something does happen, the question “what did this identity do, on which plane, when” is answerable only if you captured the logs before you needed them. Detection without logging is an alarm with no recording, and logging without detection is a recording nobody plays until it is too late; you want both, feeding one place, watched by people who know what to do.
The relationship between the preventive controls and detection is not redundancy, it is depth. Prevention shrinks the set of things that can go wrong; detection catches the ones that do anyway; least privilege and segmentation ensure that what does go wrong stays small while detection and response catch up. A cluster with all three is one where an attacker has to be right at every step and you only have to be right once, which is the inversion of the usual attacker advantage and the entire point of layered security.
Isolating tenants and teams in a shared cluster
Many clusters are not single-team. A platform team runs one cluster and many product teams deploy onto it, and that multi-tenancy raises the stakes on every surface because a weakness now exposes one team’s workloads to another’s. The controls do not change, but their application gets stricter, because the boundary you are defending is internal as well as external.
The namespace is the primary tenancy boundary in Kubernetes, and on a shared cluster it has to be a real boundary rather than an organizational label. That means each tenant namespace gets its own network policy baseline that denies traffic from other namespaces by default, so one team’s pods cannot reach another’s. It means Kubernetes RBAC is scoped per namespace so a team’s developers hold roles only in their own namespace and cannot read another team’s secrets or exec into another team’s pods. It means resource quotas per namespace so one team cannot starve the cluster and cause a denial of service to the others, which is a security property as much as an operational one. And it means admission policy applies uniformly so no team can schedule a privileged pod that, by breaking out to a shared node, would reach every tenant’s workloads on that node.
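Two of those per-namespace boundaries, the cross-namespace network deny and the resource quota, can be stamped out identically for every tenant. A minimal sketch, with hypothetical object names and placeholder quota values:

```python
def tenant_baseline(namespace: str, cpu: str = "8", memory: str = "16Gi") -> list:
    """Return the baseline manifests applied to every tenant namespace."""
    same_namespace_only = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A bare podSelector in "from" matches pods in this namespace only,
            # so traffic from every other namespace is denied.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": namespace},
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}},
    }
    return [same_namespace_only, quota]


if __name__ == "__main__":
    for manifest in tenant_baseline("team-a"):  # "team-a" is a hypothetical tenant
        print(manifest["kind"], manifest["metadata"])
```

Traffic that legitimately crosses the boundary, such as requests from an ingress controller running in its own namespace, needs an explicit additional allow; the baseline intentionally leaves it out.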
The hardest part of shared-cluster isolation is the shared node. Pods from different tenants scheduled onto the same node share a kernel, and a container breakout on a shared node crosses the tenancy boundary regardless of how good your namespace network policies are. Where the tenancy boundary must be strong, the answer is to separate tenants at the node-pool level, using node pools and scheduling rules so that a given tenant’s pods only ever land on that tenant’s nodes, which contains a breakout to one tenant’s blast radius. For the strongest isolation requirements, separate clusters per tenant remove the shared kernel entirely, at the cost of more clusters to operate. The decision is a trade between operational overhead and isolation strength, and the right point on it depends on how much you would lose if one tenant’s breach reached another, which is a question only the workload’s owner can answer honestly.
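Node-pool separation comes down to two scheduling fields on the pod. The sketch below assumes each tenant's node pool carries a label `tenant=<name>` and a matching `NoSchedule` taint. That naming convention is hypothetical, and the pool itself has to be created with those labels and taints for the pinning to mean anything.

```python
import copy


def pin_to_tenant_pool(pod_spec: dict, tenant: str) -> dict:
    """Return a copy of the pod spec that can only land on the tenant's nodes."""
    pinned = copy.deepcopy(pod_spec)
    # The nodeSelector keeps this tenant's pods on its own pool...
    pinned.setdefault("nodeSelector", {})["tenant"] = tenant
    # ...and the toleration lets them past the taint that keeps everyone else off.
    pinned.setdefault("tolerations", []).append(
        {"key": "tenant", "operator": "Equal", "value": tenant, "effect": "NoSchedule"}
    )
    return pinned


if __name__ == "__main__":
    spec = {"containers": [{"name": "web", "image": "contoso.azurecr.io/web:1.4"}]}
    print(pin_to_tenant_pool(spec, "team-a"))
```

Both halves are needed: the taint alone keeps other tenants off the pool but does not stop this tenant's pods from landing elsewhere, and the selector alone does the reverse.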
Operating a private cluster without losing velocity
The most common objection to the strongest controls is that they slow the team down, and the private cluster is where that objection bites hardest, because removing the public API endpoint changes how everyone reaches the cluster. The objection is real and the answer is not to abandon the control but to build the access path deliberately so that security and velocity are not at odds.
The team needs three things to work comfortably against a private cluster, and providing them well is the difference between a private cluster people accept and one they route around. First, a reliable network path into the VNet for human operators: a VPN or a bastion host that engineers can rely on, configured so that connecting is a routine, fast step rather than a daily fight. Second, CI and deployment pipelines that run from inside the network, through self-hosted agents in the VNet or a managed path that can reach private endpoints, so deployments do not require a human to tunnel anything. Third, good defaults in tooling, so that the kubeconfig, the DNS resolution to the private endpoint, and the network route are set up once and just work, rather than being a per-person configuration burden that some people get wrong.
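The third item is easy to check mechanically. As a small sketch, the helper below resolves an API server hostname and reports whether every address it resolves to is private, which is what a correctly configured workstation or CI agent inside the VNet should see. The hostname in the usage comment is a made-up placeholder, and the function names are hypothetical.

```python
import ipaddress
import socket


def resolve(hostname: str, port: int = 443) -> list[str]:
    """Resolve a hostname to its IP addresses using the local resolver."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # sockaddr[0] is the address; strip any IPv6 scope suffix such as %eth0.
    return sorted({info[4][0].split("%")[0] for info in infos})


def resolves_privately(addresses: list[str]) -> bool:
    """True only if there is at least one address and all of them are private."""
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )


# Pure check on sample addresses, with no network access needed.
print(resolves_privately(["10.240.0.4"]))
print(resolves_privately(["10.240.0.4", "20.50.1.2"]))

# From inside the VNet, with a placeholder hostname:
# resolves_privately(resolve("mycluster-abc123.privatelink.example.azmk8s.io"))
```

Run as part of onboarding or a pipeline preflight, a check like this turns "DNS is set up correctly" from a per-person assumption into something observable.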
When those three are in place, a private cluster costs the team almost nothing in daily friction while removing the single largest piece of public attack surface, and that is the trade you want. When they are absent, people find the friction unbearable and quietly recreate the exposure, granting themselves broad access or standing up a public endpoint “just for now,” which undoes the control through the side door. The lesson generalizes beyond the private cluster: a security control that the team experiences as a wall to climb will be climbed around, and a control built into a smooth path will be used. Investing in the access path is not separate from the security work; it is what makes the security work stick.
The verdict: secure both planes, isolate the network, prove it
Securing an AKS cluster comes down to refusing the defaults’ implied promise that a running cluster is a safe one. The defaults give you a public control plane, a flat network, a permissive in-cluster authorization model, secrets stored as encoded text, and an open door for any image, and each of those is a deliberate choice toward convenience that you have to deliberately reverse toward safety. The two-RBAC-and-isolation rule is the spine of the effort: secure the Azure plane and the Kubernetes plane as the separate access surfaces they are, and turn on pod isolation, because those are the surfaces the defaults leave widest and the ones a real attacker exploits first.
If you do nothing else, do these in order: make the API server private or IP-restricted, apply a default-deny network policy baseline with a chosen engine, scope both RBAC planes to least privilege with Entra integration, move secrets to Key Vault behind workload identity, and enforce a registry allowlist with non-root admission rules. Then make the whole configuration code, send the audit logs somewhere watched, turn on runtime detection, and verify each control from the attacker’s side rather than trusting your own configuration. The cluster that results is not unbreakable, because that cluster does not exist, but it is one where every breach is contained, every access is scoped, and every control is something you can demonstrate. That is what secured actually means, and it is reachable for any team willing to treat the defaults as a starting point rather than a finish line. The VaultBook AKS labs are where to practice each step against a real cluster and watch the controls enforce, which is how the checklist stops being a document and becomes a habit.
Frequently asked questions
How do I secure an AKS cluster?
Treat the cluster as five separate surfaces and harden each. Lock the API server by making the cluster private or applying authorized IP ranges. Isolate the pod network with a default-deny policy baseline enforced by a network policy engine you chose at creation. Scope both Azure RBAC and Kubernetes RBAC to least privilege, ideally through Entra integration with Azure RBAC for Kubernetes authorization. Move credentials out of Kubernetes Secrets into Key Vault behind workload identity. Enforce image admission rules such as a registry allowlist and non-root containers, backed by vulnerability scanning. Then keep nodes and Kubernetes versions patched through automatic upgrade channels, define the whole posture as code, send audit logs to a watched workspace, and verify each control from the attacker’s side rather than trusting your configuration. The organizing rule is to secure both authorization planes and isolate the network, because those are the surfaces the defaults leave widest open.
How do network policies isolate pods?
A network policy is a Kubernetes object that selects pods by label and declares which ingress traffic those pods may receive and which egress traffic they may send. The key mechanic is that the moment any policy selects a pod for a direction, that pod flips to default-deny for that direction, so only explicitly allowed traffic flows. Policies are additive, so you combine a default-deny baseline with specific allow rules to build the application’s real communication graph edge by edge. Enforcement requires a policy engine, which on AKS means Azure Network Policy Manager, Calico, or the Cilium-based dataplane, chosen at cluster creation because it cannot be retrofitted cleanly. Without an engine, the policy objects exist but enforce nothing. The disciplined pattern is to apply a default-deny for both ingress and egress per namespace, then allow only the connections the application genuinely needs, remembering to allow DNS egress to CoreDNS so name resolution still works.
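The disciplined pattern can be sketched as two objects per namespace. The example below builds a default-deny policy for both directions and an additive rule allowing DNS egress. It assumes CoreDNS runs in `kube-system` with the label `k8s-app: kube-dns`, which is worth confirming on your own cluster, and the policy names and helper functions are illustrative.

```python
import json


def default_deny(namespace: str) -> dict:
    """Select every pod and list both directions with no allow rules."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        # Listing a policy type without rules for it denies that direction.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_dns_egress(namespace: str) -> dict:
    """Additive rule: every pod may reach cluster DNS on port 53."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [
                {
                    # Both selectors in one entry: DNS pods in kube-system only.
                    "to": [
                        {
                            "namespaceSelector": {
                                "matchLabels": {"kubernetes.io/metadata.name": "kube-system"}
                            },
                            # Assumed CoreDNS pod label; verify before relying on it.
                            "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
                        }
                    ],
                    "ports": [
                        {"protocol": "UDP", "port": 53},
                        {"protocol": "TCP", "port": 53},
                    ],
                }
            ],
        },
    }


for policy in (default_deny("payments"), allow_dns_egress("payments")):
    print(json.dumps(policy, indent=2))
```

Each further edge in the application’s communication graph becomes one more small allow policy layered on top of these two.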
How do Azure RBAC and Kubernetes RBAC secure access?
They secure two different planes. Azure RBAC governs Azure-level operations on the cluster resource through Azure role assignments: who can retrieve the cluster’s credentials, scale node pools, or manage the AKS resource. Kubernetes RBAC governs in-cluster operations through Roles, ClusterRoles, and their bindings: who can list pods, read secrets, deploy workloads, or exec into containers once they hold a kubeconfig. The two are independent, so a tight Azure assignment protects nothing inside the cluster, and a loose Kubernetes role undermines an otherwise careful Azure setup. The hand-off is precise: Azure RBAC decides whether someone can obtain a working kubeconfig, and Kubernetes RBAC decides what they can do with it. The cleanest configuration is Entra integration with Azure RBAC for Kubernetes authorization, which gives both planes a single identity source and one reviewable assignment surface, so you are not maintaining separate Kubernetes-only accounts alongside your Entra identities.
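On the Kubernetes side, a namespace-scoped grant might look like the following sketch: a Role that can manage deployments and read pods but cannot read secrets or exec, bound to a group. It assumes Entra integration with Kubernetes RBAC, where the group subject is the Entra group’s object ID. The ID shown is a placeholder, and the role name, rule set, and helper function are illustrative choices.

```python
import json


def namespace_developer(namespace: str, group_object_id: str) -> tuple[dict, dict]:
    """Build a least-privilege Role and a RoleBinding for one team's namespace."""
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "developer", "namespace": namespace},
        "rules": [
            # Read-only view of pods and their logs; no secrets, no pods/exec.
            {"apiGroups": [""], "resources": ["pods", "pods/log"], "verbs": ["get", "list", "watch"]},
            {
                "apiGroups": ["apps"],
                "resources": ["deployments"],
                "verbs": ["get", "list", "watch", "create", "update", "patch"],
            },
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "developer-binding", "namespace": namespace},
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role", "name": "developer"},
        # Placeholder group object ID; a RoleBinding grants nothing outside its namespace.
        "subjects": [
            {"apiGroup": "rbac.authorization.k8s.io", "kind": "Group", "name": group_object_id}
        ],
    }
    return role, binding


role, binding = namespace_developer("team-a", "00000000-0000-0000-0000-000000000000")
print(json.dumps([role, binding], indent=2))
```

Using a RoleBinding rather than a ClusterRoleBinding is the design choice that matters here, because it is what confines the grant to one namespace.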
How do I handle secrets and workload identity securely?
Stop storing real credentials in Kubernetes Secrets, which are base64-encoded values readable by anyone with secret access in the namespace, and keep them in Azure Key Vault instead. Give pods access through workload identity, which federates a Kubernetes service account with a Microsoft Entra identity so a pod authenticates to Key Vault using a short-lived token tied to its service account, with no credential stored in the cluster. When your application needs secrets as files, mount them through the Secrets Store CSI driver with the Azure Key Vault provider, which fetches them at pod start using the workload identity. Exposing them as environment variables requires the driver’s optional sync into a Kubernetes Secret, which puts a copy back in the cluster, so prefer mounted files where the application allows it. This gives you Key Vault’s audit logging, rotation, versioning, and soft-delete, none of which a Kubernetes Secret offers, and it means a compromised pod can reach only the specific secrets its federated identity was granted, only while its token is valid. Treat the cluster as a consumer of secrets, never a store of them.
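The two in-cluster objects involved can be sketched as follows: a service account annotated with the client ID of the federated identity, and a SecretProviderClass naming the vault and the secrets to mount. The annotation key and parameter names follow the workload identity and Azure Key Vault provider conventions as best understood here and should be checked against the current provider documentation. Every ID, vault name, and secret name is a placeholder.

```python
import json


def workload_identity_objects(
    namespace: str, client_id: str, tenant_id: str, vault: str, secret_names: list[str]
) -> tuple[dict, dict]:
    """Build a ServiceAccount and SecretProviderClass for Key Vault access."""
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": "app-identity",
            "namespace": namespace,
            # Links this service account to the federated Entra identity.
            "annotations": {"azure.workload.identity/client-id": client_id},
        },
    }
    # The provider expects "objects" as a YAML string, one entry per secret.
    objects = "array:\n" + "".join(
        f"  - |\n    objectName: {name}\n    objectType: secret\n" for name in secret_names
    )
    provider_class = {
        "apiVersion": "secrets-store.csi.x-k8s.io/v1",
        "kind": "SecretProviderClass",
        "metadata": {"name": "app-keyvault", "namespace": namespace},
        "spec": {
            "provider": "azure",
            "parameters": {
                "usePodIdentity": "false",
                "clientID": client_id,
                "keyvaultName": vault,
                "tenantId": tenant_id,
                "objects": objects,
            },
        },
    }
    return service_account, provider_class


sa, spc = workload_identity_objects(
    "payments",
    client_id="11111111-1111-1111-1111-111111111111",
    tenant_id="22222222-2222-2222-2222-222222222222",
    vault="example-vault",
    secret_names=["db-password"],
)
print(json.dumps([sa, spc], indent=2))
```

Pods that use this identity would also need the `azure.workload.identity/use: "true"` label and a CSI volume referencing the SecretProviderClass. Neither object above contains a credential, which is the point of the pattern.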
Should I use a private cluster for the API server?
For any cluster handling sensitive or regulated workloads, yes. A private cluster removes the public API server endpoint entirely and exposes the API server only through a private endpoint inside your virtual network, so the control plane is reachable from your network and nowhere else. That eliminates internet-facing exposure of the most sensitive surface, the front door to orchestration. The cost is operational: kubectl, CI runners, and administrators must reach the API server from inside the VNet or a connected network, typically through a VPN, a bastion host, ExpressRoute, or self-hosted CI agents. Where a fully private cluster does not fit the operating model, authorized IP ranges are the intermediate control, keeping the public endpoint but accepting connections only from CIDR blocks you list. That shrinks exposure from the whole internet to a few known sources but is weaker than removing the endpoint. Choose the private cluster where you can and use authorized IP ranges where you genuinely cannot.
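To make the authorized IP range behaviour concrete, here is a small model of the allowlist decision: a connection is accepted only if its source address falls inside one of the listed CIDR blocks. This illustrates the rule and is not AKS code. The ranges come from documentation address space, and the function name is hypothetical.

```python
import ipaddress


def is_authorized(source_ip: str, authorized_ranges: list[str]) -> bool:
    """Return True if source_ip falls inside any of the listed CIDR blocks."""
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in authorized_ranges)


# Example allowlist: an office egress range and a single CI runner address.
ranges = ["203.0.113.0/24", "198.51.100.7/32"]

for ip in ("203.0.113.42", "198.51.100.7", "198.51.100.8", "192.0.2.1"):
    print(ip, "accepted" if is_authorized(ip, ranges) else "rejected")
```

The model also shows the limit of the control: anything that can originate traffic from an allowed range, including a compromised machine in the office, is still accepted, which is why removing the public endpoint is the stronger option.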
How do I enforce image scanning and admission policy?
Use two complementary mechanisms. For scanning, Microsoft Defender for Containers inspects images in your Azure Container Registry and the images running in the cluster, reporting known vulnerabilities so you can rebuild affected images. For admission, Azure Policy for AKS deploys the Open Policy Agent Gatekeeper controller and translates Azure Policy definitions into enforcement that rejects non-compliant pods before they schedule. The built-in policies cover the rules most teams want: restrict which registries images may come from, require non-root containers, disallow privileged pods and host namespace sharing, and enforce resource limits. The single highest-value rule is the registry allowlist, which cuts off arbitrary-image risk entirely. Roll any policy out in audit mode first to see what would fail, fix those workloads, then switch the policy to deny. Scanning without enforcement produces reports nobody acts on, and enforcement without scanning blocks on your rules but tells you nothing about vulnerabilities inside compliant images, so run both together.
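The audit-then-deny rollout is easier to reason about with a toy model. The sketch below is not Gatekeeper or Azure Policy. It is a hypothetical checker that applies a registry allowlist, a non-root rule, and a no-privileged rule to a pod, then either reports violations (audit) or rejects the pod (deny). The registry name is a placeholder.

```python
def image_registry(image: str) -> str:
    """Extract the registry host from an image reference (Docker naming convention)."""
    first, _, rest = image.partition("/")
    # The first component is a registry only if it looks like a host.
    if rest and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def evaluate(pod: dict, allowed_registries: set[str], mode: str = "audit") -> tuple[bool, list[str]]:
    """Return (admitted, violations). In audit mode violations never block."""
    violations = []
    pod_sc = pod["spec"].get("securityContext", {})
    for container in pod["spec"]["containers"]:
        name = container["name"]
        registry = image_registry(container["image"])
        if registry not in allowed_registries:
            violations.append(f"{name}: registry {registry} is not on the allowlist")
        sc = container.get("securityContext", {})
        if sc.get("privileged", False):
            violations.append(f"{name}: privileged containers are not allowed")
        # The container-level setting overrides the pod-level one when present.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            violations.append(f"{name}: must set runAsNonRoot")
    admitted = mode == "audit" or not violations
    return admitted, violations


pod = {
    "spec": {
        "containers": [
            {"name": "web", "image": "nginx:latest", "securityContext": {"privileged": True}}
        ]
    }
}
allowed = {"contoso.azurecr.io"}
print(evaluate(pod, allowed, mode="audit"))
print(evaluate(pod, allowed, mode="deny"))
```

In audit mode the pod is admitted and the violations form the fix list. Switching the same rules to deny rejects the pod until all three are resolved.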
Is the Kubernetes control plane secure because Microsoft manages it?
Microsoft secures the part it owns, which is the managed control plane: the API server software, the scheduler, the controller manager, and the etcd datastore, including encrypting etcd at rest. That is real and valuable, but it covers only Microsoft’s side of the shared-responsibility line. Every surface that determines whether your cluster is secure sits on your side: your network policies, your two RBAC planes, where your secrets live, which images are admitted, and how current your nodes are. The managed control plane does not write a network policy, scope a Kubernetes role, or reject an untrusted image. Treating the platform’s hardening as covering your responsibilities is the exact failure the shared-responsibility model warns against. Trust Microsoft for the control-plane internals and the platform’s own patching, and own everything above that boundary yourself, because the defaults on your side are tuned for getting started, not for safety.
What is the difference between a Kubernetes Secret and Azure Key Vault?
A Kubernetes Secret stores data as base64-encoded values in etcd. Base64 is an encoding, not encryption, so anyone who can read the secret object decodes it instantly, and read access is governed only by Kubernetes RBAC, which on many clusters is looser than people assume. While AKS encrypts etcd at rest on the platform, that protects the disk, not the access path. A Kubernetes Secret offers no read auditing, no rotation, and no recovery once deleted. Azure Key Vault is a dedicated secrets platform with access control through Azure RBAC or access policies, full audit logging of every read, managed rotation, versioning, and soft-delete with purge protection. The secure pattern combines them through workload identity so the cluster consumes secrets from Key Vault without storing them, and the Secrets Store CSI driver bridges Key Vault values into pods as mounted files when applications expect that. The principle is that the cluster should hold nothing it would hurt to lose.
Do I need network policy if I already have network security groups?
Yes, because they operate at different layers and neither substitutes for the other. Network security groups work at the Azure VNet layer, governing traffic at the subnet and network-interface level, including into and out of the node subnet. Network policies work inside the cluster at the pod level, governing which pods may talk to which. An NSG can keep the node subnet from reaching the wider network, but it cannot stop one pod in the cluster from reaching another pod, because that traffic stays inside the cluster’s pod network and never crosses the subnet boundary the NSG sees. A complete posture uses both: the NSG as the coarse VNet-level boundary around the nodes, and network policy as the fine pod-level segmentation that prevents lateral movement among workloads. Relying on the NSG alone leaves the cluster’s internal network flat, which is exactly the lateral-movement risk pod-level policy exists to close.
How do I verify that my RBAC is actually least privilege?
Test it from the perspective of each identity rather than reading the role definitions back to yourself. The kubectl auth can-i command answers directly: ask whether a given subject can perform a given verb on a given resource in a given namespace, and confirm the answer matches what you intended. Run it for the dangerous cases, such as whether a read-only viewer can read secrets or exec into pods, and whether a developer scoped to one namespace can act in another. For a stronger test, acquire or impersonate the identity’s token and attempt the forbidden action, confirming the API server returns a forbidden response. Review every ClusterRoleBinding specifically, because those grant their role in every namespace, and confirm that cluster-admin is bound to no standing identity. Check that application pods either mount no service account token or mount one bound to a minimal role. The test for any role is what an attacker could do if that identity were fully compromised, and whether that blast radius is the smallest that still lets the work happen.
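The dangerous cases can be scripted so the check is repeatable. This sketch builds `kubectl auth can-i` invocations that impersonate a subject with `--as` and runs them on request. The users, namespaces, and helper names are placeholders, and impersonation only works if the caller running the script has been granted impersonate permission.

```python
import subprocess
from typing import Optional


def can_i_command(
    verb: str, resource: str, namespace: str, as_user: str, subresource: Optional[str] = None
) -> list[str]:
    """Build the kubectl argv for one impersonated authorization check."""
    cmd = ["kubectl", "auth", "can-i", verb, resource, "--namespace", namespace, "--as", as_user]
    if subresource:
        cmd += ["--subresource", subresource]
    return cmd


def is_allowed(cmd: list[str]) -> bool:
    """Run the check; kubectl prints 'yes' or 'no' on stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout.strip() == "yes"


# Cases that should all come back "no" on a least-privilege cluster.
SHOULD_BE_DENIED = [
    can_i_command("get", "secrets", "team-a", "viewer@example.com"),
    can_i_command("create", "pods", "team-a", "viewer@example.com", subresource="exec"),
    can_i_command("list", "pods", "team-b", "team-a-dev@example.com"),
]

for cmd in SHOULD_BE_DENIED:
    # Printed rather than executed here; call is_allowed(cmd) against a real cluster.
    print(" ".join(cmd))
```

Keeping the denied cases in a list like this lets the same checks run after every RBAC change, so a binding added to unblock someone shows up as a failing case rather than a surprise.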
Can a compromised pod take over the whole cluster?
It can if the surfaces around it are at their defaults, and it should be able to do almost nothing if they are hardened. The escalation path runs through the surfaces in this guide. If the pod mounts a service account token bound to a broad role, the attacker uses it against the API server. If the network is flat, the attacker moves laterally to other services. If secrets sit in the namespace as Kubernetes Secrets, the attacker reads them. If the pod runs privileged or with host access, the attacker breaks out to the node and reaches every pod on it. Hardening closes each step: no mounted token or a minimal one, a default-deny network, secrets in Key Vault behind an identity the pod was not granted, and non-root unprivileged containers blocked from host access by admission policy. The design goal is not to make the initial compromise impossible, which no control guarantees, but to ensure that a compromised pod inherits almost nothing useful and goes nowhere.
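Two of those hardening steps live in the pod spec itself. As a sketch, the helper below takes a pod manifest and returns a copy with the service account token unmounted, host namespaces off, and each container set to run non-root and unprivileged. The function name is hypothetical, and a real workload may need some of these relaxed deliberately.

```python
import copy
import json


def harden(pod: dict) -> dict:
    """Return a copy of a pod manifest with the baseline hardening fields set."""
    hardened = copy.deepcopy(pod)
    spec = hardened["spec"]
    # No token mounted means nothing to replay against the API server.
    spec["automountServiceAccountToken"] = False
    # No sharing of the node's network, process, or IPC namespaces.
    for field in ("hostNetwork", "hostPID", "hostIPC"):
        spec[field] = False
    for container in spec.get("initContainers", []) + spec["containers"]:
        sc = container.setdefault("securityContext", {})
        sc["runAsNonRoot"] = True
        sc["privileged"] = False
        sc["allowPrivilegeEscalation"] = False
    return hardened


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "api"},
    "spec": {
        "hostNetwork": True,
        "containers": [
            {"name": "api", "image": "contoso.azurecr.io/api:1.0", "securityContext": {"privileged": True}}
        ],
    },
}
print(json.dumps(harden(pod)["spec"], indent=2))
```

Setting these fields in the manifest is the cooperative half. Admission policy is what stops a manifest that omits them, so the two belong together.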
What network policy engine should I choose for AKS?
The choice is between Azure Network Policy Manager, Calico, and the Cilium-based dataplane provided by Azure CNI powered by Cilium, and you must decide at cluster creation because retrofitting an engine is disruptive. Azure Network Policy Manager and Calico both enforce the standard Kubernetes NetworkPolicy object, selecting by pod label and namespace, which is enough for the default-deny baseline and allow rules most clusters need. The Cilium dataplane adds richer, identity-aware policy and applies enforcement efficiently at scale, which matters for large clusters and for teams that want policy expressed beyond the standard object. For a new cluster without a specific reason to prefer otherwise, the Cilium-based dataplane is a strong default because it folds policy enforcement into the dataplane and scales well, but the more important decision is simply to pick any engine at creation rather than discovering later that your carefully written policies enforce nothing because no engine was ever enabled.
How do I keep an AKS cluster patched without manual work?
Use automatic upgrade channels, which AKS provides for two separate concerns. The node OS upgrade channel keeps the operating system on your nodes patched with the latest security fixes, replacing node images on a cadence without a manual cycle. The Kubernetes version channels, such as patch, stable, or rapid, move the cluster’s control plane and node pools to newer supported Kubernetes releases so you stay within the supported window where security fixes are issued. Pair the channels with maintenance windows so upgrades land when you expect them rather than mid-incident. The prerequisite is a cluster designed for nodes to be replaced freely: workloads that tolerate rescheduling, no dependence on local node state, and configuration applied declaratively, so an upgrade that recreates nodes does not break the application. A cluster you can upgrade safely is a cluster you can keep secure, because the security hygiene and the operational hygiene are the same discipline.
How do I isolate teams sharing one AKS cluster?
Make the namespace a real boundary rather than a label. Give each team’s namespace its own network policy baseline denying traffic from other namespaces, scope Kubernetes RBAC so a team holds roles only in its own namespace, set resource quotas so one team cannot starve the others, and apply admission policy uniformly so no team can schedule a privileged pod that breaks out to a shared node. The hard limit is the shared node: pods from different teams on the same node share a kernel, so a container breakout crosses the tenancy boundary regardless of namespace policy. Where the boundary must be strong, separate tenants at the node-pool level with scheduling rules so each team’s pods land only on that team’s nodes, which contains a breakout to one team’s blast radius. For the strictest isolation, separate clusters per team remove the shared kernel entirely. The choice trades operational overhead against isolation strength based on what a cross-tenant breach would cost.
What is the most common AKS security mistake?
Securing one access plane and assuming the cluster is locked down. Teams frequently configure Azure RBAC carefully, restricting who can fetch cluster credentials, and leave Kubernetes RBAC in a permissive state where an authenticated identity maps to broad in-cluster permissions, often through a ClusterRoleBinding to a high-privilege role added to unblock someone. The Azure side then looks audited and the real authority lives unconstrained inside the cluster. The closely related mistakes are leaving the network flat with no policy, storing real credentials as plain Kubernetes Secrets, and exposing the API server publicly. All of them share a root cause: there is no single toggle that secures a cluster, and the surfaces are independent, so the security of the whole is the security of the weakest one. The fix is to treat the two RBAC planes as separate, isolate the network, and work every surface rather than hardening one and declaring the cluster done.
Does enabling Microsoft Defender for Containers replace the other controls?
No. Defender for Containers adds vulnerability scanning of images and runtime threat detection that watches for attack behavior such as a suspicious shell, a token grab, or a connection to a malicious endpoint, and it provides ongoing posture assessment against a baseline. That detection layer is essential because prevention has a ceiling and a legitimate pod can be compromised at runtime, but detection assumes the preventive controls are already in place and catches what slips past them. It does not write your network policies, scope your RBAC, move your secrets to Key Vault, or make your cluster private. The correct mental model is depth: preventive controls shrink what can go wrong, least privilege and segmentation keep what does go wrong small, and detection catches the compromises that happen anyway while response contains them. Defender is the detection and assessment layer of that stack, valuable precisely because it complements the preventive surfaces rather than substituting for any of them.
How do I make my AKS security posture auditable and repeatable?
Define the entire posture as code and report on it continuously. Express the cluster itself, with private mode, the network policy engine, Entra integration, the OIDC issuer and workload identity, and the node configuration, in a version-controlled Bicep or Terraform definition so the secure settings are reviewable properties of a committed file rather than flags someone remembered. Keep the Kubernetes RBAC roles, network policies, and service account configurations as manifests applied through a pipeline or GitOps controller so in-cluster state is reviewable and drift is reversible. Assign admission policies through Azure Policy at the subscription or resource-group scope so new clusters inherit the rules. Then send the Kubernetes API audit log and the Azure activity log to a Log Analytics workspace and run Defender posture assessment, so at any moment you can answer from evidence what the configuration is, who has access on both planes, and what anyone did recently. A posture you can answer those questions about is one you can defend in an audit and rebuild identically after a failure.