how to secure ai agents in kubernetes production
the short answer
to secure ai agents in kubernetes production, give each agent its own least-privilege serviceaccount and rbac role, route every cluster-mutating call through an in-line approval proxy, and require a human to confirm destructive verbs (delete, drain, scale-to-zero, patch on production namespaces) before they reach the api server. the agent keeps its speed on read-only and safe writes; the dangerous 1% gets caught.
67% of respondents delayed or slowed a deployment over a security concern (Red Hat State of Kubernetes Security 2024)
ai agents are increasingly trusted to operate clusters: triaging crashlooping pods, scaling deployments, rolling restarts, cleaning up stale resources. that's genuinely useful. the problem is that a single hallucinated command, such as kubectl delete namespace payments or kubectl scale --replicas=0 on the wrong deployment, looks identical to a legitimate one at the api server and is only recognized as a mistake after it has executed. red hat's state of kubernetes security 2024 found that 67% of respondents had delayed or slowed a deployment because of a security concern, which shows how cautious teams already are about anything that touches the cluster.
1. scope rbac to the agent, not to a human
start from zero. create a dedicated serviceaccount per agent and bind it to a role that grants only the verbs and resources it actually needs. an agent that reads pod logs and restarts deployments does not need delete on secrets, namespaces, or persistentvolumes. avoid cluster-admin entirely. if the agent works across namespaces, prefer several narrow rolebindings over one broad clusterrolebinding. this is the same access-control discipline we cover in our guide to ai agent access control for devops and sre teams — kubernetes just makes the blast radius bigger.
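as a rough illustration of that scoping, the sketch below builds a serviceaccount, role, and rolebinding for one agent per namespace and prints them as json that kubectl can apply. the agent name (triage-agent), the namespaces (checkout, search), and the exact rule set are assumptions chosen for the example, not a recommended policy for your cluster.

```python
import json


def agent_rbac(agent: str, namespace: str) -> list[dict]:
    """Build a ServiceAccount, Role, and RoleBinding for one agent in one namespace."""
    sa_name = f"{agent}-sa"
    role_name = f"{agent}-role"
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": sa_name, "namespace": namespace},
    }
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            # Read pods in this namespace.
            {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]},
            # Read pod logs (a subresource that needs its own grant).
            {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
            # Read and patch deployments; a rollout restart is a patch.
            # No delete verb and no access to secrets anywhere in this role.
            {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["get", "list", "patch"]},
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{agent}-binding", "namespace": namespace},
        "subjects": [{"kind": "ServiceAccount", "name": sa_name, "namespace": namespace}],
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role", "name": role_name},
    }
    return [service_account, role, binding]


def as_list(objects: list[dict]) -> dict:
    """Wrap objects in a v1 List so one document can be applied in a single call."""
    return {"apiVersion": "v1", "kind": "List", "items": objects}


if __name__ == "__main__":
    objects = []
    # Several narrow per-namespace bindings instead of one ClusterRoleBinding.
    for ns in ("checkout", "search"):  # hypothetical namespaces
        objects.extend(agent_rbac("triage-agent", ns))  # hypothetical agent name
    print(json.dumps(as_list(objects), indent=2))
```

if saved as agent_rbac.py (a hypothetical filename), the output can be piped to kubectl apply -f -, since kubectl accepts json as well as yaml. adding a namespace means adding one more narrow role and binding, which keeps the grant easy to review.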
2. intercept destructive verbs in-line
rbac decides what is possible; it can't decide what is wise in this specific moment. that's where an interception layer earns its place. point the agent's kube client (or the tool that wraps kubectl) at an agent.shield proxy instead of the raw api server. safe, read-only traffic forwards untouched. anything matching a destructive policy is held and surfaced for review.
- delete on deployments, statefulsets, namespaces, pvcs, and secrets
- scale to zero or drastic replica reductions on production namespaces
- drain or cordon on nodes (at the api level, a patch that sets spec.unschedulable on the node, plus evictions of its pods)
- patch or apply that changes resource limits or image tags in prod
- anything touching a namespace you've labelled protected
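the policy list above can be sketched as a small decision function. this is only an illustration of the idea, not agent.shield's actual rule engine or configuration format: the Call type, the namespace sets, and the reason strings are all hypothetical, and the static PROTECTED_NAMESPACES set stands in for a real namespace-label lookup.

```python
from dataclasses import dataclass, field

# Hypothetical values for illustration only.
HELD_DELETE_RESOURCES = {"deployments", "statefulsets", "namespaces",
                         "persistentvolumeclaims", "secrets"}
PRODUCTION_NAMESPACES = {"prod", "payments"}
PROTECTED_NAMESPACES = {"kube-system"}  # stands in for a "protected" label lookup
READ_VERBS = {"get", "list", "watch"}


@dataclass
class Call:
    verb: str                 # Kubernetes API verb, e.g. get, patch, delete
    resource: str             # plural resource name, e.g. deployments
    namespace: str = ""       # empty for cluster-scoped resources
    subresource: str = ""     # e.g. scale, eviction
    body: dict = field(default_factory=dict)  # assumes a JSON object body


def touches_image_or_limits(body: dict) -> bool:
    """True if a patch body mentions a container image or resource limits."""
    pod_spec = body.get("spec", {}).get("template", {}).get("spec", {})
    return any("image" in c or "limits" in c.get("resources", {})
               for c in pod_spec.get("containers", []))


def decide(call: Call) -> tuple[str, str]:
    """Return ("forward" or "hold", reason) for one API call."""
    if call.verb in READ_VERBS:
        return "forward", "read-only"
    if call.namespace in PROTECTED_NAMESPACES:
        return "hold", "protected-namespace"
    if call.verb in {"delete", "deletecollection"} and call.resource in HELD_DELETE_RESOURCES:
        return "hold", "destructive-delete"
    # Cordon is a patch that marks the node unschedulable.
    if call.resource == "nodes" and call.body.get("spec", {}).get("unschedulable") is True:
        return "hold", "node-cordon"
    # Drain evicts pods through the eviction subresource.
    if call.resource == "pods" and call.subresource == "eviction":
        return "hold", "pod-eviction"
    if call.namespace in PRODUCTION_NAMESPACES and call.verb in {"patch", "update"}:
        if call.body.get("spec", {}).get("replicas") == 0:
            return "hold", "scale-to-zero"
        if touches_image_or_limits(call.body):
            return "hold", "prod-image-or-limits-change"
    return "forward", "no-policy-matched"


if __name__ == "__main__":
    print(decide(Call("get", "pods", "prod")))
    print(decide(Call("delete", "namespaces")))
    print(decide(Call("patch", "deployments", "prod", "scale", {"spec": {"replicas": 0}})))
```

two simplifications are worth naming. the sketch holds any prod patch that mentions an image or limits, and it only catches scale-to-zero; telling whether a value actually changed, or whether a replica reduction is drastic, would need a comparison against the live object. it also lets reads through even in protected namespaces, which is one possible reading of the policy, not the only one.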
3. keep a human in the loop for the irreversible 1%
the goal is not to slow the agent down on everything; it's to put a person on the one decision that can't be undone. when a held request lands in the review queue, the reviewer sees the exact verb, resource, namespace, and payload, then either approves it, which forwards it to the real api server, or denies it, which stops it cold. this is the same pattern we describe in human-in-the-loop security for ai operations, applied to the cluster.
rbac says what an agent could do. an approval proxy says what it may do, right now, with a name attached to the decision.
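one way to picture the hold, approve, and deny flow is a small in-memory queue. this is a sketch under assumptions, not how agent.shield implements its review queue: HeldRequest, ReviewQueue, and the injected forward callable are hypothetical names, and a real system would persist requests and authenticate reviewers.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class HeldRequest:
    # What the reviewer sees: verb, resource, namespace, and payload.
    verb: str
    resource: str
    namespace: str
    payload: dict
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "pending"  # pending -> approved | denied
    reviewer: Optional[str] = None
    decided_at: Optional[datetime] = None


class ReviewQueue:
    def __init__(self, forward: Callable[[HeldRequest], object]):
        # forward is whatever sends an approved request on to the real API server.
        self._forward = forward
        self._requests: dict[str, HeldRequest] = {}

    def hold(self, request: HeldRequest) -> str:
        self._requests[request.id] = request
        return request.id

    def pending(self) -> list[HeldRequest]:
        return [r for r in self._requests.values() if r.status == "pending"]

    def _decide(self, request_id: str, reviewer: str, status: str) -> HeldRequest:
        request = self._requests[request_id]
        if request.status != "pending":
            raise ValueError(f"request {request_id} was already {request.status}")
        if not reviewer:
            raise ValueError("a decision needs a named reviewer")
        request.status = status
        request.reviewer = reviewer
        request.decided_at = datetime.now(timezone.utc)
        return request

    def approve(self, request_id: str, reviewer: str) -> object:
        # Only an approved request ever reaches the forward callable.
        return self._forward(self._decide(request_id, reviewer, "approved"))

    def deny(self, request_id: str, reviewer: str) -> HeldRequest:
        return self._decide(request_id, reviewer, "denied")


if __name__ == "__main__":
    queue = ReviewQueue(forward=lambda r: f"forwarded {r.verb} {r.resource}")
    rid = queue.hold(HeldRequest("delete", "namespaces", "payments", {}))
    print(queue.approve(rid, reviewer="alice"))
```

the design choice that matters here is that a decision is recorded once, with a reviewer name and timestamp, and cannot be replayed: a second approve or deny on the same request raises instead of forwarding again.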
4. log every cluster action
kubernetes audit logs are good but noisy. pair them with an agent-level record that ties each intercepted call to the policy it matched and the human who approved or denied it. when an incident review asks how a namespace disappeared, you want one timeline, not a grep across api-server logs. see logging and auditing ai agent actions in production for how to structure that trail.
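a minimal sketch of such an agent-level record, assuming a json lines log where each entry ties one call to the policy it matched and the person who decided it. the field names and the audit_record, append, and timeline helpers are hypothetical, not a format defined by kubernetes or agent.shield.

```python
import io
import json
from datetime import datetime, timezone


def audit_record(*, agent, verb, resource, namespace, name, policy, decision,
                 reviewer=None, at=None) -> dict:
    """One entry per intercepted call. Timestamps are assumed to be UTC."""
    at = at or datetime.now(timezone.utc)
    return {
        "timestamp": at.isoformat(timespec="microseconds"),
        "agent": agent,
        "verb": verb,
        "resource": resource,
        "namespace": namespace,
        "name": name,
        "policy": policy,       # which rule matched, if any
        "decision": decision,   # forwarded | approved | denied
        "reviewer": reviewer,   # None when no human was involved
    }


def append(stream, record: dict) -> None:
    """Write one record as a single JSON line."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def timeline(lines, namespace: str) -> list[dict]:
    """All records for one namespace, oldest first."""
    records = (json.loads(line) for line in lines if line.strip())
    # Fixed-format UTC ISO timestamps sort chronologically as strings.
    return sorted((r for r in records if r["namespace"] == namespace),
                  key=lambda r: r["timestamp"])


if __name__ == "__main__":
    log = io.StringIO()  # stands in for a file or log shipper
    append(log, audit_record(agent="triage-agent", verb="delete", resource="deployments",
                             namespace="payments", name="api", policy="destructive-delete",
                             decision="approved", reviewer="alice"))
    for entry in timeline(log.getvalue().splitlines(), "payments"):
        print(entry["timestamp"], entry["verb"], entry["resource"], entry["decision"], entry["reviewer"])
```

with records shaped like this, the incident question becomes a filter on one namespace instead of a search across api-server logs.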
frequently asked questions
does this require changing my agent's code?
no. you point the agent's kubernetes client or kubectl wrapper at the proxy url instead of the api server. auth, payloads, and verbs pass through unchanged — only destructive calls are held for review.
won't an approval step slow down incident response?
only on destructive actions. read-only triage, log pulls, and safe restarts forward instantly. the approval gate applies to the handful of irreversible verbs — delete, drain, scale-to-zero — where a five-second pause is cheaper than an outage.
isn't rbac alone enough?
rbac is necessary but static. it grants a capability for as long as the binding exists; it can't judge whether deleting this namespace right now is correct. an in-line approval layer adds that just-in-time judgment on top of rbac.
what about agents using the kubernetes api directly instead of kubectl?
same approach. the proxy sits in front of the api server endpoint the agent calls. it inspects method, path, and body, so a raw DELETE to /apis/apps/v1/namespaces/prod/deployments/x is caught just like a kubectl delete.
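to make the method-and-path inspection concrete, here is a sketch of how a raw request could be mapped to a verb, resource, and namespace using the standard kubernetes resource url layout (/api/v1/... for the core group, /apis/{group}/{version}/... for named groups). parse_call and ParsedCall are hypothetical names, and the parser is deliberately partial: it ignores legacy /watch/ paths and non-resource endpoints.

```python
from typing import NamedTuple
from urllib.parse import parse_qs, urlsplit


class ParsedCall(NamedTuple):
    verb: str
    group: str        # "" for the core API group
    resource: str
    namespace: str    # "" for cluster-scoped resources
    name: str
    subresource: str


def parse_call(method: str, url: str) -> ParsedCall:
    """Map an HTTP method and Kubernetes resource URL to an API-level call."""
    split = urlsplit(url)
    parts = [p for p in split.path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "api":
        group, rest = "", parts[2:]            # /api/{version}/...
    elif len(parts) >= 3 and parts[0] == "apis":
        group, rest = parts[1], parts[3:]      # /apis/{group}/{version}/...
    else:
        raise ValueError(f"not a resource path: {split.path}")

    namespace = ""
    # /namespaces/{ns}/{resource}/... is namespaced; /namespaces/{name} and its
    # status/finalize subresources address the namespace object itself.
    if len(rest) >= 3 and rest[0] == "namespaces" and rest[2] not in ("status", "finalize"):
        namespace, rest = rest[1], rest[2:]
    if not rest:
        raise ValueError(f"no resource in path: {split.path}")

    resource = rest[0]
    name = rest[1] if len(rest) > 1 else ""
    subresource = rest[2] if len(rest) > 2 else ""

    watching = parse_qs(split.query).get("watch", [""])[0] in ("true", "1")
    method = method.upper()
    if method == "GET":
        verb = "watch" if watching else ("get" if name else "list")
    elif method == "DELETE":
        verb = "delete" if name else "deletecollection"
    else:
        verb = {"POST": "create", "PUT": "update", "PATCH": "patch"}.get(method)
        if verb is None:
            raise ValueError(f"unsupported method: {method}")
    return ParsedCall(verb, group, resource, namespace, name, subresource)


if __name__ == "__main__":
    print(parse_call("DELETE", "/apis/apps/v1/namespaces/prod/deployments/x"))
    print(parse_call("DELETE", "/api/v1/namespaces/payments"))
```

because the parse depends only on the request itself, a kubectl delete and a hand-built DELETE produce the same ParsedCall and would be matched by the same policy.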
related reading
ai agent access control for devops and sre teams
build access control for ai agents the way sre teams build it for services: least privilege, short-lived scopes, and a human gate on irreversible actions.
human-in-the-loop security for ai operations
what human-in-the-loop security means for ai operations, when to require a human gate, and how to add one without killing the speed that makes agents useful.
logging and auditing ai agent actions in production
how to log and audit ai agent actions in production so incident reviews take minutes, not days: capture every call, decision, and identity in one trustworthy trail.
get started with agent.shield
put a human back in the loop for the actions that can't be undone. no agent rewrite — just a url your agent already knows how to call.