The Approval Button Is Not Governance

Human-in-the-loop only works when humans are reviewing the right object: not a confident summary, but the proposed change and its consequences.

Jun 18, 2026

Every AI system has a moment when it stops being a toy and starts becoming infrastructure.

It is not the first prompt. It is not the first impressive answer. It is not even the first time the model writes code. It is the moment someone accepts the output and lets it change something real.

A pull request gets merged. A Terraform change gets applied. A customer email gets sent. A dashboard query becomes the source of a business decision. A ticket is closed. A permission changes. A workflow moves to the next state.

That is the boundary.

Before that moment, AI is mostly producing suggestions. After that moment, AI is participating in operations.

And the way we evaluate AI today is still mostly built for the world before that boundary. We ask whether the model answered correctly, whether it hallucinated, whether the response was grounded, whether it followed the prompt, whether the tone was right, and whether the generated output matched some expected answer.

Those are useful questions. But they are chatbot-era questions.

They assume the object being evaluated is an answer.

Production AI is moving toward a different object: a change.

And a change cannot be evaluated like an answer.

An answer can be judged mostly in isolation. A change cannot. A change has context, dependencies, owners, blast radius, cost, timing, permissions, rollback complexity, and second-order effects. It enters a system that already has state, history, constraints, and people depending on it.

A wrong AI answer is annoying.

A wrong AI-generated change is operational debt.

A wrong AI-generated change that passes review is an incident waiting for a timestamp.

That is why I think the next serious layer of AI architecture is not another prompt framework, another benchmark, or another agent demo.

AI needs a Change Control Plane.

The risk is not the response. The risk is the mutation.

We often talk about AI risk as if the danger lives entirely inside the model.

The model hallucinated. The model misunderstood the request. The model used stale context. The model generated insecure code. The model sounded too confident.

All of that matters. But in production systems, the real risk does not fully enter the organization when the model generates text.

It enters when that text becomes a mutation.

The merge is the risk boundary. The apply is the risk boundary. The send button is the risk boundary. The state transition is the risk boundary.

This is where many AI systems are still pointed at the wrong object. They evaluate the response, but the organization absorbs the consequence of the change.

A coding assistant does not merely generate code. It proposes a behavior change. A DevOps assistant does not merely generate Terraform. It proposes infrastructure mutation. A data assistant does not merely generate SQL. It proposes a new interpretation of business reality. A support assistant does not merely draft a reply. It proposes a promise to a customer.

Once AI enters that layer, answer-quality evaluation becomes necessary but insufficient.

The question is no longer only:

Was the output good?

The question becomes:

What happens if we accept this change?

That is the shift.

The model’s response is only the visible surface. The deeper question is what the response will do once it crosses into the systems people rely on.

That crossing is where production AI begins.

Terraform already taught us this lesson.

Infrastructure engineers already understand this pattern.

You do not apply Terraform just because the configuration looks reasonable. You inspect the plan.

That is the whole point of the plan. It gives you a preview of consequences before you mutate the environment. It shows the difference between declared intent and current state. It tells you what Terraform intends to create, update, or destroy before it touches anything.

A Terraform plan is not just an explanation.

It is a preview of change.

That distinction matters because AI systems are very good at explanations. They can always produce one. They can explain why the code is correct, why the architecture is sound, why the query works, why the email is appropriate, and why the action is safe.

But an explanation is not the same as a plan.

An explanation says, “Here is why this might be right.”

A plan says, “Here is what will change if you accept this.”

That is what AI-generated work needs.

If an AI system proposes a code change, I do not only want a summary of the code. I want to know what behavior changes, what tests ran, what dependencies are touched, and who owns the affected path.

If an AI system proposes an infrastructure change, I do not only want valid HCL. I want to know what resources will be created, updated, destroyed, exposed, renamed, or made more expensive.

If an AI system proposes a customer response, I do not only want a polite tone. I want to know what commitment we are making, whether the customer is eligible for it, and whether the reply should have been escalated.

If an AI system proposes a data query, I do not only want syntactically valid SQL. I want to know which metric it changes, which dashboard it feeds, which assumptions it makes, and which decision might depend on it.

The output is not the answer.

The output is the proposed change.

And proposed changes need plans.

The missing artifact is the AI Change Plan.

Before accepting AI-generated work, I want a review object that answers a simple question:

What will happen if we apply this?

That object is an AI Change Plan.

It does not have to be a literal YAML file. It could live inside a pull request, a ticket, an approval request, a deployment workflow, a support console, or an internal automation system. The format matters less than the discipline.

The point is to stop treating AI output as a self-contained response and start treating it as a proposed change against a real system.

Imagine a platform engineer asks an AI system to add a Redis cache to the staging payments service.

A weak AI workflow might produce a Terraform file, a short summary, and a cheerful note saying the change is ready. A slightly better workflow might run formatting and validation checks, then generate a PR description.

But a production-grade workflow should produce something closer to a change plan.

It should say that the request is scoped to staging, not production. It should show that the AI inspected the current Terraform state and did not find an existing Redis cache for the staging payments service. It should describe the proposed resource, the security group update, and the variables being added. It should estimate the cost increase. It should explain that the blast radius is limited to the internal staging environment. It should call out the security impact of a new network access path. It should attach the Terraform plan, policy results, and cost estimate. It should say what was not verified, such as whether the inferred eviction policy is correct or whether load testing has been done. It should recommend opening a pull request, but not applying the change until a platform owner approves it. It should also describe how to roll the change back.
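One way to make that review object concrete is a small structured record. The sketch below is only an illustration of the Redis example: the `ChangePlan` class, its field names, and the rollback wording are my assumptions, not a standard schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ChangePlan:
    """Hypothetical review object for one AI-proposed change."""

    intent: str                  # what was asked, in the requester's words
    environment: str             # stated explicitly, never inferred silently
    state_inspected: list[str]   # what the AI actually looked at
    proposed_changes: list[str]  # resources, files, or messages to be changed
    blast_radius: str
    security_impact: str
    evidence: dict[str, str]     # artifact name -> where the reviewer can find it
    unverified: list[str]        # what the AI could not confirm
    recommended_action: str
    required_approver: str
    rollback: str

    def missing_fields(self) -> list[str]:
        """Names of fields left empty, so a gate can reject incomplete plans."""
        return [name for name, value in vars(self).items() if not value]


plan = ChangePlan(
    intent="Add a Redis cache to the staging payments service",
    environment="staging",
    state_inspected=["Terraform state: no existing Redis cache for staging payments"],
    proposed_changes=["new Redis cache resource", "security group update", "new variables"],
    blast_radius="internal staging environment only",
    security_impact="adds a new network access path to the cache",
    evidence={
        "terraform_plan": "attached",
        "policy_results": "attached",
        "cost_estimate": "attached",
    },
    unverified=["inferred eviction policy", "no load testing done"],
    recommended_action="open a pull request; do not apply",
    required_approver="platform owner",
    rollback="revert the pull request and re-apply the previous configuration",
)

print(plan.missing_fields())  # [] because every field was filled in
```

In this sketch an empty `unverified` list counts as missing on purpose: a plan that claims nothing is unverified deserves a second look before it reaches a reviewer.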

That is not just a better explanation.

That is a different control surface.

It gives the reviewer something real to inspect. It gives a policy engine something structured to evaluate. It gives the organization a record of what was known at the time. It gives future incident review a trail to follow.

This is what is missing from many AI workflows today.

They produce output. They produce summaries. They produce confidence. They produce explanations. But they do not consistently produce a reviewable change plan.

And without a change plan, the human reviewer is often approving a story about the work instead of the work itself.

The approval button is not governance.

One of the most comforting phrases in AI product design is “human in the loop.”

It sounds safe. It sounds responsible. It sounds like we have solved the problem.

But human approval is only useful when the human is shown the right object.

If the human is approving a summary, they are approving the AI’s story. If the human is approving a diff without context, they are approving incomplete work. If the human is approving an action without policy results, they are accepting hidden risk. If the human is approving a workflow without rollback, they are signing up for cleanup.

The approval button is not governance.

The review object is governance.

A tired engineer clicking approve on an AI-generated summary is not a control plane. It is liability with a nice UI.

This matters because AI systems can be very persuasive. A model can produce a confident explanation for a weak change. It can make incomplete work sound complete. It can describe a risk as if it has been handled when it has only been mentioned.

That is why I do not want the AI to merely explain itself.

I want evidence.

There is a difference.

An explanation is narrative. Evidence is operational.

An explanation says the change is safe. Evidence shows which checks ran, which state was inspected, which policy passed, which diff was generated, which cost was estimated, which uncertainty remains, and which rollback path exists.

Production AI systems need explanations, but they should not rely on explanations.

They need evidence before execution.

The Change Control Plane sits between AI output and system mutation.

The AI Change Plan is the artifact.

The AI Change Control Plane is the system that produces, evaluates, routes, approves, blocks, executes, records, and learns from those plans.

It is the layer between AI-generated work and real-world mutation.

The model should not be the control plane. The agent should not be the control plane. The chat interface should not be the control plane.

The control plane is the part of the system that stays boring on purpose.

It does not care how confident the model sounds. It does not care how elegant the generated code looks. It does not care how impressive the demo was. It asks the same production questions every time.

What is the AI trying to change? What current state did it use? What systems are affected? What is the blast radius? What evidence supports the change? What policy applies? Who owns the affected area? Can this be reversed? Should it be auto-accepted, reviewed, escalated, revised, or blocked?

That is the job.

The control plane begins by bounding intent. This matters because AI systems are good at expanding work. A request to “clean this up” can quietly become a refactor. A request to “fix the alert” can become a change to monitoring policy. A request to “make this faster” can become a caching layer, a database index, and a new operational burden. The first job of the control plane is to say: this is the change being requested, and this is the scope.
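Bounding intent can start as something very plain: write down the scope, then compare the proposal against it. The sketch below assumes scope can be expressed as an environment plus a set of path prefixes; the `Scope` class, the `out_of_scope` helper, and the file paths are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """Hypothetical record of what the requester actually asked for."""

    intent: str
    environment: str
    allowed_prefixes: tuple[str, ...]  # paths the change is allowed to touch


def out_of_scope(scope: Scope, environment: str, touched_paths: list[str]) -> list[str]:
    """Return every touched path that falls outside the approved scope."""
    if environment != scope.environment:
        # Wrong environment: nothing in the proposal is in scope.
        return list(touched_paths)
    return [p for p in touched_paths if not p.startswith(scope.allowed_prefixes)]


scope = Scope(
    intent="fix the alert",
    environment="staging",
    allowed_prefixes=("alerts/payments/",),
)

# The AI "fixed the alert" and also edited the shared monitoring policy.
proposal = ["alerts/payments/latency.yaml", "monitoring/policy/defaults.yaml"]
print(out_of_scope(scope, "staging", proposal))
# ['monitoring/policy/defaults.yaml']
```

Path prefixes are a crude proxy for scope, but even a crude check turns quiet scope expansion into something a reviewer sees.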

Then it captures state. A change is only meaningful relative to the current world. For code, that might mean the repository, dependencies, recent commits, and ownership. For infrastructure, it might mean Terraform state, environment, policy bundle, cost baseline, and cloud inventory. For data, it might mean schema, lineage, freshness, and downstream dashboards. For support, it might mean account status, customer tier, refund policy, and escalation history.
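Captured state is most useful when later steps can tell whether it still holds. A minimal sketch, assuming the relevant state can be summarized as a JSON-serializable dictionary: hash the snapshot at proposal time and compare it later. The `state_fingerprint` helper and the snapshot fields and values are illustrative assumptions.

```python
import hashlib
import json


def state_fingerprint(snapshot: dict) -> str:
    """Hash a state snapshot so later steps can detect that the world moved."""
    # Sorting keys makes the hash independent of dictionary ordering.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


captured = {
    "environment": "staging",
    "terraform_state_serial": 41,
    "policy_bundle": "v7",
    "monthly_cost_baseline_usd": 1200,
}
fingerprint_at_proposal = state_fingerprint(captured)

# Someone applies an unrelated change before the AI's plan is reviewed.
current = dict(captured, terraform_state_serial=42)
print(state_fingerprint(current) == fingerprint_at_proposal)  # False
```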

Then the AI proposes work. That work could be code, configuration, a message, a query, a runbook, a ticket update, or an action plan. But before that work is accepted, the control plane turns it into a reviewable change plan.

Then comes impact analysis. This is where the system asks what could happen if the change goes through. Is this local or production? Is it reversible? Does it touch customer-facing behavior? Does it increase cost? Does it change permissions? Does it affect compliance-sensitive language? Does it alter a metric that leadership uses?

Then comes policy evaluation. Some rules can be encoded. Some require humans. A production destroy should be blocked without explicit approval. A pricing claim should route to the right team. A query that changes revenue metrics should require data owner review. A staging-only infrastructure change below a cost threshold might be allowed to move faster.
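The encodable part of those rules can be very small. The sketch below writes the four examples above as plain Python checks; the `Change` fields, the finding strings, and the cost threshold are assumptions for illustration, and a real deployment would more likely use a dedicated policy engine.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical summary of a proposed change, as seen by the policy layer."""

    kind: str                      # "infrastructure", "customer_message", "query"
    environment: str               # "staging" or "production"
    destroys_resources: bool = False
    explicit_approval: bool = False
    makes_pricing_claim: bool = False
    changes_revenue_metric: bool = False
    monthly_cost_increase_usd: float = 0.0


STAGING_COST_LIMIT_USD = 50.0  # illustrative threshold, not a recommendation


def evaluate(change: Change) -> list[str]:
    """Return the policy findings that apply; an empty list means no rule fired."""
    findings = []
    if (
        change.environment == "production"
        and change.destroys_resources
        and not change.explicit_approval
    ):
        findings.append("block: production destroy without explicit approval")
    if change.makes_pricing_claim:
        findings.append("route: owning team for pricing claims")
    if change.changes_revenue_metric:
        findings.append("route: data owner review")
    if (
        change.kind == "infrastructure"
        and change.environment == "staging"
        and change.monthly_cost_increase_usd < STAGING_COST_LIMIT_USD
    ):
        findings.append("fast-track: staging change under cost threshold")
    return findings


print(evaluate(Change("infrastructure", "production", destroys_resources=True)))
# ['block: production destroy without explicit approval']
print(evaluate(Change("infrastructure", "staging", monthly_cost_increase_usd=18.0)))
# ['fast-track: staging change under cost threshold']
```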

Then comes decision routing. Not every AI-generated change deserves the same process. Low-risk, reversible changes should move quickly. High-risk, ambiguous, or irreversible changes should face more scrutiny. Dangerous changes should be blocked before they waste human attention.
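Routing can be sketched as a function from a few coarse risk signals to a review path. The signals, the `Decision` names, and the order of the checks below are my assumptions; the only point is that the mapping is explicit and asks the same questions every time.

```python
from enum import Enum


class Decision(Enum):
    AUTO_ACCEPT = "auto_accept"
    REVIEW = "review"
    ESCALATE = "escalate"
    BLOCK = "block"


def route(*, in_scope: bool, production: bool, reversible: bool, ambiguous: bool) -> Decision:
    """Map coarse risk signals to a review path (illustrative ordering)."""
    if not in_scope:
        return Decision.BLOCK        # dangerous: never reaches a reviewer's queue
    if not reversible or ambiguous:
        return Decision.ESCALATE     # high-risk, ambiguous, or irreversible
    if production:
        return Decision.REVIEW       # reversible, but touches production
    return Decision.AUTO_ACCEPT      # low-risk and reversible


print(route(in_scope=True, production=False, reversible=True, ambiguous=False).value)
# auto_accept
print(route(in_scope=False, production=True, reversible=True, ambiguous=False).value)
# block
```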

Then comes execution. The execution gateway should be strict. It should allow only the action that was approved, under the scope that was approved, with the evidence that was attached. If the underlying state changes after approval, the approval should not silently carry forward; the change should be re-planned and re-approved.
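A strict gateway can be sketched by binding each approval to a digest of the exact plan and the exact state the approver saw. Everything named here (`Approval`, `StaleApproval`, `execute`, the plan and state dictionaries) is hypothetical; it is one possible shape, not a reference design.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass


def digest(obj: dict) -> str:
    """Stable hash of a JSON-serializable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Approval:
    approver: str
    plan_digest: str    # the exact plan the approver saw
    state_digest: str   # the state that plan was computed against


class StaleApproval(Exception):
    """Raised when an approval no longer matches what is about to run."""


def execute(plan: dict, current_state: dict, approval: Approval, apply) -> None:
    """Run apply(plan) only if the plan and the state still match the approval."""
    if digest(plan) != approval.plan_digest:
        raise StaleApproval("plan differs from the one that was approved")
    if digest(current_state) != approval.state_digest:
        raise StaleApproval("state changed since approval; re-plan and re-approve")
    apply(plan)


plan = {"action": "create", "resource": "redis-cache", "environment": "staging"}
state = {"environment": "staging", "serial": 41}
approval = Approval("platform-owner", digest(plan), digest(state))

applied = []
execute(plan, state, approval, applied.append)  # matches the approval, so it runs

try:
    execute(plan, {**state, "serial": 42}, approval, applied.append)
except StaleApproval as err:
    print(err)  # state changed since approval; re-plan and re-approve
```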

Finally, the control plane records what happened. What did the AI propose? What did the human change? What was accepted? What was rejected? What was rolled back? Which policies fired? Which systems were affected? What happened afterward?

That record matters because AI adoption without a change ledger becomes invisible operational drift.

The organization moves faster, but it becomes harder to explain why anything changed.
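A change ledger does not need to be elaborate to be useful. The sketch below is an append-only, in-memory version with hypothetical event names and fields; a real one would need durable storage and access control.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LedgerEntry:
    change_id: str
    event: str       # "proposed", "edited", "accepted", "rejected", "rolled_back"
    actor: str       # the model, agent, or human responsible for the event
    detail: str = ""
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


class ChangeLedger:
    """Append-only record of what happened to each AI-proposed change."""

    def __init__(self) -> None:
        self._entries: list[LedgerEntry] = []

    def record(self, entry: LedgerEntry) -> None:
        self._entries.append(entry)

    def history(self, change_id: str) -> list[LedgerEntry]:
        return [e for e in self._entries if e.change_id == change_id]

    def to_jsonl(self) -> str:
        """Serialize one JSON object per line for export or audit."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._entries)


ledger = ChangeLedger()
ledger.record(LedgerEntry("chg-1", "proposed", "ai-assistant", "add Redis cache to staging payments"))
ledger.record(LedgerEntry("chg-1", "edited", "reviewer", "tightened security group rule"))
ledger.record(LedgerEntry("chg-1", "accepted", "platform-owner"))

print([e.event for e in ledger.history("chg-1")])
# ['proposed', 'edited', 'accepted']
```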

Most teams are adopting powerful AI tools before they have the operating model for them.

This is where the timing matters.

AI tools are moving quickly from suggestion to action. At first, the AI only answered questions. Then it drafted artifacts. Then it opened pull requests. Then it called tools. Then it updated tickets. Then it queried systems. Then it started coordinating across workflows.

The capability curve is moving from “assist me” to “do this for me.”

But most organizational governance is still designed for the earlier phase.

Many teams can review an AI draft. They can inspect a PR. They can ask the model to explain its work. They can put a human approval step in front of an action.

But they cannot consistently answer the more important questions.

What changed? Why did it change? What state was the AI looking at? What evidence supported the work? Who approved it? Which policy allowed it? What was the blast radius? Was it reversible? What happened after it shipped?

That gap is where failures will happen.

Not because AI is useless. Not because agents are inherently reckless. Not because automation is bad.

Failures will happen because the operating model around AI-generated work is immature.

The tool can be powerful while the governance is primitive.

That is the uncomfortable part.

A company may buy an AI system that can coordinate multi-step workflows across tools, but still review its work with a summary box and an approve button. That mismatch will not hold.

The more capable the AI system becomes, the more important the control plane becomes.

This is bigger than agents.

I have been writing a lot about agents recently, but I think the deeper pattern is bigger than agents.

Agents make the problem obvious because they collapse the distance between suggestion and action. When an AI system can call tools, open pull requests, update tickets, trigger workflows, query databases, or modify cloud resources, it becomes very clear that response-level evaluation is not enough.

But this problem exists even without agents.

A human can copy AI-generated code into a repo. A human can paste an AI-generated SQL query into a dashboard. A human can send an AI-drafted customer email. A human can follow an AI-generated runbook during an incident. A human can accept an AI-generated policy update.

In all of those cases, the AI may not directly execute the change.

But the AI still produced the work that changed the system.

So the real distinction is not agent versus non-agent.

The real distinction is answer versus change.

Once AI-generated output becomes a change, it needs change management.

That is why I think “AI Change Control Plane” is a better frame than “agent governance.” Agent governance is one part of it. But the larger category is AI-generated work entering operational systems.

The common pattern is not autonomy.

The common pattern is mutation.

The future PR will carry more than a diff.

Software gives us the easiest place to see this coming.

The pull request is becoming the natural container for AI-generated work. That makes sense. It already has review, discussion, tests, ownership, history, and merge control.

But the PR itself has to evolve.

A human-authored PR often carries a lot of invisible context. The reviewer knows the teammate. They may know the project history. They may have discussed the ticket earlier. They may understand why a shortcut was taken. They may know what was intentionally left out.

AI-generated work does not come with that same tacit context.

So the PR has to make more of it explicit.

An AI-generated PR should not only contain a diff and a cheerful summary. It should carry the original intent. It should show what state the AI inspected. It should explain what assumptions were made. It should attach tests and policy results. It should identify uncertainty. It should show blast radius. It should name the required reviewer. It should include a rollback path. It should preserve the trace of how the change was produced.

This is not because AI deserves a special process forever.

It is because AI-generated work often arrives without the social and historical context that human teams use to review work safely.

The control plane’s job is to make that missing context visible.

A good AI PR should make the reviewer’s job easier, not harder.

It should not say:

Trust me.

It should say:

Here is the change. Here is the state I used. Here is the evidence. Here is what I could not verify. Here is the risk. Here is how to roll it back.

That is a reviewable object.
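A merge check could enforce that shape before a human ever opens the PR. The sketch below assumes a hypothetical convention in which the PR description carries one "Name: value" line per section; the section names are my own labels for the items listed above.

```python
from __future__ import annotations

REQUIRED_SECTIONS = (
    "Intent",
    "State inspected",
    "Assumptions",
    "Tests and policy results",
    "Uncertainty",
    "Blast radius",
    "Required reviewer",
    "Rollback",
    "Trace",
)


def missing_sections(pr_body: str) -> list[str]:
    """Return required sections that have no non-empty 'Name: value' line."""
    present = set()
    for line in pr_body.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            present.add(name.strip().lower())
    return [s for s in REQUIRED_SECTIONS if s.lower() not in present]


body = """\
Intent: add a Redis cache to the staging payments service
State inspected: Terraform state for staging
Blast radius: internal staging only
Rollback:
"""

print(missing_sections(body))
# ['Assumptions', 'Tests and policy results', 'Uncertainty', 'Required reviewer', 'Rollback', 'Trace']
```

An empty "Rollback:" line counts as missing, so a section cannot be satisfied by its heading alone.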

A control plane should make the safe path faster.

The predictable objection is that this sounds heavy.

And implemented badly, it would be.

Nobody wants a committee in front of every AI-generated typo fix. Nobody wants a 12-step approval flow for a documentation draft. Nobody wants to turn AI adoption into compliance theater.

But that is not the point of a control plane.

The point is not to slow every change down. The point is to route changes based on risk.

Low-risk, reversible changes should move faster. High-risk, ambiguous changes should get the right review. Dangerous changes should be blocked before they waste human attention.

A good CI/CD system does not exist to slow developers down. It exists so teams can ship more changes with less chaos.

A good deployment pipeline does not block everything. It automates the checks that should be automated and escalates the decisions that need human judgment.

A good AI Change Control Plane should do the same.

It should allow a documentation cleanup to move quickly. It should let a staging-only, low-cost, reversible infrastructure proposal become a PR with the right evidence attached. It should route a production permission change to the right owner. It should block a destructive action that falls outside the approved scope. It should ask for fresh state when the underlying environment has changed. It should refuse to execute work whose approval was based on stale evidence.

That is not bureaucracy.

That is how you turn AI from a clever assistant into reliable operational leverage.

AI adoption needs a system of record for change.

There is another reason this matters.

As organizations adopt AI across teams, AI-generated work starts to appear everywhere: a coding assistant in engineering, a chatbot in support, a sales email assistant, a data analysis copilot, an incident response helper, an internal operations agent, a workflow automation tool, AI-drafted specs in product management, and AI-generated infrastructure changes on the platform team.

Each tool has its own interface. Each has its own logs. Each has its own notion of approval. Each produces work that may eventually enter the business.

At small scale, this feels fine.

At organizational scale, it becomes fragmentation.

Where do we see what AI proposed? Where do we see what humans accepted? Where do we see what was rejected? Where do we see what was rolled back? Where do we see which policies were violated? Where do we see which teams are relying on AI-generated work? Where do we see whether AI is improving throughput or creating rework?

Without a system of record, AI adoption becomes hard to reason about.

The company may feel more productive, but the change history becomes harder to inspect.

That is a dangerous trade.

A Change Control Plane gives AI-generated work a place to live. It gives the organization a way to see not just usage, but consequence.

Not just how many prompts were sent. Not just how many suggestions were accepted. But what changed, how much human correction was needed, what escaped review, what got rolled back, which policies failed, and which workflows became safer or more fragile.

Those are the metrics that will matter.
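Given a ledger of change events, consequence-level metrics fall out of simple counting. The sketch below assumes events arrive as (change_id, event) pairs with hypothetical event names; the three rates are examples I chose, not an established set.

```python
from __future__ import annotations

from collections import Counter


def change_metrics(events: list[tuple[str, str]]) -> dict[str, float]:
    """Compute consequence-level rates from (change_id, event) pairs."""
    by_event: Counter = Counter()
    seen = set()
    for change_id, event in events:
        if (change_id, event) not in seen:  # count each change once per event type
            seen.add((change_id, event))
            by_event[event] += 1
    proposed = by_event["proposed"]
    accepted = by_event["accepted"]
    return {
        # share of proposed changes that were accepted
        "acceptance_rate": accepted / proposed if proposed else 0.0,
        # share of proposed changes a human had to edit
        "correction_rate": by_event["edited"] / proposed if proposed else 0.0,
        # share of accepted changes that were later rolled back
        "rollback_rate": by_event["rolled_back"] / accepted if accepted else 0.0,
    }


events = [
    ("chg-1", "proposed"), ("chg-1", "edited"), ("chg-1", "accepted"),
    ("chg-2", "proposed"), ("chg-2", "accepted"), ("chg-2", "rolled_back"),
    ("chg-3", "proposed"), ("chg-3", "rejected"),
    ("chg-4", "proposed"), ("chg-4", "accepted"),
]
metrics = change_metrics(events)
print(metrics["acceptance_rate"], metrics["correction_rate"])  # 0.75 0.25
```

None of these numbers says anything about prompts sent or suggestions shown, which is the point: they measure what entered the system and what had to be undone.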

The next phase of AI evaluation will look more like release engineering.

The first phase of generative AI evaluation was about model behavior. Could it answer correctly? Could it avoid hallucination? Could it follow instructions? Could it retrieve the right context? Could it produce high-quality output?

The next phase will be about system behavior.

Can the system safely accept AI-generated work? Can it detect when the AI is operating on stale state? Can it route changes to the right reviewer? Can it block out-of-scope actions? Can it attach evidence before approval? Can it keep execution within the approved boundary? Can it roll back? Can it learn from human corrections?

That is a different kind of evaluation.

It looks less like grading an essay and more like reviewing a production change.

It borrows from CI/CD. It borrows from Terraform plans. It borrows from policy-as-code. It borrows from release engineering. It borrows from incident review. It borrows from audit logs and deployment safety.

This is where production AI is heading.

Not because every AI system will become a fully autonomous agent.

But because more AI output will become work. And work changes systems.

The real question is no longer “Did the AI answer correctly?”

We spent the first phase of generative AI asking whether the model could produce the right answer.

That was the right question for the chatbot era.

It is not enough for the production era.

Production AI systems do not only generate text. They generate work. And work changes things.

The teams that win will not be the ones whose AI sounds the most confident. Confidence is cheap. Explanations are cheap. Polished summaries are cheap.

The winning teams will be the ones whose AI-generated work can be planned, reviewed, approved, traced, rolled back, and improved.

They will know what changed. They will know why it changed. They will know what state the AI used. They will know what evidence supported the change. They will know who approved it. They will know which policy allowed it. They will know what happened after.

That is the difference between using AI as a feature and operating AI as infrastructure.

The real question is no longer:

Did the AI give the right answer?

The real question is:

Should we accept the change?

And that is why AI needs a Change Control Plane.
