How I build continuous evaluation & safety gates for agentic apps

A practical guide with CI + Grafana assets

Jan 13, 2026

If you’re building agentic systems that plan, call tools, and take real actions, you can’t treat safety as a checklist.

You need automation that stops bad behavior before users see it.

In this post, I’ll walk through exactly how I design continuous evaluation and safety gates for agentic apps — using a customer support triage agent as the concrete example.

This is not theory.
It’s a runnable architecture with CI gates, canaries, and dashboards.

Save this if you’re short on time

Agentic apps must be treated like safety-critical distributed systems
Continuous evaluation belongs in CI, not in ad-hoc benchmarks
Canary deployments decide whether a model change survives
Observability (Grafana) is the real safety control plane
Below: a full reference setup you can clone and adapt

Why customer support triage is the right example

I intentionally chose customer support triage because it sits at the intersection of:

high volume
real business impact
sensitive data (PII, billing context)

A regression here doesn’t just lower quality — it:

misroutes customers
overloads humans
or leaks data silently

If your safety system works for triage, it will work almost anywhere.

The mental model (how the system actually works)

At a high level, the flow looks like this:

User message
 → Triage agent (LLM + tools)
   → Evaluator (LLM-as-judge + heuristics)
     → Metrics (Prometheus)
       → CI gates + Canary controller
         → Grafana dashboard
           → Human review (only when needed)

📌 Repo reference
customer-support-triage/diagrams/triage-eval-architecture.md

The diagram makes one point:
evaluation and safety are not separate systems; they are part of the runtime.

Step 1: Start with risk, not prompts

Before writing a single test, I list what the agent can actually do.

For a triage agent, that usually means:

• Classifying intent
• Routing tickets
• Sending automated replies
• Touching billing or PII-related flows

Then I assign risk levels:

• Low-risk actions → gated in CI
• Medium-risk actions → gated in canary
• High-risk actions → require human-in-the-loop

This single step decides:

• What blocks a PR
• What triggers rollback
• What must involve a human

Without this, “safety” stays subjective.
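One way to make this concrete is a small risk register that maps each action to a tier and each tier to its gate. The sketch below is illustrative only: the action names, the tier assigned to each action (apart from billing/PII being high risk), and the `gate_for` helper are assumptions, not taken from the companion repo.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Tier -> where actions of that tier are gated (the three tiers listed above).
GATES = {
    Risk.LOW: "ci",
    Risk.MEDIUM: "canary",
    Risk.HIGH: "human_in_the_loop",
}

# Hypothetical action names; the tier assignments are illustrative assumptions.
ACTION_RISK = {
    "classify_intent": Risk.LOW,
    "route_ticket": Risk.MEDIUM,
    "send_auto_reply": Risk.MEDIUM,
    "touch_billing_or_pii": Risk.HIGH,
}


def gate_for(action: str) -> str:
    """Return the gate for an action; unknown actions get the strictest gate."""
    return GATES[ACTION_RISK.get(action, Risk.HIGH)]
```

Defaulting unknown actions to the strictest gate means a newly added tool cannot ship ungated just because nobody classified it.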

Step 2: Build a layered evaluation suite

One evaluation type is never enough.

Here’s what I always include:

• Unit evaluations
Single-turn checks for schema, intent, and routing correctness
(stored in evals/unit_tests.json)

• Regression scenarios
Multi-turn conversations that validate end-to-end behavior
(stored in evals/regression_scenarios.json)

• Adversarial tests
Jailbreaks, prompt injection, PII leakage attempts
(stored in evals/adversarial_tests.json)

• Statistical runs
Large batches to measure variance and tail failures
(executed by runner/run_eval.py)

Important detail:

The test cases themselves are data files, not code.

That means product managers, QA, and support teams can all contribute — without touching model logic.
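To show what "data, not code" can look like, here is a minimal sketch of one unit-evaluation case and a loader that validates it. The field names (`id`, `input`, `expected_intent`, `expected_queue`) are a hypothetical schema; the actual files in the repo may be shaped differently.

```python
import json

# Hypothetical schema for one unit-evaluation case; the real files may differ.
REQUIRED_FIELDS = {"id", "input", "expected_intent", "expected_queue"}

EXAMPLE_CASES = """
[
  {"id": "refund-001",
   "input": "I was charged twice this month",
   "expected_intent": "billing_dispute",
   "expected_queue": "billing"}
]
"""


def load_cases(raw: str) -> list[dict]:
    """Parse a JSON array of cases and reject any case missing a required field."""
    cases = json.loads(raw)
    for case in cases:
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            raise ValueError(f"case {case.get('id', '?')} is missing {sorted(missing)}")
    return cases


cases = load_cases(EXAMPLE_CASES)
print(f"loaded {len(cases)} case(s)")
```

Validating on load keeps a malformed contribution from a non-engineer from silently shrinking the suite.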

Step 3: Continuous evaluation in CI (non-negotiable)

Every change to:

• Prompts
• Routing logic
• Model versions

must pass evaluation before it merges.

Here’s the CI gate (simplified):

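name: Continuous Agent Evaluation
# Run on every pull request so prompt, routing, and model changes are evaluated before merge.
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r runner/requirements.txt
      # A non-zero exit code from the runner fails this job and blocks the PR.
      - run: python runner/run_eval.py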

📌 Repo location:
customer-support-triage/ci/continuous-eval.yml

If the success-rate or safety threshold is not met, the PR fails.

No exceptions.
No “we’ll monitor in prod.”
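For the gate to block a merge, the runner has to turn evaluation results into an exit code. Here is a minimal sketch of that step, assuming per-case results shaped like `{"passed": bool, "safety_violation": bool}`. The threshold values and function names are illustrative; the post does not state the CI thresholds, and this is not the code in `runner/run_eval.py`.

```python
# Illustrative thresholds; tune them to your own risk tiers.
MIN_SUCCESS_RATE = 0.95
MAX_SAFETY_FAILURES = 0


def gate(results: list[dict]) -> tuple[bool, str]:
    """Decide pass/fail from a list of per-case evaluation results."""
    if not results:
        # An empty run proves nothing, so it must not pass.
        return False, "no evaluation results"
    success_rate = sum(r["passed"] for r in results) / len(results)
    safety_failures = sum(r["safety_violation"] for r in results)
    if safety_failures > MAX_SAFETY_FAILURES:
        return False, f"{safety_failures} safety violation(s)"
    if success_rate < MIN_SUCCESS_RATE:
        return False, f"success rate {success_rate:.1%} is below {MIN_SUCCESS_RATE:.0%}"
    return True, f"success rate {success_rate:.1%}"


def exit_code(results: list[dict]) -> int:
    """Print the verdict and return the process exit code (non-zero fails CI)."""
    ok, reason = gate(results)
    print(("PASS: " if ok else "FAIL: ") + reason)
    return 0 if ok else 1
```

In a runner script, passing this value to `sys.exit` is what makes the CI step, and therefore the PR, fail.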

Step 4: Canary deployments are your real safety net

CI catches obvious problems.
Canaries catch real-world ones.

This is the decision logic I rely on, written as pseudocode:

IF canary safety event rate exceeds 0.1%
OR success rate drops by more than 2% vs baseline
OR P95 latency is more than 50% above baseline
THEN
  suspend canary
  rollback deployment
ELSE
  promote

Why this works:

• Safety uses absolute thresholds
• Quality uses relative deltas
• Rollback is automatic and fast

This logic lives in your deployment layer (Argo Rollouts, Flagger, or equivalent).
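The same rule can be sketched in plain Python, which is handy for unit-testing the thresholds before encoding them in your rollout tool. This is not configuration for Argo Rollouts or Flagger. The `Window` type and `decide` function are hypothetical names, and two details are my assumptions: the 2% success drop is read as 2 percentage points, and a window with no traffic fails closed.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Aggregated metrics for one deployment over the comparison window."""
    requests: int
    successes: int
    safety_events: int
    p95_latency_ms: float


MAX_SAFETY_EVENT_RATE = 0.001  # 0.1% of canary requests (absolute threshold)
MAX_SUCCESS_DROP = 0.02        # assumption: 2 percentage points below baseline
MAX_P95_RATIO = 1.5            # P95 more than 50% above baseline


def decide(canary: Window, baseline: Window) -> str:
    """Return "rollback" if any guard trips, otherwise "promote"."""
    if canary.requests == 0 or baseline.requests == 0:
        return "rollback"  # assumption: no traffic means no evidence, so fail closed

    safety_rate = canary.safety_events / canary.requests
    success_drop = (baseline.successes / baseline.requests
                    - canary.successes / canary.requests)

    if safety_rate > MAX_SAFETY_EVENT_RATE:
        return "rollback"
    if success_drop > MAX_SUCCESS_DROP:
        return "rollback"
    if canary.p95_latency_ms > baseline.p95_latency_ms * MAX_P95_RATIO:
        return "rollback"
    return "promote"
```

Note that the safety check never looks at the baseline, while the other two checks only make sense relative to it.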

Step 5: The dashboard that actually matters

If you can’t see agent health, you don’t have safety — you have hope.

*Canary vs baseline comparison for a customer support triage agent. Safety events and latency regressions trigger automatic rollback.*

This dashboard shows, at a glance:

• Canary success rate dipping below baseline
• Safety events crossing thresholds
• Human escalation spikes
• Latency regressions

This is the moment where automation saves you.

📌 Dashboard JSON lives at:
customer-support-triage/grafana/triage-dashboard.json
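Those panels need per-track counters to compare. As a rough sketch, this is how such counters could be kept and rendered in the Prometheus text exposition format using only the standard library. The metric names and the `track` label are hypothetical (the dashboard JSON may expect different ones), and a real service would more likely use an official Prometheus client library than hand-rendered text.

```python
from collections import Counter

# Hypothetical metric names and help strings.
METRICS = {
    "triage_requests_total": "Requests handled by the triage agent.",
    "triage_success_total": "Requests the evaluator scored as successful.",
    "triage_safety_events_total": "Requests flagged as safety events.",
    "triage_escalations_total": "Requests escalated to a human.",
}

# Keyed by (metric name, track), where track is "canary" or "baseline".
counts: Counter = Counter()


def record(track: str, success: bool, safety_event: bool, escalated: bool) -> None:
    """Update all counters for one handled request."""
    counts[("triage_requests_total", track)] += 1
    counts[("triage_success_total", track)] += int(success)
    counts[("triage_safety_events_total", track)] += int(safety_event)
    counts[("triage_escalations_total", track)] += int(escalated)


def render() -> str:
    """Render the counters as Prometheus text exposition format."""
    lines = []
    for name, help_text in METRICS.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        for (metric, track), value in sorted(counts.items()):
            if metric == name:
                lines.append(f'{name}{{track="{track}"}} {value}')
    return "\n".join(lines) + "\n"
```

Labeling every counter with the track is what lets a single dashboard query put canary and baseline side by side.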

Step 6: Humans where they add value (HITL)

Humans should not babysit agents.

They should guard high-impact edges.

My default rules:

• Escalate when evaluator score < 0.7
• Always escalate billing or PII-touching actions
• Randomly sample 1–5% of traffic for audit
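These three rules fit in one small function. The sketch below is an illustration rather than the evaluator in the repo: `needs_human`, the tag names, and the 2% audit rate (one value inside the 1–5% range) are assumptions.

```python
import random

SCORE_THRESHOLD = 0.7                # escalate below this evaluator score
SENSITIVE_TAGS = {"billing", "pii"}  # hypothetical tag names
AUDIT_SAMPLE_RATE = 0.02             # assumption: one value in the 1-5% range


def needs_human(score: float, tags: set[str], rng: random.Random) -> tuple[bool, str]:
    """Return (escalate, reason) for one agent decision."""
    # Sensitive actions are checked first so a high score can never bypass review.
    if tags & SENSITIVE_TAGS:
        return True, "sensitive_action"
    if score < SCORE_THRESHOLD:
        return True, "low_evaluator_score"
    if rng.random() < AUDIT_SAMPLE_RATE:
        return True, "random_audit"
    return False, "auto"
```

Passing the random generator in explicitly keeps the audit sampling reproducible in tests, for example with `random.Random(0)`.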

Human feedback feeds directly back into:

• Regression scenarios
• Adversarial tests

📌 Evaluator logic lives in:
runner/run_eval.py

Step 7: Red-team continuously

Automation catches regressions.
Red-teaming finds blind spots.

My cadence:

• Automated adversarial runs: weekly
• Human red-team: monthly
• After any safety incident: freeze promotions and add regression scenarios that reproduce it

📌 Attack cases live in:
evals/adversarial_tests.json

The real takeaway

Most teams talk about “guardrails.”

Very few can answer this question:

What exactly stops a bad agent deployment at 2 a.m.?

This system does.

With CI gates, canaries, metrics, dashboards, and humans only where necessary.

Continuous evaluation is not optional for agentic AI.
It is the control plane.

Companion repository referenced in this post

Playbook repo
