Setting Guardrail Metrics to Auto-Halt Experiments
This how-to is part of Experimentation & A/B Testing Guardrails. It solves a specific failure mode: an experiment’s treatment variant is actively harming a core metric — error rate climbing, latency spiking, checkout abandonment rising — but the experiment runs uninterrupted until someone manually notices. By the time the post-mortem runs, tens of thousands of users have experienced a degraded product.
The fix is a guardrail monitor: a lightweight comparison loop that checks treatment metrics against control metrics on a short interval and disables the experiment flag automatically when a threshold is breached. This how-to covers defining the thresholds, wiring the comparison, triggering an automatic halt, and alerting the team. It does not cover the statistical framework for choosing success metrics or sample sizes — those are in the experimentation guardrails overview.
Prerequisites
namespace.service.featurekey schema and uses sticky bucketing for consistent variant attribution
Step-by-Step Procedure
Step 1 — Define guardrail metrics and thresholds
Before the experiment starts, write down every metric you are not allowed to harm and the maximum acceptable degradation. Commit these as structured metadata on the flag object so they are in the audit trail and are visible to anyone who reads the flag definition.
# flagd flag definition with guardrail metadata
flags:
checkout.payments.express-pay:
state: ENABLED
variants:
"treatment": true
"control": false
defaultVariant: "control"
targeting:
fractionalEvaluation:
- { "var": "targetingKey" }
- ["treatment", 50]
- ["control", 50]
metadata:
owner: "payments-team"
expiry: "2026-07-20"
experiment_type: "ab"
guardrails:
- metric: "error_rate"
comparison: "relative_increase"
threshold: 0.01 # halt if treatment error rate > control + 1%
- metric: "p95_latency_ms"
comparison: "relative_increase"
threshold: 0.10 # halt if p95 > control + 10%
- metric: "cart_abandonment_rate"
comparison: "relative_increase"
threshold: 0.02 # halt if abandonment > control + 2pp
Do not set thresholds tighter than your normal day-to-day metric variance. If your error rate fluctuates ±0.5% without any experiment, a 0.3% threshold will fire constantly on noise. Baseline your metrics for at least a week before setting guardrail values.
Step 2 — Wire a monitor that compares variant metrics
Build a monitor that queries per-variant aggregates on a short, fixed interval. Two minutes is a reasonable starting point — short enough to catch a fast-moving regression, long enough for the aggregation window to smooth out single-request spikes.
# guardrail_monitor.py
import asyncio, httpx, os, sys, json, math
FLAG_KEY = "checkout.payments.express-pay"
FLAG_API = os.environ["FLAG_API_URL"]
FLAG_TOKEN = os.environ["FLAG_API_TOKEN"]
METRICS_API = os.environ["METRICS_API_URL"]
ALERT_WEBHOOK = os.environ["ALERT_WEBHOOK_URL"]
GUARDRAILS = [
{"metric": "error_rate", "threshold": 0.01},
{"metric": "p95_latency_ms", "threshold": 0.10},
{"metric": "cart_abandonment_rate","threshold": 0.02},
]
MIN_OBSERVATIONS = 100 # do not fire before this many events per variant per metric
async def fetch_metric(client: httpx.AsyncClient, metric: str, variant: str) -> dict:
resp = await client.get(
f"{METRICS_API}/query",
params={"flag": FLAG_KEY, "variant": variant, "metric": metric, "window": "5m"},
)
resp.raise_for_status()
return resp.json() # {"value": float, "count": int}
async def halt_experiment(client: httpx.AsyncClient, reason: str) -> None:
await client.patch(
f"{FLAG_API}/v1/flags/{FLAG_KEY}",
json={"defaultVariant": "control", "state": "DISABLED"},
headers={"Authorization": f"Bearer {FLAG_TOKEN}"},
)
await client.post(
ALERT_WEBHOOK,
json={"text": f"AUTO-HALT: {FLAG_KEY} — {reason}"},
)
print(f"AUTO-HALT triggered: {reason}", file=sys.stderr)
async def run_once() -> None:
async with httpx.AsyncClient(timeout=10) as client:
for g in GUARDRAILS:
control = await fetch_metric(client, g["metric"], "control")
treatment = await fetch_metric(client, g["metric"], "treatment")
if control["count"] < MIN_OBSERVATIONS or treatment["count"] < MIN_OBSERVATIONS:
continue # too few observations — skip this round
if control["value"] == 0:
continue # avoid division-by-zero on zero baseline
relative_delta = (treatment["value"] - control["value"]) / control["value"]
if relative_delta > g["threshold"]:
reason = (
f"{g['metric']}: treatment={treatment['value']:.4f} "
f"control={control['value']:.4f} "
f"delta={relative_delta:.1%} > threshold={g['threshold']:.1%}"
)
await halt_experiment(client, reason)
return # halt fires once; subsequent checks are moot
if __name__ == "__main__":
asyncio.run(run_once())
Run this as a cron job every two minutes, or embed it in your existing monitoring service as a recurring task.
Pitfall: comparing treatment to a time-shifted baseline (yesterday’s control numbers) rather than the concurrent control arm confounds the experiment with time-of-day variation. Always compare treatment and control groups measured in the same time window.
Step 3 — Trigger an automatic flag halt on breach
The halt in the monitor above disables the flag and forces the control variant. Verify that your flag management API accepts a PATCH that sets both state: DISABLED and defaultVariant: "control" atomically. If not, use a force-variant API call instead — the emergency kill-switch runbook shows the pattern for forcing a specific variant through the control plane without disabling the flag object.
# Verify the halt is wired correctly — dry-run without a live experiment
curl -s -X PATCH "${FLAG_API}/v1/flags/checkout.payments.express-pay" \
-H "Authorization: Bearer ${FLAG_API_TOKEN}" \
-H "Content-Type: application/json" \
-d '{"state": "DISABLED", "defaultVariant": "control"}' \
| jq '{state, defaultVariant}'
# expect: {"state": "DISABLED", "defaultVariant": "control"}
Step 4 — Alert with enough context to act immediately
The alert must carry the offending metric, both raw values, the relative delta, and a link to the flag definition. An alert that says “experiment halted” without these values forces the on-call engineer to reconstruct the context from scratch.
# Alert payload — include all values needed for immediate action
alert_body = {
"text": f":rotating_light: AUTO-HALT: `{FLAG_KEY}`",
"attachments": [{
"color": "danger",
"fields": [
{"title": "Guardrail metric", "value": g["metric"], "short": True},
{"title": "Treatment value", "value": f"{treatment['value']:.4f}","short": True},
{"title": "Control value", "value": f"{control['value']:.4f}", "short": True},
{"title": "Relative delta", "value": f"{relative_delta:.1%}", "short": True},
{"title": "Threshold", "value": f"{g['threshold']:.1%}", "short": True},
{"title": "Flag key", "value": FLAG_KEY, "short": False},
],
}],
}
Verification Step
Simulate a regression and confirm the auto-halt fires before trusting it in production:
# 1. Inject a synthetic error_rate spike for the treatment variant into your metrics store
curl -s -X POST "${METRICS_API}/inject" \
-H 'Content-Type: application/json' \
-d '{"flag":"checkout.payments.express-pay","variant":"treatment","metric":"error_rate","value":0.04,"count":500}'
# 2. Run the monitor once (not as a cron; invoke directly)
python guardrail_monitor.py
# 3. Confirm the flag is now disabled
flagctl get checkout.payments.express-pay --env staging -o json | jq '{state, defaultVariant}'
# expect: {"state":"DISABLED","defaultVariant":"control"}
# 4. Confirm the alert webhook received the payload
curl -s "${ALERT_WEBHOOK}/last" | jq '.text'
# expect the AUTO-HALT message with metric values
# 5. Restore for the real experiment
flagctl set checkout.payments.express-pay --state ENABLED --env staging
Gotchas & Edge Cases
- The halt fires during a metric blip, not a real regression: if your metrics store has a delayed ingestion pipeline, a query at T+0 may return stale data that appears to spike. Add a confirmation window — require the threshold to be exceeded for two consecutive monitor runs before triggering a halt. This doubles detection latency but eliminates false positives from ingestion lag.
- Concurrent experiments can cross-contaminate guardrail metrics: if a second experiment is running on the same service at the same time, its treatment arm may be causing the regression your monitor is attributing to your experiment’s treatment. Segment guardrail metrics by all active flag keys before drawing conclusions.
- Progressive delivery](/feature-flag-architecture-lifecycle-management/implementing-progressive-delivery-workflows/) ramps and experiments need separate halt logic: a ramp should halt and hold at the current percentage; an experiment should halt and revert to control. Use the flag metadata
typefield (rolloutvsexperiment) to differentiate halt behaviour in the same monitor.
Troubleshooting & FAQ
The monitor halted the experiment but I cannot see which metric triggered it — where do I look?
The reason string from the halt call should appear in both the alert payload and the flag’s audit trail. If your flag management API does not accept a reason field on the PATCH call, write the reason to a structured log event alongside the halt at the time the monitor fires. See building audit trails for compliance for the full audit log pattern.
How often should the monitor run?
Every 1–5 minutes is the practical range. Below 1 minute, you risk firing on single-interval metric noise and your flag API receives halt calls before the aggregation window is stable. Above 5 minutes, a fast-moving regression (a cache poisoning bug, a sudden upstream timeout cascade) can affect a significant user population before the halt fires.
Can I set different guardrail thresholds for the first hour versus steady state?
Yes. Add a warmup_minutes field to the guardrail metadata and skip threshold checks in the monitor until now - experiment_start > warmup_minutes. The first minutes of an experiment typically have high variance from cache warming and first-visit effects; a fixed warmup window avoids premature halts during that period.