Setting Guardrail Metrics to Auto-Halt Experiments

This how-to is part of Experimentation & A/B Testing Guardrails. It solves a specific failure mode: an experiment’s treatment variant is actively harming a core metric — error rate climbing, latency spiking, checkout abandonment rising — but the experiment runs uninterrupted until someone manually notices. By the time the post-mortem runs, tens of thousands of users have experienced a degraded product.

The fix is a guardrail monitor: a lightweight comparison loop that checks treatment metrics against control metrics on a short interval and disables the experiment flag automatically when a threshold is breached. This how-to covers defining the thresholds, wiring the comparison, triggering an automatic halt, and alerting the team. It does not cover the statistical framework for choosing success metrics or sample sizes — those are in the experimentation guardrails overview.

The monitor queries per-variant metrics every two minutes; a threshold breach triggers an immediate flag halt and routes an alert with the offending metric values.

Prerequisites

OpenFeature server SDK ≥ 1.x with the experiment flag defined and running
Per-variant metric aggregation available in your metrics store (Prometheus, Datadog, or equivalent) — error rate, p95 latency, and at least one business metric split by resolved flag variant
A flag management API or CLI that can update flag state programmatically (flagctl or REST)
An alerting channel (PagerDuty, Slack webhook, or similar) reachable from the monitor process
The experiment flag follows the namespace.service.feature key schema and uses sticky bucketing for consistent variant attribution

Step-by-Step Procedure

Step 1 — Define guardrail metrics and thresholds

Before the experiment starts, write down every metric you are not allowed to harm and the maximum acceptable degradation. Commit these as structured metadata on the flag object so they are in the audit trail and are visible to anyone who reads the flag definition.

# flagd flag definition with guardrail metadata
flags:
  checkout.payments.express-pay:
    state: ENABLED
    variants:
      "treatment": true
      "control": false
    defaultVariant: "control"
    targeting:
      fractionalEvaluation:
        - { "var": "targetingKey" }
        - ["treatment", 50]
        - ["control",  50]
    metadata:
      owner: "payments-team"
      expiry: "2026-07-20"
      experiment_type: "ab"
      guardrails:
        - metric: "error_rate"
          comparison: "relative_increase"
          threshold: 0.01          # halt if treatment exceeds control by more than 1% (relative)
        - metric: "p95_latency_ms"
          comparison: "relative_increase"
          threshold: 0.10          # halt if treatment p95 exceeds control by more than 10% (relative)
        - metric: "cart_abandonment_rate"
          comparison: "relative_increase"
          threshold: 0.02          # halt if treatment exceeds control by more than 2% (relative, not percentage points)

Do not set thresholds tighter than your normal day-to-day metric variance. If your error rate fluctuates ±0.5% without any experiment, a 0.3% threshold will fire constantly on noise. Baseline your metrics for at least a week before setting guardrail values.
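One way to check a proposed threshold against baseline noise is sketched below. It assumes you can export one daily value per metric for the pre-experiment week; the function names and the sample numbers are illustrative, not taken from a real system. Day-to-day swing is only a rough proxy for the noise between two concurrent arms, so treat the result as a lower bound rather than a guarantee.

```python
from statistics import mean


def max_relative_swing(daily_values: list[float]) -> float:
    """Largest relative deviation of any single day from the period mean."""
    baseline = mean(daily_values)
    if baseline == 0:
        raise ValueError("baseline mean is zero; relative swing is undefined")
    return max(abs(v - baseline) / baseline for v in daily_values)


def threshold_is_safe(daily_values: list[float], threshold: float) -> bool:
    """True when the relative threshold sits above the observed baseline noise."""
    return threshold > max_relative_swing(daily_values)


# Hypothetical week of daily error rates with no experiment running.
week = [0.0200, 0.0201, 0.0199, 0.0200, 0.0203, 0.0197, 0.0200]

print(threshold_is_safe(week, 0.01))  # False: the week swings about 1.5% around its mean
print(threshold_is_safe(week, 0.02))  # True: 2% clears the observed swing
```

If the check fails, either widen the threshold or lengthen the aggregation window until the baseline swing drops below it.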

Step 2 — Wire a monitor that compares variant metrics

Build a monitor that queries per-variant aggregates on a short, fixed interval. Two minutes is a reasonable starting point — short enough to catch a fast-moving regression, long enough for the aggregation window to smooth out single-request spikes.

# guardrail_monitor.py
import asyncio
import os
import sys

import httpx

FLAG_KEY = "checkout.payments.express-pay"
FLAG_API = os.environ["FLAG_API_URL"]
FLAG_TOKEN = os.environ["FLAG_API_TOKEN"]
METRICS_API = os.environ["METRICS_API_URL"]
ALERT_WEBHOOK = os.environ["ALERT_WEBHOOK_URL"]

GUARDRAILS = [
    {"metric": "error_rate",           "threshold": 0.01},
    {"metric": "p95_latency_ms",       "threshold": 0.10},
    {"metric": "cart_abandonment_rate","threshold": 0.02},
]
MIN_OBSERVATIONS = 100  # do not fire before this many events per variant per metric


async def fetch_metric(client: httpx.AsyncClient, metric: str, variant: str) -> dict:
    resp = await client.get(
        f"{METRICS_API}/query",
        params={"flag": FLAG_KEY, "variant": variant, "metric": metric, "window": "5m"},
    )
    resp.raise_for_status()
    return resp.json()  # {"value": float, "count": int}


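async def halt_experiment(client: httpx.AsyncClient, reason: str) -> None:
    # Disable the flag and force the control variant in one PATCH.
    resp = await client.patch(
        f"{FLAG_API}/v1/flags/{FLAG_KEY}",
        json={"defaultVariant": "control", "state": "DISABLED"},
        headers={"Authorization": f"Bearer {FLAG_TOKEN}"},
    )
    # Report a rejected PATCH instead of announcing a halt that did not happen.
    status = "AUTO-HALT" if resp.is_success else f"AUTO-HALT FAILED (HTTP {resp.status_code})"
    await client.post(
        ALERT_WEBHOOK,
        json={"text": f"{status}: {FLAG_KEY} - {reason}"},
    )
    print(f"{status}: {reason}", file=sys.stderr)
    resp.raise_for_status()  # exit non-zero so the scheduler surfaces a failed halt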


async def run_once() -> None:
    async with httpx.AsyncClient(timeout=10) as client:
        for g in GUARDRAILS:
            control   = await fetch_metric(client, g["metric"], "control")
            treatment = await fetch_metric(client, g["metric"], "treatment")

            if control["count"] < MIN_OBSERVATIONS or treatment["count"] < MIN_OBSERVATIONS:
                continue  # too few observations — skip this round

            if control["value"] == 0:
                # Relative delta is undefined on a zero baseline. Skipping means a
                # regression from exactly zero goes unchecked this round.
                continue

            relative_delta = (treatment["value"] - control["value"]) / control["value"]
            if relative_delta > g["threshold"]:
                reason = (
                    f"{g['metric']}: treatment={treatment['value']:.4f} "
                    f"control={control['value']:.4f} "
                    f"delta={relative_delta:.1%} > threshold={g['threshold']:.1%}"
                )
                await halt_experiment(client, reason)
                return  # halt fires once; subsequent checks are moot


if __name__ == "__main__":
    asyncio.run(run_once())

Run this as a cron job every two minutes, or embed it in your existing monitoring service as a recurring task.
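If you embed the check in a long-running service instead of cron, a fixed-interval loop along these lines is one option. This is a minimal sketch: `run_every` and its `max_runs` parameter are hypothetical helpers introduced here for illustration, and the loop assumes the task is a zero-argument coroutine function such as `run_once` above.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_every(
    interval_s: float,
    task: Callable[[], Awaitable[None]],
    max_runs: Optional[int] = None,
) -> int:
    """Call `task` on a fixed interval and return the number of completed runs."""
    loop = asyncio.get_running_loop()
    runs = 0
    while max_runs is None or runs < max_runs:
        started = loop.time()
        try:
            await task()
        except Exception as exc:
            # A failed metrics query should not kill the monitor loop.
            print(f"monitor run failed: {exc!r}")
        runs += 1
        # Subtract the time the run took so the interval does not drift.
        elapsed = loop.time() - started
        await asyncio.sleep(max(0.0, interval_s - elapsed))
    return runs
```

With the monitor above, `asyncio.run(run_every(120, run_once))` would approximate the two-minute cron schedule. `max_runs` exists mainly so the loop can be exercised in tests without running forever.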

Pitfall: comparing treatment to a time-shifted baseline (yesterday’s control numbers) rather than the concurrent control arm confounds the experiment with time-of-day variation. Always compare treatment and control groups measured in the same time window.

Step 3 — Trigger an automatic flag halt on breach

The halt in the monitor above disables the flag and forces the control variant. Verify that your flag management API accepts a PATCH that sets both state: DISABLED and defaultVariant: "control" atomically. If not, use a force-variant API call instead — the emergency kill-switch runbook shows the pattern for forcing a specific variant through the control plane without disabling the flag object.

# Verify the halt is wired correctly. This is a real PATCH, so run it against a flag with no live experiment.
curl -s -X PATCH "${FLAG_API_URL}/v1/flags/checkout.payments.express-pay" \
  -H "Authorization: Bearer ${FLAG_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"state": "DISABLED", "defaultVariant": "control"}' \
  | jq '{state, defaultVariant}'
# expect: {"state": "DISABLED", "defaultVariant": "control"}

Step 4 — Alert with enough context to act immediately

The alert must carry the offending metric, both raw values, the relative delta, and a link to the flag definition. An alert that says “experiment halted” without these values forces the on-call engineer to reconstruct the context from scratch.

# Alert payload — include all values needed for immediate action
alert_body = {
    "text": f":rotating_light: AUTO-HALT: `{FLAG_KEY}`",
    "attachments": [{
        "color": "danger",
        "fields": [
            {"title": "Guardrail metric",   "value": g["metric"],               "short": True},
            {"title": "Treatment value",    "value": f"{treatment['value']:.4f}","short": True},
            {"title": "Control value",      "value": f"{control['value']:.4f}",  "short": True},
            {"title": "Relative delta",     "value": f"{relative_delta:.1%}",    "short": True},
            {"title": "Threshold",          "value": f"{g['threshold']:.1%}",    "short": True},
            {"title": "Flag key",           "value": FLAG_KEY,                   "short": False},
        ],
    }],
}

Verification Step

Simulate a regression and confirm the auto-halt fires before trusting it in production:

# 1. Inject a synthetic error_rate spike for the treatment variant into your metrics store
curl -s -X POST "${METRICS_API_URL}/inject" \
  -H 'Content-Type: application/json' \
  -d '{"flag":"checkout.payments.express-pay","variant":"treatment","metric":"error_rate","value":0.04,"count":500}'

# 2. Run the monitor once (not as a cron; invoke directly)
python guardrail_monitor.py

# 3. Confirm the flag is now disabled
flagctl get checkout.payments.express-pay --env staging -o json | jq '{state, defaultVariant}'
# expect: {"state":"DISABLED","defaultVariant":"control"}

# 4. Confirm the alert webhook received the payload
curl -s "${ALERT_WEBHOOK_URL}/last" | jq '.text'
# expect the AUTO-HALT message with metric values

# 5. Restore for the real experiment
flagctl set checkout.payments.express-pay --state ENABLED --env staging

Gotchas & Edge Cases

The halt fires during a metric blip, not a real regression: if your metrics store has a delayed ingestion pipeline, a query at T+0 may return stale or partial data that appears to spike. Add a confirmation window: require the threshold to be exceeded on two consecutive monitor runs before triggering a halt. This roughly doubles detection latency but filters out most false positives caused by ingestion lag.
Concurrent experiments can cross-contaminate guardrail metrics: if a second experiment is running on the same service at the same time, its treatment arm may be causing the regression your monitor is attributing to your experiment’s treatment. Segment guardrail metrics by all active flag keys before drawing conclusions.
Progressive delivery ramps and experiments need separate halt logic: a ramp should halt and hold at the current percentage; an experiment should halt and revert to control. Use the flag metadata type field (rollout vs experiment) to differentiate halt behaviour in the same monitor.
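The two-consecutive-runs confirmation window from the first gotcha could be sketched as a small streak counter. `confirm_breach` and the `streaks` dictionary are hypothetical names introduced for this example; the monitor above does not define them.

```python
def confirm_breach(
    streaks: dict[str, int],
    metric: str,
    breached: bool,
    required: int = 2,
) -> bool:
    """Update the consecutive-breach count for `metric`.

    Returns True only once the metric has breached on `required` runs in a row.
    A single clean run resets the streak to zero.
    """
    streaks[metric] = streaks.get(metric, 0) + 1 if breached else 0
    return streaks[metric] >= required


# Example: the first breach is held back, the second consecutive one confirms.
streaks: dict[str, int] = {}
first = confirm_breach(streaks, "error_rate", breached=True)
second = confirm_breach(streaks, "error_rate", breached=True)
print(first, second)  # False True
```

Under cron each run is a fresh process, so the `streaks` mapping would need to be persisted between runs (a small JSON file or a key in your metrics store, for example). In a long-running service it can stay in memory.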

Troubleshooting & FAQ

The monitor halted the experiment but I cannot see which metric triggered it — where do I look?

The reason string from the halt call should appear in both the alert payload and the flag’s audit trail. If your flag management API does not accept a reason field on the PATCH call, write the reason to a structured log event alongside the halt at the time the monitor fires. See building audit trails for compliance for the full audit log pattern.
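A structured log event for the halt might look like the sketch below. The event name and field names are assumptions made for illustration; align them with whatever schema your audit pipeline already uses.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("guardrail_monitor")


def halt_event(
    flag_key: str,
    metric: str,
    treatment: float,
    control: float,
    threshold: float,
) -> str:
    """Serialize one auto-halt decision as a single JSON line."""
    event = {
        "event": "experiment_auto_halt",  # hypothetical event name
        "flag_key": flag_key,
        "metric": metric,
        "treatment_value": treatment,
        "control_value": control,
        "relative_delta": (treatment - control) / control,
        "threshold": threshold,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event, sort_keys=True)


def log_halt(flag_key: str, metric: str, treatment: float, control: float, threshold: float) -> None:
    """Emit the halt event at WARNING level so it survives default log filters."""
    logger.warning(halt_event(flag_key, metric, treatment, control, threshold))
```

Calling `log_halt` immediately before or after the PATCH keeps the triggering metric recoverable even when the flag API stores no reason.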

How often should the monitor run?

Every 1–5 minutes is the practical range. Below 1 minute, you risk firing on single-interval metric noise and your flag API receives halt calls before the aggregation window is stable. Above 5 minutes, a fast-moving regression (a cache poisoning bug, a sudden upstream timeout cascade) can affect a significant user population before the halt fires.

Can I set different guardrail thresholds for the first hour versus steady state?

Yes. Add a warmup_minutes field to the guardrail metadata and skip threshold checks in the monitor until now - experiment_start > warmup_minutes. The first minutes of an experiment typically have high variance from cache warming and first-visit effects; a fixed warmup window avoids premature halts during that period.
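The warmup check itself is a short comparison. In this sketch, `in_warmup` is a hypothetical helper, and it assumes the experiment start time is available to the monitor as a timezone-aware datetime (from flag metadata or the audit trail, for instance).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def in_warmup(
    experiment_start: datetime,
    warmup_minutes: int,
    now: Optional[datetime] = None,
) -> bool:
    """True while the experiment is still inside its warmup window.

    Threshold checks resume once now - experiment_start > warmup_minutes.
    """
    current = now or datetime.now(timezone.utc)
    return current - experiment_start <= timedelta(minutes=warmup_minutes)
```

In the monitor, an early `if in_warmup(start, warmup_minutes): return` at the top of the run would skip every guardrail until the window has elapsed.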