Exponential Backoff for SDK Reconnection

This how-to is part of Server-Side SDK Integration Patterns. It solves one specific problem: when a server-side SDK loses its connection to the control plane, naively retrying immediately — or on a fixed interval — causes every replica in your fleet to hammer the control plane simultaneously the moment it comes back online. That thundering herd can delay recovery, push the control plane into another failure mode, and extend the window in which your flags serve stale state.

The fix is exponential backoff with full jitter: each replica waits a randomized, growing delay before each retry attempt. The delays spread across the recovery window so connection requests arrive as a gentle ramp rather than a synchronized spike.

[Figure: Exponential backoff retry timeline with full jitter. Three replicas (A, B, C) retry at growing but individually randomized intervals between the disconnect and the control plane coming back online; markers distinguish failed retries from the successful reconnect.]
Full jitter spreads retry attempts across the recovery window; each replica picks a random delay within the exponential envelope so the control plane sees a ramp, not a spike.
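
To make the difference concrete, the sketch below simulates when a fleet's retry lands under a fixed delay versus full jitter, and reports the busiest 100ms bucket for each. The replica count, the cap, the bucket width, and the small seeded generator (`mulberry32`) are assumptions chosen for illustration, not measurements from a real fleet.

```typescript
// Illustrative simulation; all numbers are assumptions, not measurements.

// Small seeded PRNG (mulberry32) so runs are reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Returns the largest number of arrivals that fall into a single bucket.
function peakArrivals(arrivalsMs: number[], bucketMs: number): number {
  const counts = new Map<number, number>();
  let peak = 0;
  for (const t of arrivalsMs) {
    const bucket = Math.floor(t / bucketMs);
    const n = (counts.get(bucket) ?? 0) + 1;
    counts.set(bucket, n);
    if (n > peak) peak = n;
  }
  return peak;
}

const REPLICAS = 1_000;
const CAP_MS = 8_000; // exponential cap for the attempt being simulated
const rand = mulberry32(42);

// Fixed delay: every replica retries at exactly CAP_MS.
const fixedArrivals = Array.from({ length: REPLICAS }, () => CAP_MS);
// Full jitter: every replica picks a uniform delay in [0, CAP_MS).
const jitterArrivals = Array.from({ length: REPLICAS }, () => rand() * CAP_MS);

const fixedPeak = peakArrivals(fixedArrivals, 100);
const jitterPeak = peakArrivals(jitterArrivals, 100);
console.log({ fixedPeak, jitterPeak });
```

With a fixed delay every replica lands in the same bucket, so the peak equals the fleet size; with full jitter the same requests spread across the whole window and the peak is a small fraction of that.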

Prerequisites

Step 1 — Detect a dropped connection

Listen to the provider’s lifecycle events. Most OpenFeature-compatible providers emit events when the connection drops, so you can track when a replica enters a degraded state and when it returns to healthy.

import { OpenFeature, ProviderEvents } from '@openfeature/server-sdk';
import { metrics } from './observability';

const client = OpenFeature.getClient('api');

OpenFeature.addHandler(ProviderEvents.Stale, () => {
  metrics.increment('flag.sdk.connection_lost');
  // Provider is now serving last-known-good; reconnection starts automatically
});

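// ProviderEvents in the OpenFeature server SDK covers Ready, Error, Stale, and
// ConfigurationChanged; it has no Reconnecting member. Count retry attempts in
// your own backoff loop (Step 2) or through a provider-specific hook if one exists.
OpenFeature.addHandler(ProviderEvents.Error, () => {
  metrics.increment('flag.sdk.provider_error');
});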

OpenFeature.addHandler(ProviderEvents.Ready, () => {
  metrics.gauge('flag.sdk.connected', 1);
});

Pitfall: if your provider does not emit Stale events, a silent connection drop goes undetected. To check, block the control-plane port for 10 seconds on a test replica and verify that the flag.sdk.connection_lost metric increments.

Step 2 — Retry with exponential backoff and full jitter

Implement the backoff loop manually if the provider does not include one, or configure the built-in parameters if it does. Full jitter picks the actual delay uniformly at random between zero and the exponential cap for that attempt.

// reconnect.ts
import { metrics } from './observability';

const BASE_MS = 1_000;     // 1s base
const MAX_MS = 60_000;     // 60s ceiling
const MAX_ATTEMPTS = 10;   // give up after this many failed attempts

function backoffDelayMs(attempt: number): number {
  // Exponential cap: base * 2^attempt, capped at MAX_MS
  const cap = Math.min(MAX_MS, BASE_MS * Math.pow(2, attempt));
  // Full jitter: random value in [0, cap)
  return Math.random() * cap;
}

async function reconnectWithBackoff(
  reconnect: () => Promise<void>,
  onExhausted: () => void,
): Promise<void> {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const delay = backoffDelayMs(attempt);
    await sleep(delay);
    try {
      await reconnect();
      return; // success — exit loop
    } catch (err) {
      metrics.increment('flag.sdk.reconnect_attempt', { attempt: String(attempt) });
      if (attempt === MAX_ATTEMPTS - 1) onExhausted();
    }
  }
}

const sleep = (ms: number) => new Promise(r => setTimeout(r, ms));

The same delay calculation in Go. It does not depend on gobreaker, but it can sit alongside the gobreaker circuit breaker if your resilience layer already uses one:

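// backoff.go
package backoff // package name is a placeholder; use your own

import (
    "math"
    "math/rand"
    "time"
)

const (
    baseMs     = 1_000  // 1s base
    maxMs      = 60_000 // 60s ceiling
    maxRetries = 10     // upper bound for the caller's retry loop
)

// backoffDelay returns a full-jitter delay for the given zero-based attempt.
func backoffDelay(attempt int) time.Duration {
    // Named ceiling rather than cap to avoid shadowing the built-in cap().
    ceiling := math.Min(maxMs, float64(baseMs)*math.Pow(2, float64(attempt)))
    jittered := rand.Float64() * ceiling // full jitter: [0, ceiling)
    return time.Duration(jittered) * time.Millisecond
}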

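Pitfall: decorrelated jitter, where each delay is drawn uniformly between the base and three times the previous delay (sleep = min(cap, random_between(base, previous_sleep * 3))), is an alternative that can spread arrivals slightly better, but full jitter is simpler and sufficient for flag SDK reconnection volumes.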
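
If you want to compare the two, here is a minimal sketch of decorrelated jitter as it is commonly described: each delay is drawn uniformly between the base and three times the previous delay, then capped. The function name, the `DJ_` constants, and the injectable `random` parameter are illustrative assumptions, not part of any provider API.

```typescript
// Sketch of decorrelated jitter; names and constants are illustrative.
const DJ_BASE_MS = 1_000;
const DJ_MAX_MS = 60_000;

// Next delay is drawn uniformly from [base, previous * 3), then capped.
function decorrelatedDelayMs(
  previousMs: number,
  random: () => number = Math.random,
): number {
  const upper = Math.max(DJ_BASE_MS, previousMs * 3);
  const candidate = DJ_BASE_MS + random() * (upper - DJ_BASE_MS);
  return Math.min(DJ_MAX_MS, candidate);
}

// Usage: seed with the base delay and feed each result into the next call.
let delayMs = DJ_BASE_MS;
const schedule: number[] = [];
for (let i = 0; i < 8; i++) {
  delayMs = decorrelatedDelayMs(delayMs);
  schedule.push(Math.round(delayMs));
}
console.log(schedule);
```

Unlike full jitter, this variant carries state between attempts (the previous delay) and never sleeps for less than the base, which is part of why the stateless full-jitter function above is easier to reason about.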

Step 3 — Cap the maximum delay

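Set MAX_MS to a value that balances recovery time against control-plane protection. The ceiling bounds the wait between attempts, not the outage itself: with a 60-second ceiling, a replica makes its next attempt at most one minute after the control plane comes back, so it can serve stale state for up to a minute beyond the outage. Dropping the ceiling to 30s halves that window but roughly doubles the reconnection request rate at the high-attempt end. Document the ceiling in your runbook so on-call engineers know the worst-case propagation window for a kill switch.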

# flagd-provider-config.yaml
reconnect:
  base_delay_ms: 1000
  max_delay_ms: 60000
  max_attempts: 10
  jitter: full           # "full" | "equal" | "none"

If the provider config exposes these as environment variables, set them at the pod level so they are visible to operators without a code change.
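
If your provider does not read such variables itself, a small loader can do it in your own code. The sketch below is one way to parse them with safe fallbacks; the `FLAG_RECONNECT_*` variable names are hypothetical, so check your provider's documentation for the names it actually supports.

```typescript
// Hypothetical variable names; check your provider's docs for the real ones.
interface ReconnectConfig {
  baseDelayMs: number;
  maxDelayMs: number;
  maxAttempts: number;
}

type Env = Record<string, string | undefined>;

// Parses a positive integer from env[name], falling back when unset or invalid.
function positiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === '') return fallback;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

function loadReconnectConfig(env: Env): ReconnectConfig {
  const baseDelayMs = positiveInt(env, 'FLAG_RECONNECT_BASE_DELAY_MS', 1_000);
  const maxDelayMs = positiveInt(env, 'FLAG_RECONNECT_MAX_DELAY_MS', 60_000);
  const maxAttempts = positiveInt(env, 'FLAG_RECONNECT_MAX_ATTEMPTS', 10);
  // A ceiling below the base would make the cap meaningless, so clamp it up.
  return { baseDelayMs, maxDelayMs: Math.max(maxDelayMs, baseDelayMs), maxAttempts };
}

// In a Node.js service you would typically pass process.env here.
const config = loadReconnectConfig({ FLAG_RECONNECT_MAX_DELAY_MS: '30000' });
console.log(config);
```

Falling back to the defaults on malformed input, rather than throwing, keeps a typo in a pod spec from taking the service down; log the fallback if you would rather hear about it.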

Step 4 — Resync the full flag state on reconnect

After a successful reconnect, pull the full rule set from the control plane rather than only applying the most recent delta. During the disconnection window your replica may have missed multiple flag changes; applying only the latest event leaves the rule set inconsistent until the next full sync, which may never come if the provider only streams deltas.

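// `provider` is the same instance you passed to OpenFeature.setProvider()
let wasStale = false;
OpenFeature.addHandler(ProviderEvents.Stale, () => { wasStale = true; });

OpenFeature.addHandler(ProviderEvents.Ready, async () => {
  metrics.gauge('flag.sdk.connected', 1);
  if (!wasStale) return; // first boot: the SDK has already initialized the provider
  wasStale = false;
  // Full resync: re-download and compile the complete rule set
  await provider.initialize?.(OpenFeature.getContext());
  metrics.increment('flag.sdk.resync');
});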

Cross-check against distributed caching for flag evaluations: if you have a second-level shared cache, invalidate or refresh the relevant partition on resync so the shared layer does not serve flags that the local layer already knows are stale.

Step 5 — Surface a connection metric for alerting

Export the connection state as a numeric gauge (1 = connected, 0 = disconnected) and alert when it stays at zero for longer than two backoff intervals at the ceiling (2 × MAX_MS, or 120s with the values above). This surfaces silent connection failures that would otherwise go unnoticed until an on-call engineer notices wrong variants in traces.

// health.ts: exposed for Prometheus scrape
import { Gauge } from 'prom-client';
import { OpenFeature, ProviderEvents } from '@openfeature/server-sdk';

const sdkConnected = new Gauge({
  name: 'flag_sdk_connected',
  help: '1 if the flag provider is connected to the control plane, 0 otherwise',
  labelNames: ['provider'],
});

OpenFeature.addHandler(ProviderEvents.Ready,  () => sdkConnected.set({ provider: 'flagd' }, 1));
OpenFeature.addHandler(ProviderEvents.Stale,  () => sdkConnected.set({ provider: 'flagd' }, 0));
OpenFeature.addHandler(ProviderEvents.Error,  () => sdkConnected.set({ provider: 'flagd' }, 0));
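
The alert condition itself usually lives in your monitoring system, but the same rule can be sketched in-process, for example to drive a health endpoint. `DisconnectTracker` below is a hypothetical helper, not part of any SDK; it assumes you call it from the Stale, Error, and Ready handlers and pass in timestamps in milliseconds.

```typescript
// Hypothetical helper: tracks how long the SDK has been disconnected.
class DisconnectTracker {
  private readonly maxDelayMs: number;
  private disconnectedAtMs: number | null = null;

  constructor(maxDelayMs: number) {
    this.maxDelayMs = maxDelayMs;
  }

  // Call from the Stale/Error handlers; keeps the first timestamp of an outage.
  markDisconnected(nowMs: number): void {
    if (this.disconnectedAtMs === null) this.disconnectedAtMs = nowMs;
  }

  // Call from the Ready handler.
  markConnected(): void {
    this.disconnectedAtMs = null;
  }

  // True once the outage has lasted longer than two ceiling-length intervals.
  shouldAlert(nowMs: number): boolean {
    return (
      this.disconnectedAtMs !== null &&
      nowMs - this.disconnectedAtMs > 2 * this.maxDelayMs
    );
  }
}

// Usage with a 60s ceiling: the alert threshold is 120s of continuous outage.
const tracker = new DisconnectTracker(60_000);
tracker.markDisconnected(0);
console.log(tracker.shouldAlert(90_000), tracker.shouldAlert(121_000));
```

Keeping the first disconnect timestamp, rather than overwriting it on every Stale or Error event, means repeated error events during one outage do not reset the clock.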

Verification

Simulate a control-plane outage and confirm the backoff curve behaves as expected:

# Block egress to the control plane on a single replica
iptables -A OUTPUT -p tcp --dport 8013 -j DROP

# Watch the reconnect_attempt counter in real time
watch -n 2 'curl -s http://localhost:9090/metrics | grep flag_sdk_reconnect_attempt'

# Confirm growing delays in the log
docker logs --since 2m my-api-pod | grep "flag.sdk.reconnect_attempt" | \
  awk '{print $1, $NF}' # timestamp + attempt number

# Lift the block and confirm a full resync fires
iptables -D OUTPUT -p tcp --dport 8013 -j DROP
watch -n 1 'curl -s http://localhost:9090/metrics | grep flag_sdk_connected'
# expect: flag_sdk_connected{provider="flagd"} 1 within MAX_MS of the block lift

Gotchas & Edge Cases

Troubleshooting & FAQ

The replicas are all retrying at the same time despite jitter

Full jitter in [0, cap) is uniform but still has a nonzero chance of many replicas picking small values simultaneously, especially at attempt 0 where cap is small. Add a fixed per-pod offset derived from the pod name or IP: delay = backoffDelayMs(attempt) + hashToBucket(podName, 500). This spreads the first retry attempt across a 500ms window.
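
`hashToBucket` is not a library function, so you need to supply it. A minimal sketch, assuming an FNV-1a-style hash over the string's UTF-16 code units is good enough for spreading pods (any stable hash would do):

```typescript
// hashToBucket is a hypothetical helper; FNV-1a is one reasonable choice of hash.
function hashToBucket(key: string, buckets: number): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  // Convert to an unsigned 32-bit value before taking the modulus.
  return (hash >>> 0) % buckets;
}

// Stable per-pod offset in [0, 500) ms added on top of the jittered delay.
function delayWithPodOffsetMs(jitteredMs: number, podName: string): number {
  return jitteredMs + hashToBucket(podName, 500);
}

// The pod name here is made up for illustration.
console.log(delayWithPodOffsetMs(250, 'api-7d9f-abcde'));
```

Because the offset is derived from the pod name, it stays the same across retries and restarts of that pod, so it separates pods from each other without adding more randomness to any single pod's schedule.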

Reconnects succeed but some flags are wrong after recovery

The provider reconnected but did not resync the full state — it applied only the last delta. Check Step 4: provider.initialize() must be called on every Ready event after a Stale transition, not only on the first boot. If the provider SDK does not expose this call, file an issue or fetch the rule-set snapshot manually via the control-plane REST API.

How do I set MAX_ATTEMPTS correctly for my SLA?

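Dividing the acceptable outage duration by MAX_MS undercounts the attempts you need, because the early attempts use much smaller caps and full jitter halves the average delay. With BASE_MS = 1s and MAX_MS = 60s, the caps for 10 attempts are 1, 2, 4, 8, 16, 32, 60, 60, 60, and 60 seconds: 303s in total, so the loop gives up after about 152s of waiting on average and never after more than 303s. To keep retrying through a 10-minute control-plane outage on average, you need roughly 25 attempts (about 31.5s of expected waiting for the first six, then 30s per additional attempt). Adjust BASE_MS, MAX_MS, and MAX_ATTEMPTS together so the last attempt lands near the end of your acceptable window.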
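
One way to sanity-check a chosen MAX_ATTEMPTS is to sum the per-attempt caps and see how long the retry loop would actually keep trying. The sketch below assumes the same doubling-with-ceiling schedule and full jitter as Step 2; the function names are illustrative.

```typescript
// Sums the per-attempt caps to bound how long the retry loop keeps trying.
function backoffBudgetMs(
  baseMs: number,
  maxMs: number,
  attempts: number,
): { worstCaseMs: number; expectedMs: number } {
  let worstCaseMs = 0;
  for (let attempt = 0; attempt < attempts; attempt++) {
    worstCaseMs += Math.min(maxMs, baseMs * 2 ** attempt);
  }
  // Full jitter draws uniformly from [0, cap), so the mean delay is cap / 2.
  return { worstCaseMs, expectedMs: worstCaseMs / 2 };
}

// Smallest attempt count whose expected total wait covers the target window.
function attemptsToCover(baseMs: number, maxMs: number, windowMs: number): number {
  let attempts = 0;
  while (backoffBudgetMs(baseMs, maxMs, attempts).expectedMs < windowMs) {
    attempts++;
  }
  return attempts;
}

console.log(backoffBudgetMs(1_000, 60_000, 10));
console.log(attemptsToCover(1_000, 60_000, 10 * 60_000));
```

Both figures count only the time spent sleeping between attempts; the time each connection attempt itself takes before failing adds to the total.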