Tuning SSE Streaming Connections for Flag Updates

This how-to is part of Polling vs Streaming Flag Synchronization. Streaming transports deliver flag changes in sub-second time, but long-lived SSE connections have a well-known failure mode that polling never hits: an intermediary — a load balancer, reverse proxy, or NAT gateway — closes connections it has not seen traffic on for longer than its idle timeout, and the SDK on the other end never finds out. The connection appears open. Evaluations keep returning the last-known-good state. Flags silently go stale.

This guide covers how to detect that failure, configure heartbeats that keep the connection alive, align proxy timeouts with those heartbeats, and resync cleanly when a drop is eventually detected.

SSE lifecycle: heartbeat events keep the connection alive through proxy idle timeouts; a missed heartbeat triggers reconnect with backoff; reconnect is always followed by a full flag-state resync.

Prerequisites

OpenFeature server SDK ≥ 1.x with an SSE-capable provider (flagd supports both gRPC streaming and SSE)
Access to configure idle-timeout settings on your load balancer or reverse proxy (nginx, ALB, Envoy, etc.)
Provider connection-state events wired to a metric — at minimum flag_sdk_connected gauge
Polling vs streaming decision already made and streaming chosen as the primary transport
Reconnection backoff logic in place — see exponential backoff for SDK reconnection if not

Step 1 — Set heartbeat and keep-alive intervals on the server

Configure the control plane (flagd or your provider) to emit a periodic comment or named heartbeat event on every open SSE stream. Heartbeats keep the connection from appearing idle to any intermediary and give the client a signal it can watch for to detect a dead stream.

flagd sends a gRPC keepalive ping on its streaming endpoint by default; for HTTP/SSE providers, configure the heartbeat interval explicitly:

# flagd configuration — flagd-config.yaml
sync_providers:
  - uri: "file:/etc/flagd/flags.json"
    providerID: core

# In the HTTP provider section or your SSE gateway config (key names are illustrative;
# check your provider's reference for the exact spelling):
http:
  heartbeat_interval: 20s   # emit a comment ": heartbeat" every 20s
  idle_timeout: 0           # disable the server-side idle timeout for SSE streams

For a custom SSE endpoint, emit the heartbeat comment directly:

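// handler.go (Go net/http SSE endpoint)
package flagstream

import (
    "fmt"
    "net/http"
    "time"
)

// flagEvents carries serialized flag changes. A single shared channel delivers
// each event to only one handler; fan out per client in real deployments.
var flagEvents = make(chan string)

func flagStreamHandler(w http.ResponseWriter, r *http.Request) {
    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }

    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")
    w.Header().Set("Connection", "keep-alive")
    w.Header().Set("X-Accel-Buffering", "no") // disable nginx proxy buffering
    flusher.Flush()                           // send headers now so the client sees the stream open

    ticker := time.NewTicker(20 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-r.Context().Done():
            return // client went away
        case <-ticker.C:
            fmt.Fprint(w, ": heartbeat\n\n") // SSE comment; EventSource clients discard it
            flusher.Flush()
        case event := <-flagEvents:
            fmt.Fprintf(w, "data: %s\n\n", event)
            flusher.Flush()
        }
    }
}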

Pitfall: nginx buffers proxied responses by default. Without the X-Accel-Buffering: no response header, clients may not receive events until nginx's buffer fills or the response ends. Set this header or configure proxy_buffering off in the matching nginx location block.

Step 2 — Configure proxy and load-balancer idle timeouts above the heartbeat interval

Every intermediary between the SDK and the control plane has an idle timeout: the time it waits before closing a connection on which it has seen no bytes. Set each intermediary’s idle timeout to at least heartbeat_interval + safety_margin. A safety margin of 50–100% of the heartbeat interval is reasonable to absorb timer jitter and brief traffic pauses; with a 20s heartbeat that gives a floor of 30–40s, so the 60s values used in the examples for this step leave extra headroom.
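The sizing rule above can be checked mechanically across every hop. The following is a minimal sketch, assuming the margin is expressed as a fraction of the heartbeat interval; the Hop type, the function names, and the sample hop list are illustrative and do not come from flagd, nginx, ALB, or Envoy.

```typescript
// Illustrative types and names; not part of any proxy or SDK API.
interface Hop {
  name: string;
  idleTimeoutSeconds: number;
}

// Smallest idle timeout that satisfies heartbeat_interval + safety_margin,
// where the margin is a fraction of the heartbeat interval (0.5 = 50%).
function minIdleTimeoutSeconds(heartbeatSeconds: number, marginFraction = 0.5): number {
  return Math.ceil(heartbeatSeconds * (1 + marginFraction));
}

// Returns the hops whose idle timeout is too short for the given heartbeat.
function undersizedHops(hops: Hop[], heartbeatSeconds: number, marginFraction = 0.5): Hop[] {
  const floor = minIdleTimeoutSeconds(heartbeatSeconds, marginFraction);
  return hops.filter((hop) => hop.idleTimeoutSeconds < floor);
}

// Hypothetical chain of intermediaries between the SDK and the control plane.
const hops: Hop[] = [
  { name: "nginx proxy_read_timeout", idleTimeoutSeconds: 60 },
  { name: "ALB idle timeout", idleTimeoutSeconds: 60 },
  { name: "sidecar idle_timeout", idleTimeoutSeconds: 25 },
];

console.log(minIdleTimeoutSeconds(20)); // 30
console.log(undersizedHops(hops, 20).map((hop) => hop.name)); // ["sidecar idle_timeout"]
```

Running a check like this against the values you actually deployed makes it harder to miss one hop in a multi-proxy path.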

nginx reverse proxy:

# nginx.conf — upstream block for the flag control plane
upstream flagd {
    server flagd.internal:8080;
    keepalive 32;
}

server {
    location /flags/stream {
        proxy_pass         http://flagd;
        proxy_read_timeout 60s;    # must be > heartbeat_interval (20s) + margin
        proxy_buffering    off;
        proxy_set_header   Connection "";
        proxy_http_version 1.1;
    }
}

AWS ALB:

Set the ALB idle timeout to at least 60s via the console or CLI:

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn "$ALB_ARN" \
  --attributes Key=idle_timeout.timeout_seconds,Value=60

Envoy:

# envoy.yaml — route config for the flag-sync cluster
route_config:
  virtual_hosts:
    - name: flagd
      routes:
        - match: { prefix: /flags/stream }
          route:
            cluster: flagd_cluster
            timeout: 0s          # 0 = no per-request timeout for streaming routes
            idle_timeout: 60s    # idle timeout above heartbeat interval

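Pitfall: AWS NLBs apply a 350-second idle timeout to TCP flows by default. That value was fixed for a long time and can now be changed on TCP listeners, but many deployments still run with the default. If your heartbeat interval is longer than the NLB idle timeout you will see unexplained drops on NLB-fronted services. Keep heartbeat intervals well below 300s.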

Step 3 — Detect a dead stream and trigger reconnect

The client side needs its own heartbeat watchdog. If no event (including heartbeat comments) arrives within heartbeat_interval * 2, the stream is likely dead — even if the TCP connection itself still appears open.

// watchdog.ts
// `FlagdProvider` is assumed here to expose `on()` and `reconnect()`; adapt the
// hooks to whatever your provider actually offers. `metrics` is your app's metrics client.
const HEARTBEAT_MS      = 20_000;  // must match server config
const HEARTBEAT_TIMEOUT = HEARTBEAT_MS * 2;

function watchStream(provider: FlagdProvider): () => void {
  let lastSeen = Date.now();

  // Reset the clock on any provider event (flag change or heartbeat)
  const onAny = () => { lastSeen = Date.now(); };
  provider.on('flagChange', onAny);
  provider.on('heartbeat', onAny);

  const watchdogTimer = setInterval(() => {
    if (Date.now() - lastSeen > HEARTBEAT_TIMEOUT) {
      metrics.increment('flag.stream.dead_detected');
      lastSeen = Date.now(); // restart the window so a slow reconnect is not re-triggered every 5s
      provider.reconnect();  // hands off to the reconnection backoff loop (see Prerequisites)
    }
  }, 5_000); // check every 5s

  return () => clearInterval(watchdogTimer); // cleanup on shutdown
}

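Pitfall: the standard EventSource API never surfaces SSE comment lines; its parser discards them before any message event fires. If the provider does not expose a heartbeat event, either have the server send a named heartbeat event (which EventSource does dispatch) or read the raw response stream yourself and treat any received line as activity. The key requirement is that something resets the watchdog clock on every heartbeat interval.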
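One way to observe comment heartbeats on the client is to parse the raw SSE byte stream directly, because EventSource does not report comment lines. The sketch below is a simplified parser under stated assumptions: it handles LF and CRLF line endings only, ignores the id and retry fields, and the SseParser and SseFrame names are made up for this example rather than taken from any SDK.

```typescript
// Simplified SSE frame parser that also surfaces ":"-prefixed comment lines.
// Not a full implementation of the SSE parsing rules (no bare-CR line endings,
// no id/retry handling).
interface SseFrame {
  event: string;      // "message" when the frame has no event field
  data: string;       // data lines joined with "\n"
  comments: string[]; // text of comment lines seen in this frame
}

function parseBlock(block: string): SseFrame {
  const frame: SseFrame = { event: "message", data: "", comments: [] };
  const dataLines: string[] = [];
  for (const line of block.split("\n")) {
    if (line.charAt(0) === ":") {
      frame.comments.push(line.slice(1).trim());
      continue;
    }
    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.charAt(0) === " ") value = value.slice(1); // one optional leading space
    if (field === "event") frame.event = value;
    else if (field === "data") dataLines.push(value);
  }
  frame.data = dataLines.join("\n");
  return frame;
}

class SseParser {
  private buffer = "";

  // Feed one decoded text chunk; returns every frame completed by this chunk.
  push(chunk: string): SseFrame[] {
    // Normalize on the whole buffer so a CRLF split across chunks is still caught.
    this.buffer = (this.buffer + chunk).replace(/\r\n/g, "\n");
    const frames: SseFrame[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      frames.push(parseBlock(this.buffer.slice(0, boundary)));
      this.buffer = this.buffer.slice(boundary + 2);
      boundary = this.buffer.indexOf("\n\n");
    }
    return frames;
  }
}

const parser = new SseParser();
for (const frame of parser.push(": heartbeat\n\ndata: {\"flag\":\"on\"}\n\n")) {
  console.log(frame.event, frame.comments, frame.data);
}
```

In a watchdog, any frame returned by push(), comment-only or not, would reset the last-seen timestamp. Feeding the parser from a streaming HTTP response body (decoded as UTF-8 text) is left to the surrounding client code.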

Step 4 — Resync the full flag state on reconnect

When the watchdog triggers a reconnect and the connection is re-established, pull the complete rule set from the control plane. Do not assume your local state is correct — the stream was dead for an unknown duration and any number of flag changes may have been missed.

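// provider-events.ts
import { OpenFeature, ProviderEvents } from '@openfeature/server-sdk';

// `provider` is the provider instance you registered with OpenFeature;
// `metrics` is your application's metrics client.
let resyncing = false;

OpenFeature.addHandler(ProviderEvents.Ready, async () => {
  if (resyncing) return; // initialize() may itself emit Ready; avoid a resync loop
  resyncing = true;
  try {
    // Always resync on reconnect, not just on first init
    await provider.initialize(OpenFeature.getContext());
    metrics.gauge('flag.stream.connected', 1);
    metrics.increment('flag.stream.resync');
  } finally {
    resyncing = false;
  }
});

OpenFeature.addHandler(ProviderEvents.Stale, () => {
  metrics.gauge('flag.stream.connected', 0);
});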

If you use a distributed cache in front of the SDK, invalidate or refresh the relevant partition after resync to prevent the shared cache from serving a stale snapshot that the local rule set has already corrected.
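The resync-then-invalidate order described above could look roughly like the following. This is a sketch under assumptions: FlagSource, SharedCache, the flag: key prefix, and both function names are hypothetical and are not part of OpenFeature or flagd; flag rules are modeled as opaque strings.

```typescript
// Hypothetical interfaces for illustration; neither belongs to OpenFeature or flagd.
interface FlagSource {
  fetchAllFlags(): Promise<Map<string, string>>; // flag key -> serialized rules
}
interface SharedCache {
  delete(key: string): Promise<void>;
}

// Keys whose rules differ between the stale local snapshot and the fresh one,
// including flags that were added or removed while the stream was down.
function changedFlagKeys(local: Map<string, string>, fresh: Map<string, string>): string[] {
  const keys = new Set<string>(Array.from(local.keys()).concat(Array.from(fresh.keys())));
  return Array.from(keys).filter((key) => local.get(key) !== fresh.get(key)).sort();
}

// Replaces the local rule set with a full snapshot, then deletes the shared-cache
// entry for every flag that changed.
async function resyncAndInvalidate(
  source: FlagSource,
  local: Map<string, string>,
  cache: SharedCache,
  cacheKey: (flagKey: string) => string = (flagKey) => `flag:${flagKey}`,
): Promise<string[]> {
  const fresh = await source.fetchAllFlags();
  const changed = changedFlagKeys(local, fresh);
  local.clear();
  fresh.forEach((rules, key) => local.set(key, rules));
  await Promise.all(changed.map((key) => cache.delete(cacheKey(key))));
  return changed;
}
```

Invalidating only the changed keys keeps cache churn low after a short outage. If other writers can populate the shared cache from older snapshots, flushing the whole partition is the blunter but safer alternative.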

Verification

Confirm the stream survives an idle period and recovers from a forced drop:

# 1. Confirm heartbeats are being emitted by the server
curl -N -H "Accept: text/event-stream" \
  http://flagd.internal:8080/flags/stream 2>&1 | head -20
# expect lines like ": heartbeat" every 20 seconds

# 2. Simulate a silent drop by blocking server traffic for longer than the
#    watchdog timeout (2 x 20s = 40s, plus the 5s check interval); requires root
iptables -A INPUT -p tcp --sport 8080 -j DROP
sleep 60
iptables -D INPUT -p tcp --sport 8080 -j DROP

# 3. Watch the watchdog detect the dead stream and reconnect
watch -n 1 'curl -s http://localhost:9090/metrics | grep -E "flag_stream_(connected|dead_detected|resync)"'
# expect: dead_detected increments, then connected returns to 1, then resync increments

# 4. Verify flags are consistent after resync
curl -s http://localhost:3000/debug/flags/api.search.semantic-rerank | jq .
# expect: { "variant": <current-variant>, "reason": "TARGETING_MATCH" or "DEFAULT" }

Gotchas & Edge Cases

Stacked idle timeouts on nginx + app proxy: if your service runs behind both nginx and an application-level proxy (e.g. an Envoy sidecar), both layers enforce idle timeouts independently. Set each one above the heartbeat interval; missing one is enough to kill the stream silently.
HTTP/2 multiplexing: when the provider uses HTTP/2, a single TCP connection carries multiple streams. The TCP-level keepalive and the HTTP/2 PING frame operate independently from SSE heartbeats. Confirm which layer your provider’s SSE implementation sits on before tuning.
Serverless and ephemeral runtimes: streaming connections are impractical in runtimes that freeze or recycle the process between requests (for example AWS Lambda, or Cloud Run when CPU is allocated only during request handling). Fall back to polling for those environments as documented in polling vs streaming.

Troubleshooting & FAQ

The stream drops every 60 seconds precisely — what is closing it?

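A 60-second idle timeout on an intermediary is almost certainly the cause; 60s is the default for both nginx proxy_read_timeout and the ALB idle timeout. Find the culprit by checking nginx proxy_read_timeout, the ALB idle timeout, and any Envoy route idle_timeout, in that order. The drop interval directly reveals the timeout value; raise it to heartbeat_interval + margin. If the heartbeat is already shorter than 60s, also confirm that heartbeats actually reach that hop rather than sitting in a buffer upstream of it.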
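To read the drop interval off your own logs rather than by eye, a small helper can take the disconnect timestamps and report the typical gap. This is an illustrative sketch: the function name and the use of epoch-millisecond timestamps are assumptions, and the gap it reports also includes reconnect time, so it slightly overestimates the timeout.

```typescript
// Median gap, in seconds, between consecutive disconnect timestamps (epoch ms).
// Returns undefined when fewer than two disconnects were recorded.
function medianDropIntervalSeconds(disconnectsMs: number[]): number | undefined {
  if (disconnectsMs.length < 2) return undefined;
  const sorted = disconnectsMs.slice().sort((a, b) => a - b);
  const gaps = sorted
    .slice(1)
    .map((timestamp, index) => (timestamp - sorted[index]) / 1000)
    .sort((a, b) => a - b);
  const mid = Math.floor(gaps.length / 2);
  return gaps.length % 2 === 1 ? gaps[mid] : (gaps[mid - 1] + gaps[mid]) / 2;
}

// Hypothetical disconnect times taken from SDK logs.
console.log(medianDropIntervalSeconds([0, 60_000, 120_500, 180_500])); // 60
```

The median is used instead of the mean so that one long healthy stretch between drops does not distort the estimate.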

Heartbeats are emitted but the stream still drops

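First confirm that the heartbeats actually reach every hop: a proxy that buffers or compresses the response can hold small writes back, so the next hop still sees an idle connection even though the server is emitting heartbeats. If buffering is ruled out, the heartbeat comment (": heartbeat") may still not count as traffic for some intermediaries that only track HTTP request-response cycles. Switch from comment-style heartbeats to a named SSE event (event: heartbeat\ndata: {}\n\n); this carries actual event data, which intermediaries are more likely to treat as stream traffic and which client event handlers can also observe.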
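For reference, the wire format of a named heartbeat event can be produced by a small formatter. This is a sketch rather than a provider API: the function names and the ts payload field are illustrative assumptions, and a plain {} payload works equally well.

```typescript
// Formats one SSE frame. A multi-line payload becomes one "data:" line per
// payload line, and the frame ends with a blank line.
function formatSseEvent(eventName: string, data: string): string {
  const dataLines = data
    .split("\n")
    .map((line) => `data: ${line}`)
    .join("\n");
  return `event: ${eventName}\n${dataLines}\n\n`;
}

// Named heartbeat frame; the timestamp payload is an illustrative choice.
function heartbeatFrame(now: Date = new Date()): string {
  return formatSseEvent("heartbeat", JSON.stringify({ ts: now.toISOString() }));
}

console.log(JSON.stringify(formatSseEvent("heartbeat", "{}")));
// "event: heartbeat\ndata: {}\n\n"
```

On the server, the heartbeat ticker would write this frame instead of the ": heartbeat" comment and then flush, exactly as it does for flag-change events.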

After reconnect, some flags still return stale variants

The resync call ran but the distributed cache was not invalidated. The SDK’s local rule set is fresh but requests are hitting a stale shared-cache entry with a longer TTL. Invalidate the relevant cache keys after resync, or reduce the shared cache TTL to less than your heartbeat interval.