Tuning SSE Streaming Connections for Flag Updates
This how-to is part of Polling vs Streaming Flag Synchronization. Streaming transports deliver flag changes in sub-second time, but long-lived SSE connections have a well-known failure mode that polling never hits: an intermediary — a load balancer, reverse proxy, or NAT gateway — closes connections it has not seen traffic on for longer than its idle timeout, and the SDK on the other end never finds out. The connection appears open. Evaluations keep returning the last-known-good state. Flags silently go stale.
This guide covers how to detect that failure, configure heartbeats that keep the connection alive, align proxy timeouts with those heartbeats, and resync cleanly when a drop is eventually detected.
Prerequisites
- flag_sdk_connected gauge
- Polling vs streaming decision already made and streaming chosen as the primary transport
- exponential backoff for SDK reconnection if not
Step 1 — Set heartbeat and keep-alive intervals on the server
Configure the control plane (flagd or your provider) to emit a periodic comment or named heartbeat event on every open SSE stream. Heartbeats keep the connection from appearing idle to any intermediary and give the client a signal it can watch for to detect a dead stream.
flagd sends a gRPC keepalive ping on its streaming endpoint by default; for HTTP/SSE providers, configure the heartbeat interval explicitly:
# flagd configuration — flagd-config.yaml
sync_providers:
  - uri: "file:/etc/flagd/flags.json"
    providerID: core
# In the HTTP provider section or your SSE gateway config:
http:
  heartbeat_interval: 20s  # emit a comment ": heartbeat" every 20s
  idle_timeout: 0          # disable the server-side idle timeout for SSE streams
For a custom SSE endpoint, emit the heartbeat comment directly:
// handler.go (Go net/http SSE endpoint)
package main

import (
	"fmt"
	"net/http"
	"time"
)

// flagEvents carries serialized flag-change payloads; the producer is assumed
// to exist elsewhere. A single shared channel delivers each event to only one
// handler, so fan out to a per-connection channel in real code.
var flagEvents = make(chan string)

func flagStreamHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	w.Header().Set("Connection", "keep-alive")
	w.Header().Set("X-Accel-Buffering", "no") // disable nginx proxy buffering
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	ticker := time.NewTicker(20 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-r.Context().Done():
			return // client disconnected
		case <-ticker.C:
			fmt.Fprint(w, ": heartbeat\n\n") // SSE comment; EventSource clients ignore it
			flusher.Flush()
		case event := <-flagEvents:
			fmt.Fprintf(w, "data: %s\n\n", event)
			flusher.Flush()
		}
	}
}
Pitfall: nginx buffers proxied responses by default. Without X-Accel-Buffering: no, clients may not receive events until the buffer fills or the response ends. Set this header or configure proxy_buffering off in your nginx block.
Step 2 — Configure proxy and load-balancer idle timeouts above the heartbeat interval
Every intermediary between the SDK and the control plane has an idle timeout: the time it waits before closing a connection it has seen no bytes on. Set each intermediary's idle timeout to at least heartbeat_interval + safety_margin. A safety margin of 50–100% of the heartbeat interval is reasonable to absorb timer jitter, delayed flushes, and brief traffic pauses.
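The sizing rule above can be checked mechanically across every hop. The following is a minimal sketch, assuming a hypothetical `Hop` shape and helper names that are not part of any SDK or proxy API; the hop names and timeout values are illustrative only.

```typescript
// Hypothetical helper types and names; not part of any SDK or proxy API.
interface Hop {
  name: string;
  idleTimeoutSec: number;
}

// Minimum idle timeout for a hop: heartbeat interval plus a fractional margin.
function minIdleTimeoutSec(heartbeatSec: number, marginFraction: number): number {
  return Math.ceil(heartbeatSec * (1 + marginFraction));
}

// Returns the names of hops whose idle timeout is below the required minimum.
function undersizedHops(hops: Hop[], heartbeatSec: number, marginFraction = 0.5): string[] {
  const required = minIdleTimeoutSec(heartbeatSec, marginFraction);
  return hops.filter((h) => h.idleTimeoutSec < required).map((h) => h.name);
}

// Illustrative values: a 20s heartbeat with a 50% margin requires at least 30s per hop.
const hops: Hop[] = [
  { name: "nginx", idleTimeoutSec: 60 },
  { name: "alb", idleTimeoutSec: 60 },
  { name: "sidecar", idleTimeoutSec: 25 },
];
const tooShort = undersizedHops(hops, 20);
```

Running a check like this in CI against the deployed proxy settings is one way to catch a hop that was added later with a shorter timeout.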
nginx reverse proxy:
# nginx.conf — upstream block for the flag control plane
upstream flagd {
    server flagd.internal:8080;
    keepalive 32;
}
server {
    location /flags/stream {
        proxy_pass http://flagd;
        proxy_read_timeout 60s;  # must be > heartbeat_interval (20s) + margin
        proxy_buffering off;
        proxy_set_header Connection "";
        proxy_http_version 1.1;
    }
}
AWS ALB:
Set the ALB idle timeout to at least 60s via the console or CLI:
aws elbv2 modify-load-balancer-attributes \
--load-balancer-arn "$ALB_ARN" \
--attributes Key=idle_timeout.timeout_seconds,Value=60
Envoy:
# envoy.yaml — route config for the flag-sync cluster
route_config:
  virtual_hosts:
    - name: flagd
      domains: ["*"]  # Envoy requires at least one domain per virtual host
      routes:
        - match: { prefix: /flags/stream }
          route:
            cluster: flagd_cluster
            timeout: 0s        # 0 = no per-request timeout for streaming routes
            idle_timeout: 60s  # idle timeout above heartbeat interval
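Pitfall: AWS NLBs apply a 350-second idle timeout to TCP flows by default. This value was historically fixed; newer NLB listeners allow it to be changed, but the default remains 350s. If your heartbeat interval is longer than the NLB idle timeout you will see unexplained drops on NLB-fronted services. Keep heartbeat intervals well below 300s.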
Step 3 — Detect a dead stream and trigger reconnect
The client side needs its own heartbeat watchdog. If no event (including heartbeat comments) arrives within heartbeat_interval * 2, the stream is likely dead — even if the TCP connection itself still appears open.
// watchdog.ts
const HEARTBEAT_MS = 20_000; // must match server config
const HEARTBEAT_TIMEOUT = HEARTBEAT_MS * 2;

// NOTE: the 'flagChange' and 'heartbeat' events and reconnect() are assumed to be
// exposed by your provider or a wrapper around it; adapt the names to your SDK.
function watchStream(provider: FlagdProvider): () => void {
  let lastSeen = Date.now();
  // Reset the clock on any provider event (flag change or heartbeat)
  const onAny = () => { lastSeen = Date.now(); };
  provider.on('flagChange', onAny);
  provider.on('heartbeat', onAny);
  const watchdogTimer = setInterval(() => {
    if (Date.now() - lastSeen > HEARTBEAT_TIMEOUT) {
      metrics.increment('flag.stream.dead_detected');
      lastSeen = Date.now(); // restart the window so reconnect is not re-triggered every 5s
      provider.reconnect();  // hands off to the SDK's reconnect and backoff logic
    }
  }, 5_000); // check every 5s
  return () => clearInterval(watchdogTimer); // cleanup on shutdown
}
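The reconnect triggered by the watchdog should be spaced out with exponential backoff, as the prerequisites call for. If your SDK does not already provide it, the delay calculation can be sketched as follows. This is a minimal sketch using the "full jitter" variant; `backoffDelayMs`, its defaults, and the injectable `random` parameter are hypothetical names chosen for illustration, not part of any flag SDK.

```typescript
// Hypothetical helper; `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number, // 0 for the first retry
  baseMs = 1_000,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  // Exponential ceiling, capped so retries never wait longer than capMs.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling) so clients do not reconnect in lockstep.
  return Math.floor(random() * ceiling);
}
```

Jitter matters here because an intermediary that drops many idle streams at once would otherwise cause every SDK instance to reconnect at the same moment.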
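Pitfall: if the provider does not expose a heartbeat event, you need another signal. The browser EventSource API discards SSE comment lines, so a ": heartbeat" comment never reaches a message handler. Either have the server send heartbeats as a named event that EventSource can listen for, or read the raw response stream and watch for comment lines yourself. The key requirement is that something resets the watchdog clock on every heartbeat interval.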
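Because EventSource does not surface comment lines to application code, watching for comment-style heartbeats means scanning the raw response text. The following is a minimal sketch of such a scanner, assuming the server terminates lines with `\n` or `\r\n`; `SseLineScanner` is a hypothetical helper, not a library class.

```typescript
// Hypothetical helper: scans raw SSE text and counts comment and data lines.
class SseLineScanner {
  private buffer = "";

  // Feed a decoded text chunk; returns how many complete comment and data lines it contained.
  feed(chunk: string): { comments: number; dataLines: number } {
    this.buffer += chunk;
    const lines = this.buffer.split(/\r?\n/);
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
    let comments = 0;
    let dataLines = 0;
    for (const line of lines) {
      if (line.startsWith(":")) comments++; // SSE comment, e.g. ": heartbeat"
      else if (line.startsWith("data:")) dataLines++;
    }
    return { comments, dataLines };
  }
}

// A heartbeat split across two network chunks is still counted exactly once.
const scanner = new SseLineScanner();
const first = scanner.feed(": heart");
const second = scanner.feed('beat\n\ndata: {"a":1}\n\n');
```

In a client, the chunks would typically come from reading the response body of a fetch call and decoding it with TextDecoder; the watchdog clock is reset whenever either count is non-zero.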
Step 4 — Resync the full flag state on reconnect
When the watchdog triggers a reconnect and the connection is re-established, pull the complete rule set from the control plane. Do not assume your local state is correct — the stream was dead for an unknown duration and any number of flag changes may have been missed.
// provider-events.ts
import { OpenFeature, ProviderEvents } from '@openfeature/server-sdk';

// `provider` is the instance registered with OpenFeature.setProvider();
// `metrics` is your metrics client.
OpenFeature.addHandler(ProviderEvents.Ready, async () => {
  // Always resync on reconnect, not just on first init.
  // Assumes initialize() is idempotent and does not itself emit Ready, which would loop.
  await provider.initialize(OpenFeature.getContext());
  metrics.gauge('flag.stream.connected', 1);
  metrics.increment('flag.stream.resync');
});

OpenFeature.addHandler(ProviderEvents.Stale, () => {
  metrics.gauge('flag.stream.connected', 0);
});
If you use a distributed cache in front of the SDK, invalidate or refresh the relevant partition after resync to prevent the shared cache from serving a stale snapshot that the local rule set has already corrected.
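The cache invalidation step can be sketched as a prefix-based purge run right after the resync completes. This is a minimal sketch under stated assumptions: the `FlagCache` interface, the `partition:key` naming scheme, and `MapCache` are all hypothetical; a real deployment would wrap its own cache client and key layout.

```typescript
// Hypothetical cache interface; a real deployment would wrap its own cache client.
interface FlagCache {
  keys(): string[];
  delete(key: string): void;
}

// Deletes every cached entry in one partition (assumed key prefix "<partition>:")
// and returns how many entries were removed.
function invalidatePartition(cache: FlagCache, partition: string): number {
  const prefix = `${partition}:`;
  let removed = 0;
  for (const key of cache.keys()) {
    if (key.startsWith(prefix)) {
      cache.delete(key);
      removed++;
    }
  }
  return removed;
}

// In-memory stand-in used only to exercise the helper.
class MapCache implements FlagCache {
  private store = new Map<string, string>();
  set(key: string, value: string): void { this.store.set(key, value); }
  has(key: string): boolean { return this.store.has(key); }
  keys(): string[] { return Array.from(this.store.keys()); }
  delete(key: string): void { this.store.delete(key); }
}
```

Calling a helper like this from the Ready handler, after the resync has finished, keeps the shared cache from outliving the rule set it was computed from.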
Verification
Confirm the stream survives an idle period and recovers from a forced drop:
# 1. Confirm heartbeats are being emitted by the server
curl -N -H "Accept: text/event-stream" \
http://flagd.internal:8080/flags/stream 2>&1 | head -20
# expect lines like ": heartbeat" every 20 seconds
# 2. Simulate a silent drop by blocking server traffic for longer than the
#    watchdog timeout (2 x 20s = 40s); run as root on the SDK host
iptables -A INPUT -p tcp --sport 8080 -j DROP
sleep 60
iptables -D INPUT -p tcp --sport 8080 -j DROP
# 3. Watch the watchdog detect the dead stream and reconnect
watch -n 1 'curl -s http://localhost:9090/metrics | grep -E "flag_stream_(connected|dead_detected|resync)"'
# expect: dead_detected increments, then connected returns to 1, then resync increments
# 4. Verify flags are consistent after resync
curl -s http://localhost:3000/debug/flags/api.search.semantic-rerank | jq .
# expect: { "variant": <current-variant>, "reason": "TARGETING_MATCH" or "DEFAULT" }
Gotchas & Edge Cases
- Stacked proxies (nginx + app proxy): if your service runs behind both nginx and an application-level proxy (e.g. an Envoy sidecar), each layer enforces its own idle timeout and its own buffering independently. Set every idle timeout above the heartbeat interval and disable response buffering at every layer; missing one is enough to kill the stream silently.
- HTTP/2 multiplexing: when the provider uses HTTP/2, a single TCP connection carries multiple streams. The TCP-level keepalive and the HTTP/2 PING frame operate independently from SSE heartbeats. Confirm which layer your provider’s SSE implementation sits on before tuning.
- Serverless and ephemeral runtimes: streaming connections are impractical in runtimes that freeze or recycle the process between requests (AWS Lambda, Cloud Run). Fall back to polling for those environments as documented in polling vs streaming.
Troubleshooting & FAQ
The stream drops every 60 seconds precisely — what is closing it?
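A 60-second timeout on an intermediary is the most likely cause; 60s is the default for both nginx proxy_read_timeout and the ALB idle timeout. Check nginx proxy_read_timeout, the ALB idle timeout, and any Envoy route idle_timeout in order. The drop interval reveals the timeout value of the culprit. If heartbeats are already configured at 20s, a 60s idle timeout should never fire, so also confirm that the heartbeats actually reach that hop, since buffering upstream of it can hold them back. Otherwise raise the timeout to heartbeat_interval + margin.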
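Matching the drop interval to a configured timeout can be done from logged disconnect timestamps. The following is a minimal sketch; the function names, the tolerance, and the timeout table are hypothetical and would need to reflect your own hops.

```typescript
// Hypothetical helper: median gap, in seconds, between consecutive disconnect timestamps (ms).
function medianDropIntervalSec(disconnectsMs: number[]): number | undefined {
  if (disconnectsMs.length < 2) return undefined;
  const sorted = disconnectsMs.slice().sort((a, b) => a - b);
  const gaps: number[] = [];
  for (let i = 1; i < sorted.length; i++) {
    gaps.push((sorted[i] - sorted[i - 1]) / 1000);
  }
  gaps.sort((a, b) => a - b);
  const mid = Math.floor(gaps.length / 2);
  return gaps.length % 2 === 1 ? gaps[mid] : (gaps[mid - 1] + gaps[mid]) / 2;
}

// Names of hops whose configured timeout is within toleranceSec of the observed interval.
function suspectHops(
  intervalSec: number,
  timeoutsSec: Record<string, number>,
  toleranceSec = 2,
): string[] {
  return Object.keys(timeoutsSec).filter(
    (name) => Math.abs(timeoutsSec[name] - intervalSec) <= toleranceSec,
  );
}
```

The median is used rather than the mean so that one unusually long or short gap, for example from a deploy, does not skew the estimate.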
Heartbeats are emitted but the stream still drops
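A heartbeat comment (": heartbeat") is bytes on the wire like any other line, so an intermediary that actually forwards it resets its idle timer. If the stream still drops, the heartbeat is most likely not reaching that hop: check for response buffering or compression holding the bytes back, and confirm with curl -N through each layer in turn. Switching from comment-style heartbeats to a named SSE event (event: heartbeat\ndata: {}\n\n) is still worthwhile, because EventSource-based clients discard comments but can listen for a named event, which gives the client-side watchdog a reliable signal.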
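The two heartbeat styles differ only in how the frame is written. The following is a minimal sketch of both formats; the function names are hypothetical, and the frame layout follows the SSE wire format (fields terminated by a newline, frames terminated by a blank line).

```typescript
// Formats a named SSE event frame; multi-line data becomes one "data:" line per line.
function formatSseEvent(eventName: string, data: string): string {
  const dataLines = data
    .split("\n")
    .map((line) => `data: ${line}`)
    .join("\n");
  return `event: ${eventName}\n${dataLines}\n\n`;
}

// Formats an SSE comment frame, which EventSource clients silently ignore.
function formatSseComment(text: string): string {
  return `: ${text}\n\n`;
}

const heartbeatFrame = formatSseEvent("heartbeat", "{}");
const commentFrame = formatSseComment("heartbeat");
```

On the client, a named frame like this can be observed with the standard EventSource addEventListener method using the event name "heartbeat".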
After reconnect, some flags still return stale variants
The most likely cause is that the resync ran but the distributed cache was not invalidated. The SDK's local rule set is fresh, but requests are hitting a stale shared-cache entry with a longer TTL. Invalidate the relevant cache keys after resync, or reduce the shared cache TTL to less than your heartbeat interval.