High connection churn means the cluster’s total connection count changed significantly between collection epochs — either a burst of new connections, a sudden mass disconnect, or rapid cycling of clients connecting and disconnecting in tight loops. The default threshold flags an absolute delta of more than 500 connections per epoch in either direction. Both spikes and sudden drops indicate that something is disrupting normal client lifecycle, whether that is reconnect loops, authentication failures, slow consumer evictions, an outage in a downstream client population, or a network event taking out a wave of clients at once.
Every new NATS connection has a cost. The server performs a TCP handshake, negotiates TLS (if configured), authenticates the client, processes subscription registrations, and updates the internal subscription interest graph. In a healthy deployment, this cost is amortized over long-lived connections — clients connect once and stay connected for hours, days, or weeks. When connections churn rapidly, this setup cost is paid repeatedly, consuming CPU cycles and memory allocations that should be serving messages.
The log noise is a second-order problem. Each connect and disconnect generates log entries. At 500+ connections per epoch, the server log becomes a wall of connection events, burying the meaningful operational signals (slow consumers, route failures, JetStream errors) in noise. Operators searching logs during an incident must filter through thousands of connection entries to find what matters. In severe cases, the log volume itself becomes a performance issue if logs are written synchronously or shipped to a centralized system.
Connection churn also destabilizes the subscription interest graph. Every time a client connects and subscribes, the server propagates that interest to all cluster peers via routes. When the client disconnects seconds later, the server propagates the removal. Hundreds of clients churning simultaneously create a constant stream of subscription interest updates flowing between cluster members, consuming internal bandwidth that should carry application messages. This can measurably increase message delivery latency for stable, well-behaved clients sharing the same cluster.
Misconfigured client reconnect backoff. The default NATS client reconnect uses randomized jitter, but some applications override this with zero or near-zero backoff, creating a tight reconnect loop (a sketch of this anti-pattern follows this list). When a client disconnects and immediately reconnects without delay, the server pays the full connection setup cost repeatedly.
Authentication failures causing immediate disconnect after connect. A client with invalid credentials connects, gets rejected, and immediately retries. Without exponential backoff, this creates a tight loop of failed connection attempts. Each attempt consumes server resources even though it fails. Common when credentials are rotated but not all clients are updated.
Slow consumer evictions triggering reconnect loops. When the server disconnects a slow consumer, the client library’s auto-reconnect kicks in and establishes a new connection. If the underlying throughput mismatch isn’t fixed, the client gets evicted again, reconnects again, and the cycle repeats. Each eviction-reconnect cycle counts as connection churn.
Load balancer health checks creating ephemeral connections. Kubernetes liveness/readiness probes, load balancer health checks, or monitoring tools that open a TCP connection to the NATS port, verify it’s listening, and immediately close. At high probe frequency across many targets, this generates significant churn.
Network instability causing mass reconnects. A network blip disconnects many clients simultaneously. They all reconnect at once, creating a spike in new connections. If the network remains unstable, the cycle repeats. This pattern shows up as periodic churn spikes correlated with network events.
Short-lived processes creating new connections per request. Serverless functions, batch jobs, or request-scoped microservices that create a new NATS connection for each invocation instead of reusing a persistent connection. Each function invocation = one connect + one disconnect.
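For reference, the reconnect misconfiguration from the first cause above looks like this in application code. This is a minimal sketch using the Go client's options (the URL is a placeholder); the correct backoff settings are shown in the remediation section below:

```go
package main

import "github.com/nats-io/nats.go"

func main() {
	// Anti-pattern: zero reconnect wait plus unlimited retries. Every
	// disconnect triggers an immediate reconnect attempt, so the server
	// pays the full connection setup cost in a tight loop.
	nc, err := nats.Connect("nats://localhost:4222",
		nats.ReconnectWait(0),  // no delay between attempts (the bug)
		nats.MaxReconnects(-1), // retry forever
	)
	if err != nil {
		panic(err)
	}
	defer nc.Close()
	select {} // keep the process alive so the reconnect behavior is observable
}
```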
Check the connection delta between observation intervals:
```
nats server list
```

Compare the Conns column across repeated observations. A rapidly changing count indicates churn. For more precise measurement:
```
# Check total lifetime connections vs current
curl -s http://<server-host>:8222/varz | jq '{connections, total_connections}'
```

The total_connections field is a monotonically increasing counter. A large delta over a short period confirms high churn.
Examine recently closed connections:
curl -s "http://<server-host>:8222/connz?state=closed&limit=50&sort=stop" | jq '.connections[] | {cid, name, reason, stop, rtt, ip}'Key fields:
The reason field is the most useful signal here; common values are Slow Consumer, Authentication Failure, Client Closed, and Stale Connection.

To watch connection events in real time, subscribe to the system event subjects:

```
nats sub '$SYS.ACCOUNT.*.CONNECT'
# disconnect events: $SYS.ACCOUNT.<acct>.DISCONNECT
```

This streams connection and disconnection events as they happen. Look for patterns: the same client name appearing repeatedly, the same IP cycling connections, or bursts of disconnections followed by bursts of connections.
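If you need to attribute the churn quickly, that same event stream can be aggregated programmatically. Here is a rough Go sketch, assuming system-account credentials (the URL is a placeholder) and the connect/disconnect event payload shape; verify the field names against your server version. It counts events per client name and host until interrupted:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"os/signal"
	"sync"

	"github.com/nats-io/nats.go"
)

// connEvent picks out just the fields we need from the server's
// connect/disconnect event payloads. The field names are assumptions
// based on the $SYS event schema; confirm them for your server version.
type connEvent struct {
	Client struct {
		Name string `json:"name"`
		Host string `json:"host"`
	} `json:"client"`
}

func main() {
	// Assumes a user with system-account access.
	nc, err := nats.Connect("nats://sys-user:sys-pass@localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	var mu sync.Mutex
	counts := map[string]int{}
	handler := func(m *nats.Msg) {
		var ev connEvent
		if json.Unmarshal(m.Data, &ev) != nil {
			return
		}
		mu.Lock()
		counts[ev.Client.Name+"@"+ev.Client.Host]++
		mu.Unlock()
	}
	nc.Subscribe("$SYS.ACCOUNT.*.CONNECT", handler)
	nc.Subscribe("$SYS.ACCOUNT.*.DISCONNECT", handler)

	// Run until interrupted, then print event totals per client.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig

	mu.Lock()
	defer mu.Unlock()
	for client, n := range counts {
		fmt.Printf("%6d  %s\n", n, client)
	}
}
```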
```
# Search server logs for auth failures
journalctl -u nats-server --since "30 minutes ago" | grep -i "auth"
```

Repeated Authorization Violation entries from the same source confirm an auth-driven churn loop.
To check whether slow consumer evictions are driving the reconnects, filter recently closed connections by reason:

```
curl -s 'http://localhost:8222/connz?state=closed&sort=stop&limit=50' | jq '.connections[] | select(.reason | test("Slow"))'
```

If slow consumer counts are also elevated, the churn is likely driven by slow consumer evictions triggering reconnects.
A dramatic connection count change was detected. The first decision is whether you’re looking at a connection spike (clients pouring in) or a connection drop (clients leaving en masse) — the diagnosis path differs:
Compare the recently closed connections in connz?state=closed with the live connz output to confirm which pattern you are seeing. Then review client connection URLs and reconnect settings, and proceed to the diagnose-and-stabilize steps below.
If a specific client or IP is generating the bulk of churn, block it temporarily while you fix the root cause:
```
# Server config: limit connections per account to stop runaway clients
accounts {
  APP {
    users = [{user: app, password: pass}]
    max_connections: 100
  }
}
```

If authentication failures are the cause, fix the credentials. If the old credentials can’t be updated immediately, temporarily allow the old credentials alongside the new ones to stop the churn while you plan a coordinated rollover.
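As a sketch of that transitional state (account, user names, and passwords here are placeholders), the account can accept both credential sets until every client has rolled over:

```
# Transitional config: both old and new credentials accepted
accounts {
  APP {
    users = [
      {user: app, password: old_pass},    # legacy clients
      {user: app_v2, password: new_pass}  # rolled-over clients
    ]
    max_connections: 100
  }
}
```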
Add exponential backoff to reconnection logic. Most NATS client libraries support configurable reconnect behavior. Ensure clients use jitter and backoff to prevent thundering herd reconnects:
```go
// Go client — configure reconnect with backoff
nc, err := nats.Connect(url,
	nats.MaxReconnects(-1),              // Reconnect forever
	nats.ReconnectWait(2 * time.Second), // Base wait time
	nats.CustomReconnectDelay(func(attempts int) time.Duration {
		// Exponential backoff with jitter, max 60s
		delay := time.Duration(math.Min(
			float64(time.Second)*math.Pow(2, float64(attempts)),
			float64(60*time.Second),
		))
		jitter := time.Duration(rand.Int63n(int64(time.Second)))
		return delay + jitter
	}),
)
```

```python
# Python (nats.py) — reconnect options
nc = await nats.connect(
    servers=["nats://s1:4222", "nats://s2:4222"],
    max_reconnect_attempts=-1,
    reconnect_time_wait=2,  # seconds between attempts
)
```

Use persistent connections. For serverless or request-scoped workloads, maintain a connection pool or singleton connection that persists across invocations:
```go
// Instead of creating a new connection per request:
func handleRequest() {
	nc, _ := nats.Connect(url) // BAD: new connection every time
	defer nc.Close()
	nc.Publish("events", data)
}

// Reuse a long-lived connection:
var nc *nats.Conn

func init() {
	nc, _ = nats.Connect(url) // Connection created once
}

func handleRequest() {
	nc.Publish("events", data) // Reuses existing connection
}
```

Fix health check probes. Configure probes to use the NATS monitoring port (8222) instead of the client port (4222), or use the /healthz endpoint, which doesn’t create a client connection:
```yaml
# Kubernetes: use HTTP health check instead of TCP
livenessProbe:
  httpGet:
    path: /healthz
    port: 8222
  periodSeconds: 10
```

Set client names on all connections. When every client sets a unique, identifiable name at connect time, diagnosing churn becomes trivial — you can see exactly which application and instance is churning:
```go
nc, _ := nats.Connect(url, nats.Name("order-service-pod-abc123"))
```

Monitor connection churn as a first-class metric. Export total_connections from /varz and alert on its rate of change.
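Here is a minimal sketch of such a monitor in Go, polling /varz and flagging any interval where the delta of total_connections exceeds this Check's default of 500 (the URL and polling interval are placeholders):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// varz holds the two counters we need from the /varz endpoint.
type varz struct {
	Connections      int    `json:"connections"`
	TotalConnections uint64 `json:"total_connections"`
}

func fetch(url string) (varz, error) {
	var v varz
	resp, err := http.Get(url)
	if err != nil {
		return v, err
	}
	defer resp.Body.Close()
	err = json.NewDecoder(resp.Body).Decode(&v)
	return v, err
}

func main() {
	const url = "http://localhost:8222/varz" // monitoring endpoint
	const threshold = 500                    // same default as this Check

	prev, err := fetch(url)
	if err != nil {
		log.Fatal(err)
	}
	for range time.Tick(time.Minute) {
		cur, err := fetch(url)
		if err != nil {
			log.Print(err)
			continue
		}
		// total_connections only increases, so the delta is the number
		// of new connections accepted during the interval.
		delta := cur.TotalConnections - prev.TotalConnections
		fmt.Printf("current=%d new_per_min=%d\n", cur.Connections, delta)
		if delta > threshold {
			fmt.Println("ALERT: high connection churn")
		}
		prev = cur
	}
}
```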
Implement connection lifecycle logging in clients. Add logging to your NATS client’s connect, disconnect, and reconnect callbacks so you can correlate churn with application-level events:
```go
nc, _ := nats.Connect(url,
	nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
		log.Printf("NATS disconnected: %v", err)
	}),
	nats.ReconnectHandler(func(nc *nats.Conn) {
		log.Printf("NATS reconnected to %s", nc.ConnectedUrl())
	}),
)
```

In a stable deployment, the connection rate should be near zero — clients connect and stay connected. The only expected new connections are application deployments (rolling restarts), new service instances scaling up, and occasional client reconnects from transient network issues. The default threshold of 500 connections per epoch is deliberately high to avoid false positives; if you’re hitting it, something is actively churning.
Yes. Each connection event triggers subscription interest propagation across cluster routes. High churn creates a constant stream of internal subscription updates between cluster members. This competes with application message forwarding for internal bandwidth and can increase message delivery latency for stable clients. The effect is proportional to the churn rate and the number of subscriptions per churning client.
A deployment rollout creates a brief spike of disconnections (old pods) followed by connections (new pods), typically lasting seconds to minutes. Pathological churn is sustained — the connection rate remains elevated long after any deployment completes. Check the time pattern: if the high churn started at a deployment time and lasted less than 10 minutes, it’s probably normal. If it’s ongoing, investigate.
NATS doesn’t have a built-in per-IP connection rate limit, but you can limit the total connections per account with max_connections in the account configuration. This caps the damage from a single account’s misbehaving clients. For IP-level rate limiting, use a firewall or network policy to throttle connection attempts to the NATS port.
If you have a legitimately large deployment with frequent, expected connection events (e.g., thousands of short-lived batch jobs), you can adjust the threshold. But first verify that the churn is actually expected and not a symptom of a fixable problem. Most organizations that think they need a higher threshold actually need to fix their connection lifecycle — switching to persistent connections or connection pooling eliminates the churn entirely.