
NATS Connection Churn High: What It Means and How to Fix It

Severity: Warning
Category: Errors
Applies to: Cluster
Check ID: CLUSTER_006
Detection threshold: connection delta exceeds 500 per collection epoch

High connection churn means the cluster’s total connection count changed significantly between collection epochs — either a burst of new connections, a sudden mass disconnect, or rapid cycling of clients connecting and disconnecting in tight loops. The default threshold flags an absolute delta of more than 500 connections per epoch in either direction. Both spikes and sudden drops indicate that something is disrupting normal client lifecycle, whether that is reconnect loops, authentication failures, slow consumer evictions, an outage in a downstream client population, or a network event taking out a wave of clients at once.

Why this matters

Every new NATS connection has a cost. The server performs a TCP handshake, negotiates TLS (if configured), authenticates the client, processes subscription registrations, and updates the internal subscription interest graph. In a healthy deployment, this cost is amortized over long-lived connections — clients connect once and stay connected for hours, days, or weeks. When connections churn rapidly, this setup cost is paid repeatedly, consuming CPU cycles and memory allocations that should be serving messages.

The log noise is a second-order problem. Each connect and disconnect generates log entries. At 500+ connections per epoch, the server log becomes a wall of connection events, burying the meaningful operational signals (slow consumers, route failures, JetStream errors) in noise. Operators searching logs during an incident must filter through thousands of connection entries to find what matters. In severe cases, the log volume itself becomes a performance issue if logs are written synchronously or shipped to a centralized system.

Connection churn also destabilizes the subscription interest graph. Every time a client connects and subscribes, the server propagates that interest to all cluster peers via routes. When the client disconnects seconds later, the server propagates the removal. Hundreds of clients churning simultaneously create a constant stream of subscription interest updates flowing between cluster members, consuming internal bandwidth that should carry application messages. This can measurably increase message delivery latency for stable, well-behaved clients sharing the same cluster.

Common causes

  • Misconfigured client reconnect backoff. The default NATS client reconnect uses randomized jitter, but some applications override this with zero or near-zero backoff, creating a tight reconnect loop. When a client disconnects and immediately reconnects without delay, the server pays the full connection setup cost repeatedly.

  • Authentication failures causing immediate disconnect after connect. A client with invalid credentials connects, gets rejected, and immediately retries. Without exponential backoff, this creates a tight loop of failed connection attempts. Each attempt consumes server resources even though it fails. Common when credentials are rotated but not all clients are updated.

  • Slow consumer evictions triggering reconnect loops. When the server disconnects a slow consumer, the client library’s auto-reconnect kicks in and establishes a new connection. If the underlying throughput mismatch isn’t fixed, the client gets evicted again, reconnects again, and the cycle repeats. Each eviction-reconnect cycle counts as connection churn.

  • Load balancer health checks creating ephemeral connections. Kubernetes liveness/readiness probes, load balancer health checks, or monitoring tools that open a TCP connection to the NATS port, verify it’s listening, and immediately close. At high probe frequency across many targets, this generates significant churn.

  • Network instability causing mass reconnects. A network blip disconnects many clients simultaneously. They all reconnect at once, creating a spike in new connections. If the network remains unstable, the cycle repeats. This pattern shows up as periodic churn spikes correlated with network events.

  • Short-lived processes creating new connections per request. Serverless functions, batch jobs, or request-scoped microservices that create a new NATS connection for each invocation instead of reusing a persistent connection. Each function invocation = one connect + one disconnect.

How to diagnose

Confirm the churn rate

Check the connection delta between observation intervals:

nats server list

Compare the Conns column across repeated observations. A rapidly changing count indicates churn. For more precise measurement:

# Check total lifetime connections vs current
curl -s http://<server-host>:8222/varz | jq '{connections, total_connections}'

The total_connections field is a monotonically increasing counter. A large delta over a short period confirms high churn.
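To turn that counter into a concrete rate, sample it twice and subtract; a minimal sketch, assuming jq is installed and that a 60-second window and the <server-host> placeholder suit your environment:

# Sample total_connections twice and print the delta (a rough connections-per-minute figure)
A=$(curl -s http://<server-host>:8222/varz | jq '.total_connections')
sleep 60
B=$(curl -s http://<server-host>:8222/varz | jq '.total_connections')
echo "new connections in the last 60s: $((B - A))"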

Identify which clients are churning

Examine recently closed connections:

curl -s "http://<server-host>:8222/connz?state=closed&limit=50&sort=stop" | jq '.connections[] | {cid, name, reason, stop, rtt, ip}'

Key fields:

  • name — Client name (if set). Patterns here reveal which application is churning.
  • reason — Why the connection closed. Common reasons: Slow Consumer, Authentication Failure, Client Closed, Stale Connection.
  • stop — When the connection ended. Rapid-fire timestamps indicate a single client reconnecting repeatedly.
  • ip — Source IP. Many connections from the same IP suggest a single host with a reconnect loop.
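Rather than scanning entries one by one, you can aggregate the closed-connection list to surface the top offenders; a sketch assuming jq is available and that a limit of 1024 captures enough recent history:

# Count recently closed connections per source IP (likely reconnect-loop hosts first)
curl -s "http://<server-host>:8222/connz?state=closed&limit=1024" | \
  jq '[.connections[].ip] | group_by(.) | map({ip: .[0], count: length}) | sort_by(-.count) | .[:10]'

# The same aggregation by close reason
curl -s "http://<server-host>:8222/connz?state=closed&limit=1024" | \
  jq '[.connections[].reason] | group_by(.) | map({reason: .[0], count: length}) | sort_by(-.count)'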

Watch connection events in real time

# Requires system account credentials
nats sub '$SYS.ACCOUNT.*.CONNECT'
nats sub '$SYS.ACCOUNT.*.DISCONNECT'

This streams connection and disconnection events as they happen. Look for patterns: the same client name appearing repeatedly, the same IP cycling connections, or bursts of disconnections followed by bursts of connections.

Check for authentication failures

# Search server logs for auth failures
journalctl -u nats-server --since "30 minutes ago" | grep -i "auth"

Repeated Authorization Violation entries from the same source confirm an auth-driven churn loop.
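If you want to see whether those failures trace back to one or a few hosts, a rough per-IP count helps; this is a sketch that assumes client IPv4 addresses appear in the log lines (the regex is a heuristic):

# Count Authorization Violation entries per client IP
journalctl -u nats-server --since "30 minutes ago" | grep -i "authorization violation" | \
  grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort | uniq -c | sort -rn | head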

Correlate with slow consumer events

curl -s "http://<server-host>:8222/connz?state=closed&sort=stop&limit=50" | jq '.connections[] | select(.reason | test("Slow"))'

If slow consumer counts are also elevated, the churn is likely driven by slow consumer evictions triggering reconnects.

How to fix it

Immediate: identify whether connections spiked or dropped

A dramatic connection count change was detected. The first decision is whether you’re looking at a connection spike (clients pouring in) or a connection drop (clients leaving en masse) — the diagnosis path differs:

  • Spike — investigate client-side reconnect storms, deployment rollouts that created many short-lived connections, misconfigured reconnect backoff (default is randomized jitter, not zero), authentication failures triggering immediate retries, slow consumer evictions creating reconnect loops, or load balancer health checks opening ephemeral connections.
  • Drop — investigate why a population of clients went away. Common causes are an outage in a downstream service that disconnected its NATS clients, a network event between the cluster and a client subnet, a credential rotation that revoked a batch of users, or a deployment that drained pods without bringing new ones up.
  • Both — clients churning in tight loops produce both: high disconnect and high reconnect counts in the same epoch. Pair connz?state=closed with the live connz to confirm (see the example below).

Review client connection URLs and reconnect settings, then proceed to the diagnose-and-stabilize steps below.
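One way to confirm the tight-loop case is to check whether addresses that recently closed connections are already connected again; a minimal sketch, assuming bash process substitution and jq:

# IPs seen among recently closed connections that are also currently connected
# (clients cycling in a tight loop appear in both lists)
comm -12 \
  <(curl -s "http://<server-host>:8222/connz?state=closed&limit=1024" | jq -r '.connections[].ip' | sort -u) \
  <(curl -s "http://<server-host>:8222/connz?limit=1024" | jq -r '.connections[].ip' | sort -u)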

If a specific client or IP is generating the bulk of churn, block it temporarily while you fix the root cause:

# Server config: limit connections per account to stop runaway clients
accounts {
  APP {
    users = [{user: app, password: pass}]
    max_connections: 100
  }
}

If authentication failures are the cause, fix the credentials. If the old credentials can’t be updated immediately, temporarily allow the old credentials alongside the new ones to stop the churn while you plan a coordinated rollover.

Short-term: fix client behavior

Add exponential backoff to reconnection logic. Most NATS client libraries support configurable reconnect behavior. Ensure clients use jitter and backoff to prevent thundering herd reconnects:

// Go client — configure reconnect with backoff
nc, err := nats.Connect(url,
    nats.MaxReconnects(-1),              // Reconnect forever
    nats.ReconnectWait(2*time.Second),   // Base wait time
    nats.CustomReconnectDelay(func(attempts int) time.Duration {
        // Exponential backoff with jitter, max 60s
        delay := time.Duration(math.Min(
            float64(time.Second)*math.Pow(2, float64(attempts)),
            float64(60*time.Second),
        ))
        jitter := time.Duration(rand.Int63n(int64(time.Second)))
        return delay + jitter
    }),
)

# Python (nats.py) — reconnect options
nc = await nats.connect(
    servers=["nats://s1:4222", "nats://s2:4222"],
    max_reconnect_attempts=-1,
    reconnect_time_wait=2,  # seconds between attempts
)

Use persistent connections. For serverless or request-scoped workloads, maintain a connection pool or singleton connection that persists across invocations:

// Instead of creating a new connection per request:
func handleRequest() {
    nc, _ := nats.Connect(url) // BAD: new connection every time
    defer nc.Close()
    nc.Publish("events", data)
}

// Reuse a long-lived connection:
var nc *nats.Conn

func init() {
    nc, _ = nats.Connect(url) // Connection created once
}

func handleRequest() {
    nc.Publish("events", data) // Reuses existing connection
}

Fix health check probes. Configure probes to use the NATS monitoring port (8222) instead of the client port (4222), or use the /healthz endpoint, which doesn’t create a client connection:

# Kubernetes: use HTTP health check instead of TCP
livenessProbe:
  httpGet:
    path: /healthz
    port: 8222
  periodSeconds: 10

Long-term: design for stability

Set client names on all connections. When every client sets a unique, identifiable name at connect time, diagnosing churn becomes trivial — you can see exactly which application and instance is churning:

nc, _ := nats.Connect(url, nats.Name("order-service-pod-abc123"))

Monitor connection churn as a first-class metric. Export total_connections from /varz and alert on its rate of change.
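If no exporter is in place yet, even a small polling loop gives you a number to alert on; this is a sketch, and the 60-second interval, output format, and <server-host> placeholder are assumptions to adapt:

# Emit a per-minute connection rate derived from the monotonic total_connections counter
PREV=$(curl -s http://<server-host>:8222/varz | jq '.total_connections')
while sleep 60; do
  CUR=$(curl -s http://<server-host>:8222/varz | jq '.total_connections')
  echo "nats_connections_per_minute $((CUR - PREV))"
  PREV=$CUR
done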

Implement connection lifecycle logging in clients. Add logging to your NATS client’s connect, disconnect, and reconnect callbacks so you can correlate churn with application-level events:

nc, _ := nats.Connect(url,
    nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
        log.Printf("NATS disconnected: %v", err)
    }),
    nats.ReconnectHandler(func(nc *nats.Conn) {
        log.Printf("NATS reconnected to %s", nc.ConnectedUrl())
    }),
)

Frequently asked questions

What’s a normal connection rate for a NATS server?

In a stable deployment, the connection rate should be near zero — clients connect and stay connected. The only expected new connections come from application deployments (rolling restarts), new service instances scaling up, and occasional client reconnects after transient network issues. The default threshold of 500 connections per epoch is deliberately high to avoid false positives; if you’re hitting it, something is actively churning.

Does connection churn affect message delivery for other clients?

Yes. Each connection event triggers subscription interest propagation across cluster routes. High churn creates a constant stream of internal subscription updates between cluster members. This competes with application message forwarding for internal bandwidth and can increase message delivery latency for stable clients. The effect is proportional to the churn rate and the number of subscriptions per churning client.

How do I distinguish between a deployment rollout and pathological churn?

A deployment rollout creates a brief spike of disconnections (old pods) followed by connections (new pods), typically lasting seconds to minutes. Pathological churn is sustained — the connection rate remains elevated long after any deployment completes. Check the time pattern: if the high churn started at a deployment time and lasted less than 10 minutes, it’s probably normal. If it’s ongoing, investigate.

Can rate limiting on the server prevent churn from consuming resources?

NATS doesn’t have a built-in per-IP connection rate limit, but you can limit the total connections per account with max_connections in the account configuration. This caps the damage from a single account’s misbehaving clients. For IP-level rate limiting, use a firewall or network policy to throttle connection attempts to the NATS port.
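As a sketch of the firewall approach, the iptables recent module can throttle new connection attempts per source IP; the numbers here (20 new connections per 60 seconds) are arbitrary assumptions to tune:

# Drop sources opening more than 20 new connections to port 4222 within 60 seconds
iptables -A INPUT -p tcp --dport 4222 -m conntrack --ctstate NEW -m recent --name nats --set
iptables -A INPUT -p tcp --dport 4222 -m conntrack --ctstate NEW -m recent --name nats --update --seconds 60 --hitcount 20 -j DROP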

Should I increase the churn threshold if my deployment is large?

If you have a legitimately large deployment with frequent, expected connection events (e.g., thousands of short-lived batch jobs), you can adjust the threshold. But first verify that the churn is actually expected and not a symptom of a fixable problem. Most organizations that think they need a higher threshold actually need to fix their connection lifecycle — switching to persistent connections or connection pooling eliminates the churn entirely.

Proactive monitoring for high NATS connection churn with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial