A connection hotspot occurs when one server in a NATS cluster handles significantly more client connections than its peers — more than double the cluster average — creating an uneven resource load across the cluster.
Every client connection consumes server resources. Each connection requires memory for its read/write buffers, CPU time for subscription matching and message routing, and file descriptors at the operating system level. At moderate connection counts, these costs are negligible. At hundreds or thousands of connections concentrated on a single server while its peers sit comparatively idle, the imbalance becomes a reliability concern.
The hotspot server becomes the weakest link in the cluster. It’s the first server to hit memory pressure, the first to experience elevated message latency, and the first to trigger slow consumer disconnections under load. When that server goes down — for maintenance, a crash, or a resource-constrained OOM kill — the blast radius is proportional to its connection share. If one server holds 60% of the cluster’s connections, losing that server forces 60% of all clients to reconnect simultaneously, creating a thundering herd on the remaining nodes.
Connection imbalance also undermines the purpose of clustering. A NATS cluster distributes load across multiple servers for both throughput and resilience. When connections concentrate on one node, you’re effectively running a single-server deployment with extra infrastructure costs. The underutilized servers aren’t contributing proportionally to throughput, and the cluster’s total capacity is limited by the hotspot server’s ceiling rather than the aggregate capacity of all nodes.
Client configuration listing a single server URL. If clients connect to nats://server1:4222 instead of nats://server1:4222,server2:4222,server3:4222, all connections go to one server. NATS client libraries support multi-URL connection strings and will distribute connections across listed servers, but only if given the option.
DNS round-robin not distributing evenly. DNS-based load distribution depends on client resolver behavior. Some resolvers cache the first result, some respect TTL inconsistently, and some operating systems prefer IPv4 over IPv6 entries. The result: uneven distribution despite a correctly configured DNS record.
Load balancer with sticky sessions. Placing a TCP load balancer with session affinity (sticky sessions) in front of NATS defeats the client library’s built-in distribution. Once a client is pinned to a server, it stays there — even if that server is already overloaded. NATS doesn’t need an external load balancer; the client libraries handle distribution natively.
Post-failover reconnection clustering. When a server goes down, its clients reconnect to the remaining servers. When the original server recovers, clients don’t automatically migrate back — they stay connected to whatever server accepted their reconnect. Over multiple failure events, connections accumulate on the most stable server.
Unequal server visibility. If some servers are not reachable from all clients — due to network segmentation, firewall rules, or misconfigured connection URLs — clients are forced onto the reachable subset, creating natural hotspots.
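One way to see the post-failover drift described above from the client side is to log which server each client actually lands on over time. Below is a minimal Go sketch using the nats.go client; the server URLs and the client name are placeholders:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// List every server in the cluster; these URLs are placeholders.
	nc, err := nats.Connect(
		"nats://server1:4222,nats://server2:4222,nats://server3:4222",
		nats.Name("hotspot-probe"), // hypothetical client name
		nats.MaxReconnects(-1),
		nats.ReconnectWait(2*time.Second),
		// Log the server this client lands on after every reconnect, so
		// connection drift toward a single node shows up in the logs.
		nats.ReconnectHandler(func(nc *nats.Conn) {
			log.Printf("reconnected to %s", nc.ConnectedUrl())
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	log.Printf("initially connected to %s", nc.ConnectedUrl())
	select {} // keep running so reconnect events can be observed
}
```

Aggregating these log lines across a fleet makes it easy to spot reconnects accumulating on one node long before the imbalance shows up as a resource problem.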
```bash
# Connection count per server
nats server list
```

Compare the Connections column across servers. The check triggers when any server has more than 2x the cluster average.
```bash
# Detailed connection report sorted by connection count
nats server report connections
```

This shows which clients are connected to which servers, along with their names, message rates, and pending bytes. Look for patterns — are all connections from a specific service on one server?
```bash
# Check connection details on the hot server
curl -s "http://localhost:8222/connz?limit=0" | jq -c '.connections[] | {name, ip, lang, version}' | sort | uniq -c | sort -rn
```

Group connections by client name, IP range, or language to identify which applications or deployment groups are contributing to the hotspot.
The connection details from /connz include the client’s reported server list. If most clients show a single server URL, the fix is client configuration:
```bash
curl -s "http://localhost:8222/connz?limit=0" | jq '.connections[] | {name, ip, start}'
```

Clients that all connected around the same time may have been deployed or restarted together, all resolving the same DNS entry.
The single most effective fix is ensuring every client lists all cluster servers in its connection URL:
```go
// Go client — list all cluster servers
nc, err := nats.Connect(
    "nats://server1:4222,nats://server2:4222,nats://server3:4222",
    nats.Name("order-processor"),
    nats.MaxReconnects(-1),
    nats.ReconnectWait(2*time.Second),
)
```

```python
# Python — list all cluster servers
import nats

nc = await nats.connect(
    servers=[
        "nats://server1:4222",
        "nats://server2:4222",
        "nats://server3:4222",
    ],
    name="order-processor",
    max_reconnect_attempts=-1,
)
```

NATS client libraries randomize the server list by default, so clients naturally distribute across the listed servers on connect and reconnect.
Remove sticky load balancers. If a TCP load balancer sits in front of your NATS cluster, remove the session affinity configuration — or remove the load balancer entirely. NATS client libraries handle connection distribution better than external load balancers because they understand the cluster topology through the server’s INFO protocol message.
Why not DNS round-robin or a load balancer? The classic remediation for connection hotspots in HTTP services is to put a TCP load balancer or DNS round-robin record in front of the fleet. With NATS, both work as a fallback, but they are strictly worse than letting the client library handle distribution. A TCP load balancer adds a network hop, complicates TLS verification, and — if it does anything cleverer than round-robin — frequently re-introduces the hotspot via session affinity. DNS round-robin gives no client-side awareness of which servers are healthy or which are local. Cluster gossip via the INFO protocol gives every client live, accurate topology and randomizes connections across all known servers, including ones added after the client started. Use a load balancer only when client configuration is genuinely outside your control (static IoT devices, third-party clients) — and even then, configure it to be stateless round-robin without sticky sessions.
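To confirm that gossip is actually reaching your clients, the Go client exposes the server lists it has learned. A small sketch, assuming the nats.go client and a placeholder seed URL:

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a single seed server; the rest of the cluster should be
	// learned through the INFO gossip after connecting.
	nc, err := nats.Connect("nats://server1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Servers() returns configured plus gossiped URLs;
	// DiscoveredServers() returns only the gossiped ones.
	fmt.Println("known servers:     ", nc.Servers())
	fmt.Println("discovered servers:", nc.DiscoveredServers())
}
```

If the discovered list is empty on a multi-node cluster, gossip may be suppressed (for example via no_advertise) or the advertised addresses may not be reachable from the client's network.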
Rolling restart of affected clients. Once client connection strings are updated to list all servers, a rolling restart of the client application distributes connections across the cluster. The client library’s built-in randomization handles the distribution:
```bash
# Verify the new distribution after client restarts
nats server list
```

Monitor connection counts as clients restart. The distribution should converge toward even as more clients reconnect with the updated configuration.
Use the NATS cluster discovery protocol. NATS servers gossip cluster membership to clients through the INFO protocol message. Even if a client connects to a single server initially, it learns about all other servers and can use them for reconnection. Ensure your server configuration doesn’t disable this:
```
# Server config — cluster block enables gossip
cluster {
  name: "my-cluster"
  listen: "0.0.0.0:6222"
  routes: [
    "nats-route://server1:6222"
    "nats-route://server2:6222"
    "nats-route://server3:6222"
  ]
}
```

Monitor connection balance as a cluster health metric. Track the ratio of max-to-average connections per server. Alert when any server exceeds 2x the average. Synadia Insights automates this check and surfaces connection hotspots across your entire deployment, including per-account concentration analysis (OPT_BALANCE_006).
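To run the same max-to-average check yourself, a rough sketch that polls each server's /varz monitoring endpoint could look like the following; the hostnames and the default monitoring port 8222 are assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// varz holds the single field we need from each server's /varz endpoint.
type varz struct {
	Connections int `json:"connections"`
}

func main() {
	// Monitoring URLs are assumptions; adjust hosts and ports for your cluster.
	servers := []string{
		"http://server1:8222/varz",
		"http://server2:8222/varz",
		"http://server3:8222/varz",
	}

	totalConns, maxConns := 0, 0
	for _, url := range servers {
		resp, err := http.Get(url)
		if err != nil {
			log.Fatalf("fetch %s: %v", url, err)
		}
		var v varz
		if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
			log.Fatalf("decode %s: %v", url, err)
		}
		resp.Body.Close()

		fmt.Printf("%-32s %d connections\n", url, v.Connections)
		totalConns += v.Connections
		if v.Connections > maxConns {
			maxConns = v.Connections
		}
	}

	avg := float64(totalConns) / float64(len(servers))
	if avg == 0 {
		log.Fatal("no connections reported")
	}
	fmt.Printf("max=%d avg=%.1f ratio=%.2f\n", maxConns, avg, float64(maxConns)/avg)
	if float64(maxConns) > 2*avg {
		fmt.Println("ALERT: connection hotspot (max > 2x cluster average)")
	}
}
```

Run it on a schedule and feed the ALERT line into whatever alerting pipeline you already use.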
Standardize connection configuration. Provide a shared configuration template or environment variable for all services connecting to NATS. This prevents individual teams from hardcoding single-server URLs:
```bash
# Environment variable with all cluster servers
export NATS_URL="nats://server1:4222,nats://server2:4222,nats://server3:4222"
```

Do NATS client libraries load-balance connections automatically? Yes — if you provide multiple server URLs. All official NATS client libraries randomize the server list by default and select a random server for the initial connection. On reconnect, they try servers in random order. This built-in behavior provides effective load distribution without any external load balancer, but only if the client is configured with more than one server URL.
Will connections rebalance automatically after a failed server recovers? Not immediately. Existing connections to healthy servers remain in place. Only new connections and reconnections (from disconnected clients) will distribute across all servers, including the recovered node. To force rebalancing, you can perform a rolling restart of client applications or, in some cases, use server-side connection draining to redistribute clients.
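The server-side draining mentioned above is usually done with lame duck mode (for example, nats-server --signal ldm), which stops the server from accepting new connections and gradually closes existing ones so clients reconnect elsewhere. Clients can react to it explicitly; a minimal Go sketch with placeholder URLs:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(
		// Placeholder URLs; list every server in the cluster.
		"nats://server1:4222,nats://server2:4222,nats://server3:4222",
		nats.MaxReconnects(-1),
		nats.ReconnectWait(2*time.Second),
		// Called when the connected server enters lame duck mode; the client
		// library will pick another server from its known list on reconnect.
		nats.LameDuckModeHandler(func(nc *nats.Conn) {
			log.Printf("server %s entering lame duck mode", nc.ConnectedUrl())
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	select {} // application logic would go here
}
```

Because lame duck mode spreads disconnections over time, clients redistribute across the remaining servers without the thundering herd that an abrupt shutdown causes.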
Should I put a load balancer in front of my NATS cluster? Generally, no. NATS client libraries provide better distribution than TCP load balancers because they understand the cluster topology. Load balancers add latency, a single point of failure, and often interfere with NATS protocol features like cluster discovery. The exception is environments where clients can’t be configured with multiple URLs (e.g., IoT devices with static configuration) — in that case, a TCP load balancer without sticky sessions is acceptable.
How many connections can a single NATS server handle? There’s no universal limit — it depends on server hardware, message rates, and subscription complexity. NATS servers routinely handle tens of thousands of connections. The check isn’t about absolute connection count; it’s about relative imbalance. A three-node cluster with 3,000, 3,100, and 2,900 connections is balanced. A cluster with 6,000, 1,500, and 1,500 connections is a hotspot, even though no individual server is necessarily overloaded.
Does a high connection count hurt message delivery performance? Each connection adds a small amount of overhead for subscription interest tracking and message matching. At high connection counts (10,000+), the subscription matching overhead becomes measurable — more connections with overlapping subscriptions means more work per published message. Connection hotspots amplify this effect by concentrating the subscription matching load on one server instead of distributing it across the cluster.