
NATS Account Connection Concentration: What It Means and How to Fix It

Severity: Info
Category: Saturation
Applies to: Balance
Check ID: OPT_BALANCE_006
Detection threshold: > 70% of account connections on one server (minimum 3 servers, minimum 10 connections)

Account connection concentration means more than 70% of a single account’s client connections are on one server in a cluster with at least 3 servers and at least 10 connections. This creates a single point of failure for that account — if the server goes down, the account loses the vast majority of its connectivity in one event rather than a proportional fraction.
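The threshold logic can be sketched in a few lines. This is a hypothetical helper illustrating the rule above, not the actual Insights implementation:

```python
def is_concentrated(conns_per_server, threshold=0.70,
                    min_servers=3, min_conns=10):
    """conns_per_server: dict mapping server name -> this account's
    connection count on that server (include zero-count servers so the
    cluster-size minimum is counted correctly)."""
    total = sum(conns_per_server.values())
    if len(conns_per_server) < min_servers or total < min_conns:
        return False  # below the minimums, the check does not fire
    return max(conns_per_server.values()) / total > threshold

# 12 of 14 connections (~86%) on nats-1 across 3 servers -> flagged
print(is_concentrated({"nats-1": 12, "nats-2": 1, "nats-3": 1}))  # True
```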

Why this matters

NATS clusters are designed so that losing any single server degrades capacity proportionally. If you have 3 servers and lose one, you lose roughly one-third of your connection capacity. But when 70%+ of an account’s connections are on one server, that account experiences a near-total outage when that specific server goes down — even though the cluster as a whole lost only one-third of its capacity. The account’s services disconnect en masse, triggering a reconnection storm as clients scramble to establish connections on the surviving servers.

The reconnection storm itself is the second problem. When hundreds of clients from the same account reconnect simultaneously, they re-subscribe to all their subjects, re-authenticate, and potentially trigger JetStream consumer redeliveries. The surviving servers absorb this spike on top of their existing load. If the account’s workload is substantial, this burst can temporarily overload the remaining servers, causing slow consumer events or connection rejections that affect other accounts sharing the cluster.

Multi-tenant clusters amplify the risk. If one account’s connection concentration is unknown to the operations team, capacity planning assumes even distribution. Load tests pass because they model balanced connections. Then a single server failure takes out a major tenant, and the blast radius surprises everyone.

Common causes

  • Hardcoded server address in application configuration. The application connects to a single server URL (e.g., nats://nats-1:4222) rather than listing all cluster members. Every instance of that application connects to the same server.

  • DNS resolution returning a single address. The NATS connection URL uses a hostname that resolves to one IP. Even if multiple servers exist behind different IPs, the DNS record only returns one, funneling all connections to it.

  • Infrastructure affinity. The account’s services run in the same availability zone, Kubernetes node, or network segment as one NATS server. Network proximity or routing rules cause all connections to prefer the closest server.

  • Sticky load balancer sessions. A load balancer in front of the NATS cluster uses session persistence (sticky sessions), routing all connections from the same source IP or client to the same backend server.

  • Small account with few services. An account with only 10-15 connections from a handful of services naturally concentrates if those services happen to connect to the same server. The concentration is real but may be acceptable at this scale.

How to diagnose

Check per-account connection distribution

# Account-level stats including per-server connection counts
curl -s http://localhost:8222/accstatz | jq '.account_statz[] | {account: .acc, conns: .conns}'

Query this endpoint on each server in the cluster to build a per-account, per-server connection matrix. If one server holds more than 70% of an account’s total connections, the account is concentrated.
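A small script can assemble that matrix once you have fetched each server's response. This sketch assumes the accstatz shape shown above (an `account_statz` array with `acc` and `conns` fields); the server names and counts are illustrative:

```python
def connection_matrix(responses):
    """responses: dict mapping server name -> parsed /accstatz JSON body."""
    matrix = {}  # account -> {server: connection count}
    for server, body in responses.items():
        for entry in body.get("account_statz", []):
            matrix.setdefault(entry["acc"], {})[server] = entry["conns"]
    return matrix

def max_share(per_server):
    """Largest fraction of the account's connections held by one server."""
    total = sum(per_server.values())
    return max(per_server.values()) / total if total else 0.0

# Example responses from three servers for a single account
responses = {
    "nats-1": {"account_statz": [{"acc": "ORDERS", "conns": 9}]},
    "nats-2": {"account_statz": [{"acc": "ORDERS", "conns": 2}]},
    "nats-3": {"account_statz": [{"acc": "ORDERS", "conns": 1}]},
}
m = connection_matrix(responses)
print(f"{max_share(m['ORDERS']):.0%}")  # 75% -> above the 70% threshold
```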

Use the NATS CLI for server-level breakdown

nats server report connections --account <account_name>

This shows all connections for the specified account, grouped by server. Count the connections per server to confirm the imbalance.

Check overall account connection counts

nats server report accounts

This provides an aggregate view of accounts across the cluster, including connection counts. Identify accounts with significant connection counts, then drill into their per-server distribution.

Verify client connection URLs

Check the application configuration for the concentrated account. If clients specify a single server URL instead of the full cluster list, that’s the cause:

# In server logs, look for connections from the account
# and check which server they connected to
grep "cid:" /var/log/nats/nats-server.log | grep "<account_name>"

Test DNS resolution

If the application uses a DNS name for the NATS connection:

dig +short nats.example.com

If this returns a single IP instead of all cluster server IPs, DNS is the bottleneck.
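You can also check resolution the way a client library would, via the system resolver. A minimal sketch, using `localhost` as a runnable stand-in for your actual NATS hostname:

```python
import socket

def resolved_ips(host, port=4222):
    """Return the distinct IP addresses the resolver hands back for host."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

# Substitute your cluster's hostname, e.g. resolved_ips("nats.example.com")
ips = resolved_ips("localhost")
print(ips)  # a single entry means every client lands on one address
```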

How to fix it

Immediate: drain the concentrated server

Force clients to redistribute by draining the overloaded server:

nats-server --signal ldm=<pid> # or send SIGUSR2 to the server process; both put it into lame-duck mode

This gracefully disconnects all clients, which then reconnect to other servers using their connection URL list. If their connection URL only lists one server, they’ll reconnect to the same one — so this only works if the underlying configuration is also fixed.

Short-term: fix client connection configuration

Ensure every client in the account lists all cluster servers:

// Go — list all servers in the cluster
nc, err := nats.Connect(
    "nats://s1:4222,nats://s2:4222,nats://s3:4222",
    nats.Name("order-service"),
    nats.UserInfo("account_user", "password"),
)
# Python (nats.py) — multiple servers with randomization
nc = await nats.connect(
    servers=["nats://s1:4222", "nats://s2:4222", "nats://s3:4222"],
    name="order-service",
    user="account_user",
    password="password",
)

NATS client libraries randomize the server list by default. Once all clients list all servers, connections naturally distribute across the cluster on the next deployment or restart cycle.
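The effect of that randomization is easy to see in a simulation. A sketch assuming each client picks uniformly at random from a full three-server list (the URLs are the placeholders used above):

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the illustration is deterministic
servers = ["nats://s1:4222", "nats://s2:4222", "nats://s3:4222"]

# 300 clients, each choosing one server from the randomized list
picks = Counter(random.choice(servers) for _ in range(300))
largest_share = max(picks.values()) / 300

print(picks)
print(f"largest share: {largest_share:.0%}")  # roughly a third per server
```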

Short-term: fix DNS to return all servers

If clients use DNS-based discovery, update the DNS record to return all server IPs:

nats.example.com. 300 IN A 10.0.1.10
nats.example.com. 300 IN A 10.0.1.11
nats.example.com. 300 IN A 10.0.1.12

Use a short TTL (300 seconds) so clients pick up changes quickly. Most NATS client libraries resolve DNS at connect time and randomize the results.

Long-term: set per-account connection limits

Use NATS account configuration to limit how many connections a single account can establish. While this doesn’t directly distribute connections, it prevents any single account from monopolizing a server:

# Set account connection limit via nsc
nsc edit account <account_name> --conns 100

Combine with monitoring to track per-account distribution as an ongoing metric.

Build connection distribution into your operational runbook. During quarterly reviews, check that no account has more than 50% of its connections on any single server.

Frequently asked questions

How is account connection concentration different from a connection hotspot?

A connection hotspot (OPT_BALANCE_002) looks at total connections per server — one server has more connections than others regardless of which account they belong to. Account connection concentration (this check) looks at a specific account’s connections — most of that account’s clients are on one server. You can have account concentration without a server hotspot if the account is small relative to total cluster traffic.

Does NATS route messages to the account regardless of where its clients connect?

Yes. NATS propagates subscription interest across all cluster servers via route connections. Messages reach subscribers regardless of which server they’re connected to. The concentration problem isn’t about message delivery — it’s about resilience. If the server goes down, the account loses most of its clients at once, causing a service disruption far worse than losing a proportional fraction.

Should I worry about concentration for accounts with only a few connections?

The check requires at least 10 connections before it fires. Below that, concentration is expected — if you have 3 connections, they can’t distribute evenly across 3 servers by definition. Focus on accounts with meaningful connection counts (50+) where concentration creates real blast radius risk.

Can I use NATS leafnodes to distribute account connections?

Yes. Leafnodes can spread an account’s connection points across multiple hub servers. Each leafnode connects to a different hub server, and clients behind the leafnode access the cluster through that connection. This naturally distributes the account’s footprint — but adds operational complexity. It’s most useful when the account’s clients are geographically distributed and leafnodes serve double duty as regional access points.

What happens if the concentrated server restarts during normal maintenance?

All the account’s clients disconnect and reconnect. If they have proper multi-server connection URLs, they’ll reconnect to other servers and the concentration may actually improve after the restart. If they have single-server URLs, they’ll queue up waiting for the restarting server, causing an extended outage for that account. This is why fixing connection URLs is the most important remediation step.

Proactive monitoring for NATS account connection concentration with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial