NATS Subscription Hotspot: What It Means and How to Fix It

Severity: Info
Category: Saturation
Applies to: Balance
Check ID: OPT_BALANCE_003
Detection threshold: Subscriptions > 2× cluster average

A subscription hotspot occurs when one server in a NATS cluster carries more than double the cluster average number of subscriptions. This imbalance concentrates subject-matching CPU work and memory overhead on a single node, creating a bottleneck that limits cluster-wide throughput.

Why this matters

Every message published to a NATS cluster requires the server to match the subject against its subscription interest table. The more subscriptions a server holds, the more work it does per publish — even with NATS’s highly optimized subject trie. When subscriptions concentrate on one server, that server spends disproportionate CPU time on matching, while other servers sit underutilized. The bottleneck isn’t theoretical: at high message rates, a server with 50,000 subscriptions performs measurably more work than one with 5,000, and publish latency on the hot server increases accordingly.

The problem compounds with wildcard subscriptions. A single events.> subscription on a hot server matches every subject under events., which means the server evaluates that match for every publish to any subject below events., at any depth. If the hot server also holds the most connections, it becomes the chokepoint for both subscription matching and message delivery. Other servers in the cluster have spare capacity that goes unused.
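To make the wildcard semantics concrete, here is a minimal sketch of token-based matching, where * matches exactly one token and > matches one or more trailing tokens. The function name subject_matches is hypothetical. This is a naive linear matcher for illustration only, not the server's subject trie.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style subscription pattern matches a subject.

    '*' matches exactly one token; '>' matches one or more trailing tokens
    and is only treated as a wildcard when it is the last pattern token.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be last and needs at least one remaining subject token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # subject ran out of tokens
        if p != "*" and p != s_tokens[i]:
            return False  # literal token mismatch
    # Without a trailing '>', both sides must have the same number of tokens.
    return len(p_tokens) == len(s_tokens)


print(subject_matches("events.>", "events.orders.created"))  # True
print(subject_matches("events.*", "events.orders.created"))  # False
```

The second call returns False because * covers a single token, which is why a > subscription attracts far more traffic than a * subscription at the same prefix.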

Subscription hotspots also affect cluster resilience. If the overloaded server goes down, all those subscriptions must reestablish on remaining servers — potentially overloading them during the reconnection storm. What was a performance imbalance during steady state becomes a cascading failure during a disruption.

Common causes

  • Client connection configuration lists servers in a fixed order. When server-list randomization is disabled, or every client is given only one URL, clients connect to the first reachable server in the list. If every client uses the same ordered list (e.g., nats://s1,s2,s3), the majority connect to s1 and bring their subscriptions with them. Without randomization, the first server absorbs the bulk of the subscription load.

  • Wildcard subscribers concentrated on one node. A monitoring or analytics service subscribing to > or *.> patterns runs on a single host that happens to connect to one server. That one wildcard subscription generates enormous fan-in on that server for every subject in the system.

  • Microservice deployments scaled unevenly. One service runs 20 replicas, each with 50 subscriptions, and all replicas land on the same server due to infrastructure affinity (same availability zone, same Kubernetes node, same DNS resolution).

  • Queue group subscribers not distributed. Queue groups balance message delivery, but if all members of the queue group connect to the same server, the subscription interest is still concentrated. The server must track each group member’s subscription individually.

  • Leafnode hub funneling subscriptions. A leafnode connection propagates all remote subscriptions to the hub server. If a single leafnode connects a large edge deployment with thousands of subscriptions, the hub server it connects to becomes a subscription hotspot.
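The first cause in the list can be illustrated with a small simulation. This is a sketch under stated assumptions: every server is reachable, each client connects to the first entry of its (optionally shuffled) list, and all clients create the same number of subscriptions. The names pick_server and simulate are hypothetical, and no NATS client is involved.

```python
import random
from collections import Counter


def pick_server(servers, rng=None):
    """Return the server a client would try first.

    With rng=None the configured order is kept (randomization disabled);
    otherwise a shuffled copy of the list is used.
    """
    candidates = list(servers)
    if rng is not None:
        rng.shuffle(candidates)
    return candidates[0]


def simulate(servers, clients, subs_per_client, randomize, seed=0):
    """Count subscriptions per server for `clients` identical clients."""
    rng = random.Random(seed) if randomize else None
    totals = Counter({s: 0 for s in servers})
    for _ in range(clients):
        totals[pick_server(servers, rng)] += subs_per_client
    return dict(totals)


servers = ["s1", "s2", "s3"]
fixed = simulate(servers, clients=300, subs_per_client=50, randomize=False)
shuffled = simulate(servers, clients=300, subs_per_client=50, randomize=True)
print(fixed)     # {'s1': 15000, 's2': 0, 's3': 0}
print(shuffled)  # roughly even split across s1, s2, s3
```

With a fixed order, every subscription lands on s1. With shuffling, each server ends up near one third of the total, which stays well under the 2× threshold.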

How to diagnose

Check subscription counts per server

Terminal window
nats server report connections --sort subs

This shows total subscription count per server, sorted highest first. Compare the top server against the cluster average — if it’s more than 2× the mean, you have a hotspot.
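The comparison against the cluster average can be sketched in a few lines. The function name find_hotspots and the sample counts are illustrative. The sketch assumes the average is the plain mean over all servers, including the hot one, which may differ from how the check itself computes it.

```python
from statistics import mean


def find_hotspots(subs_per_server, factor=2.0):
    """Return servers whose subscription count exceeds factor x the mean."""
    if not subs_per_server:
        return {}
    avg = mean(subs_per_server.values())
    return {
        name: count
        for name, count in subs_per_server.items()
        if avg > 0 and count > factor * avg
    }


# Illustrative counts, not real measurements.
counts = {"s1": 42000, "s2": 6000, "s3": 5000, "s4": 7000}
print(find_hotspots(counts))  # {'s1': 42000}
```

Here the mean is 15,000, so only servers above 30,000 are flagged. Under this assumption a two-server cluster can never cross a 2× threshold, because one server cannot hold more than the cluster total.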

Get detailed subscription routing information

Terminal window
# Per-server subscription stats
nats server request subscriptions --help
# Direct monitoring endpoint
curl -s 'http://localhost:8222/subsz?subs=1' | jq '.num_subscriptions'

The /subsz endpoint returns the subscription count and optionally the full subscription list. Compare across all servers to confirm the imbalance.

Identify which clients contribute the most subscriptions

Terminal window
nats server report connections --sort subs --account <account_name>

This breaks down per-client subscription counts. Look for clients with unusually high subscription counts or many clients from the same application clustered on one server.

Check for wildcard subscription concentration

Terminal window
curl -s 'http://localhost:8222/subsz?subs=1' | jq -r '.subscriptions_list[].subject' | grep -E '[*>]'

Wildcard subscriptions (containing > or *) on the hot server are prime suspects. A single > subscription matches everything and generates maximum fan-in load.
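The same filtering can be done on an already-downloaded /subsz?subs=1 response body. This is a minimal sketch that assumes each entry in subscriptions_list carries a subject field. The function name wildcard_subjects and the sample payload are made up for illustration.

```python
def wildcard_subjects(subsz_payload):
    """Return the wildcard subjects found in a /subsz?subs=1 response body."""
    found = []
    for entry in subsz_payload.get("subscriptions_list") or []:
        subject = entry.get("subject", "")
        tokens = subject.split(".")
        # '*' counts as a wildcard only as a whole token, '>' only as the last token.
        if "*" in tokens or tokens[-1] == ">":
            found.append(subject)
    return found


# Hand-written sample shaped like a parsed JSON response (assumed fields).
sample = {
    "num_subscriptions": 3,
    "subscriptions_list": [
        {"subject": "events.>", "cid": 7},
        {"subject": "orders.us.ny.*", "cid": 9},
        {"subject": "billing.invoices", "cid": 9},
    ],
}
print(wildcard_subjects(sample))  # ['events.>', 'orders.us.ny.*']
```

Running this against the payload from each server shows whether the broad patterns sit on the hot server.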

Verify client connection distribution

Terminal window
nats server list

Compare connection counts across servers. If connections are also skewed, the subscription hotspot is likely a side effect of a connection hotspot (see OPT_BALANCE_002).

How to fix it

Immediate: redistribute existing connections

Force clients to reconnect with balanced distribution by performing a rolling restart or drain of the hot server:

Terminal window
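# Drain the overloaded server; clients reconnect to other servers.
# Run this on the host of the hot server. The signal targets a process ID, not a server name.
nats-server --signal ldm=<pid> # lame-duck mode, equivalent to sending SIGUSR2 to the server process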

Draining gracefully migrates connections (and their subscriptions) to other cluster members. This is a temporary fix — clients may re-concentrate on reconnect if the underlying cause isn’t addressed.

Short-term: fix client connection configuration

Ensure all clients list every server in the cluster and enable randomization:

// Go client: list all servers; the server list is randomized by default
nc, err := nats.Connect(
    "nats://s1:4222,nats://s2:4222,nats://s3:4222",
    // nats.DontRandomize(), // REMOVE this option if present; it disables the default randomization
)
if err != nil {
    log.Fatal(err)
}

# Python (nats.py): list all servers
nc = await nats.connect(
    servers=["nats://s1:4222", "nats://s2:4222", "nats://s3:4222"],
)
# Randomization is enabled by default in nats.py

If you’re using DNS-based discovery, ensure the DNS record returns all server IPs and that the client library randomizes the resolved addresses.

Short-term: reduce per-client subscription count

Applications that create many fine-grained subscriptions can often consolidate them with wildcards at the application level:

// Instead of 1,000 individual subscriptions:
// nc.Subscribe("orders.us.ny.12345", handler)
// nc.Subscribe("orders.us.ny.12346", handler)
// ...

// Use a wildcard and filter in the handler.
// extractOrderID, shouldProcess, and process are application-defined helpers.
nc.Subscribe("orders.us.ny.*", func(msg *nats.Msg) {
    orderID := extractOrderID(msg.Subject)
    if shouldProcess(orderID) {
        process(msg)
    }
})

Fewer subscriptions per client means less concentration impact when clients aren’t perfectly distributed.
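The handler-side filtering relies on helpers the example leaves undefined. One possible shape for them, sketched in Python and assuming the order ID is always the last subject token, is shown here. The names extract_order_id and make_filter are hypothetical.

```python
def extract_order_id(subject: str) -> str:
    """Return the last token of a subject such as 'orders.us.ny.12345'."""
    return subject.rsplit(".", 1)[-1]


def make_filter(wanted_ids):
    """Build a predicate that accepts only the order IDs this client handles."""
    wanted = set(wanted_ids)

    def should_process(order_id: str) -> bool:
        return order_id in wanted

    return should_process


should_process = make_filter(["12345", "12346"])
print(should_process(extract_order_id("orders.us.ny.12345")))  # True
print(should_process(extract_order_id("orders.us.ny.99999")))  # False
```

The trade-off is that the client now receives and discards messages it does not want, so this fits best when most messages under the wildcard are relevant to that client.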

Long-term: implement connection-aware deployment

Design your deployment pipeline to distribute clients across servers deliberately:

  • Kubernetes: Use pod anti-affinity rules to spread service replicas across nodes, and configure each replica to prefer different NATS servers via environment-specific connection URLs.
  • DNS round-robin: Use a DNS record that returns NATS server addresses in random order. Most NATS client libraries also randomize the server list on connect.
  • Load balancer (with caveats): A TCP load balancer in front of the cluster distributes initial connections, but be aware that it adds a network hop and complicates TLS verification. NATS’s built-in cluster gossip and client randomization are usually sufficient.

Monitor subscription distribution as a standard operational metric. Alert when any server exceeds 1.5× the cluster average to catch imbalances before they become hotspots.

Frequently asked questions

How do subscription hotspots differ from connection hotspots?

A connection hotspot (OPT_BALANCE_002) means one server has disproportionately many client connections. A subscription hotspot means one server has disproportionately many subscriptions. They often co-occur — more connections typically means more subscriptions — but they can diverge. A server with few connections that each create hundreds of subscriptions (e.g., a monitoring service subscribing to events.> for every account) can be a subscription hotspot without being a connection hotspot.

Does NATS propagate subscriptions across cluster servers?

Yes. When a client subscribes on one server, that interest is propagated to all servers in the cluster via route connections. However, the server that holds the actual client connection does the final delivery and tracking work. The hotspot server bears the cost of maintaining the subscription state, matching incoming messages, and writing to client buffers — work that doesn’t transfer to other servers just because interest is propagated.

Can subscription hotspots cause slow consumer disconnections?

Indirectly, yes. A server spending excessive CPU on subscription matching may deliver messages to clients more slowly. If the delivery pipeline backs up, the server’s per-client outbound buffer fills, and the client gets disconnected as a slow consumer (SERVER_004). The root cause is the subscription imbalance, but the symptom appears as slow consumer events on the hot server.

Should I use a load balancer to distribute NATS connections?

Generally, no. NATS client libraries have built-in server randomization and cluster discovery that handle distribution without external infrastructure. A TCP load balancer adds latency, complicates TLS, and can mask server identity from clients. The better approach is to list all cluster server URLs in your client configuration and let the client library randomize. Use a load balancer only if you have specific network topology constraints that prevent direct client-to-server connectivity.
