Checks/LEAF_003

NATS Leafnode Subscription Count High: Preventing Hub Connection Timeouts

Severity
Warning
Category
Saturation
Applies to
Leafnode
Check ID
LEAF_003
Detection threshold
leafnode connection carries a subscription count large enough that processing it at connect time risks exceeding the hub's stale connection timeout

A leafnode subscription count high alert fires when a leafnode connection is carrying a large number of subscriptions. When a leafnode connects (or reconnects) to the hub, it sends its entire subscription list. If that list is large enough that the hub takes longer than 2 seconds to process it, the hub marks the connection as stale and drops it. The leafnode then reconnects, sends the same large subscription list, and the cycle repeats — creating a connection loop that never stabilizes.

Why this matters

Leafnode connections are the backbone of NATS multi-cluster and edge architectures. They bridge remote clusters, edge locations, and isolated environments back to the hub. When a leafnode connection can’t stabilize, the entire remote site loses connectivity to the rest of the NATS infrastructure.

The 2-second stale connection timeout is a hard boundary. The NATS server has an internal stale connection timeout (default 2 seconds) that kills connections that haven’t completed their initial setup in time. When a leafnode sends tens or hundreds of thousands of subscriptions during connection establishment, the hub must process each one — creating internal routing table entries, propagating interest to other cluster members, and updating its subscription cache. If this processing exceeds 2 seconds, the connection is dropped.

The failure mode is a silent loop. The leafnode reconnects automatically (as it should), sends the same subscription list, gets dropped again, and repeats. From the leafnode side, you see constant reconnection attempts. From the hub side, you see a stream of stale connection warnings. Neither side logs an obvious “your subscription count is too high” error — you have to connect the dots yourself.

All clients behind the leafnode are affected. A leafnode typically serves dozens or hundreds of local clients. When the leafnode connection to the hub is unstable, every one of those clients loses the ability to communicate with the broader NATS infrastructure. Messages to subjects on the hub side go undelivered. Request-reply patterns time out. JetStream consumers on the hub cannot receive acknowledgments from edge consumers.

The problem tends to grow over time. Applications add new subscriptions as features are developed. Each new microservice behind a leafnode adds its subscriptions to the leafnode’s aggregate count. What worked at launch with 1,000 subscriptions may break a year later at 50,000.

Common causes

  • Wildcard subscriptions propagating excessive interest. A leafnode configured with broad subject mappings (e.g., > or events.>) propagates every unique subscription from local clients to the hub. If 200 local clients each subscribe to 100 specific subjects, the leafnode sends 20,000 subscriptions to the hub at connect time.

  • Many microservices behind a single leafnode. Edge or branch deployments that run dozens of microservices locally, each with multiple subscriptions, accumulate a large aggregate subscription count on the single leafnode connection to the hub.

  • Dynamic subscription patterns. Applications that create subscriptions dynamically — per-session reply subjects, per-request inboxes, per-entity watch subjects — inflate the subscription count rapidly. Each active session or pending request adds one or more subscriptions (see the sketch after this list).

  • Missing explicit exports/imports. Without account-level export/import configuration, the leafnode propagates all local subscriptions upstream. Explicit exports and imports act as a filter, sending only the subscriptions the hub actually needs to know about.

  • JetStream consumers adding subscription overhead. Each JetStream push consumer behind a leafnode creates a deliver subject subscription that propagates to the hub. A deployment with hundreds of push consumers adds hundreds of subscriptions to the leafnode connection.
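
To make the dynamic-subscription cause concrete, here is a minimal Go sketch of the anti-pattern. The subject layout and Session type are hypothetical, not taken from any particular application:

package main

import (
    "fmt"

    "github.com/nats-io/nats.go"
)

type Session struct{ ID string }

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        panic(err)
    }
    defer nc.Drain()

    // Hypothetical anti-pattern: one subscription per active session.
    // With 10,000 live sessions this alone adds 10,000 entries to the
    // list the leafnode replays to the hub on every (re)connect.
    sessions := []Session{{ID: "a1"}, {ID: "b2"} /* ... */}
    for _, s := range sessions {
        _, err := nc.Subscribe("sessions."+s.ID+".events", func(msg *nats.Msg) {
            fmt.Println("session event:", msg.Subject)
        })
        if err != nil {
            panic(err)
        }
    }

    select {} // block while sessions run
}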

How to diagnose

Check leafnode subscription counts

Terminal window
# List leafnode connections with subscription details
nats server report connections --sort subs --type leaf

Look for leafnode connections with subscription counts in the tens of thousands. The exact threshold where problems occur depends on hub server performance, but counts above 20,000–50,000 are in the danger zone.

Check for stale connection warnings on the hub

Terminal window
# Search hub server logs for stale connection events
grep -i "stale connection" /var/log/nats/nats-server.log
# Look for rapid reconnect patterns
grep -i "leafnode connection\|leaf remote" /var/log/nats/nats-server.log | tail -50

A pattern of repeated connect/disconnect events from the same leafnode, spaced 2–5 seconds apart, is the telltale sign of a subscription-count-induced connection loop.

Inspect what’s being subscribed

On the leafnode server, check the local subscription state:

Terminal window
# List all subscriptions on the leafnode server
nats server report connections --sort subs
# Check total subscriptions
nats server report accounts
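
If you prefer the HTTP monitoring API, here is a hedged Go sketch that reads the leafnode server's /subsz endpoint and prints the server-wide subscription count (assumes monitoring is enabled, e.g. with -m 8222):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    resp, err := http.Get("http://localhost:8222/subsz")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // /subsz reports the server-wide subscription count; this total is
    // an upper bound on what the leafnode replays to the hub.
    var subsz struct {
        NumSubs int `json:"num_subscriptions"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&subsz); err != nil {
        panic(err)
    }
    fmt.Println("total subscriptions:", subsz.NumSubs)
}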

Monitor programmatically

Go

// Poll the hub's /leafz monitoring endpoint and flag leafnode
// connections whose subscription counts approach the danger zone.
package main

import (
    "encoding/json"
    "fmt"
    "io"
    "net/http"
)

type Leafz struct {
    Leafs []LeafInfo `json:"leafs"`
}

type LeafInfo struct {
    Name    string `json:"name"`
    Account string `json:"account"`
    NumSubs int    `json:"subscriptions"`
    IP      string `json:"ip"`
    Port    int    `json:"port"`
    RTT     string `json:"rtt"`
    InMsgs  int64  `json:"in_msgs"`
    OutMsgs int64  `json:"out_msgs"`
}

func main() {
    resp, err := http.Get("http://localhost:8222/leafz?subs=true")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }

    var leafz Leafz
    if err := json.Unmarshal(body, &leafz); err != nil {
        panic(err)
    }

    for _, leaf := range leafz.Leafs {
        status := "OK"
        if leaf.NumSubs > 20000 {
            status = "WARNING"
        }
        if leaf.NumSubs > 50000 {
            status = "CRITICAL"
        }
        fmt.Printf("[%s] Leaf %s (account: %s): %d subs, RTT: %s\n",
            status, leaf.Name, leaf.Account, leaf.NumSubs, leaf.RTT)
    }
}
Python

import asyncio
import aiohttp

async def check_leaf_subs():
    async with aiohttp.ClientSession() as session:
        async with session.get("http://localhost:8222/leafz?subs=true") as resp:
            data = await resp.json()

    for leaf in data.get("leafs", []):
        num_subs = leaf.get("subscriptions", 0)
        name = leaf.get("name", "unknown")
        status = "OK"
        if num_subs > 20000:
            status = "WARNING"
        if num_subs > 50000:
            status = "CRITICAL"
        print(f"[{status}] Leaf {name}: {num_subs} subs, RTT: {leaf.get('rtt', 'N/A')}")

asyncio.run(check_leaf_subs())
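
Both scripts read the hub's /leafz monitoring endpoint, which requires the server to be started with an HTTP monitoring port (e.g., -m 8222). The 20,000 and 50,000 thresholds are not server constants; they mirror the danger zone described above and should be tuned to your hub's observed capacity.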

How to fix it

Immediate: break the reconnection loop

Increase the stale connection timeout on the hub. This buys time for the hub to process the large subscription list without dropping the connection. This is a server-side configuration change:

# nats-server.conf (hub)
leafnodes {
  port: 7422
  # Increase stale timeout to handle large sub lists
  # Note: This is a workaround — reduce subs long-term
}

Note: The stale connection timeout is not directly configurable in all NATS server versions. If it’s not tunable in your version, focus on reducing subscription count instead.

Short-term: reduce subscription propagation

Use explicit exports and imports. Instead of propagating all subscriptions across the leafnode, define exactly which subjects should cross the boundary:

# Hub server config
accounts {
  EDGE {
    exports: [
      { service: "api.>" }
      { stream: "events.>" }
      { stream: "telemetry.>" }
    ]
  }
  HUB {
    imports: [
      { stream: { account: EDGE, subject: "telemetry.>" } }
    ]
  }
}

This filters the subscription list to only the subjects that need to cross the leafnode boundary, potentially reducing thousands of subscriptions to dozens.

Consolidate subscriptions with wildcards. Instead of subscribing to thousands of specific subjects, use wildcard subscriptions and filter in the application:

// Before: 10,000 specific subscriptions (bad for leafnode)
for _, customerId := range customers {
    nc.Subscribe("orders."+customerId+".created", handler)
}

// After: one wildcard subscription (good for leafnode)
nc.Subscribe("orders.*.created", func(msg *nats.Msg) {
    // Extract customer ID from subject and route internally
    tokens := strings.Split(msg.Subject, ".")
    customerId := tokens[1]
    routeToHandler(customerId, msg)
})
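
One trade-off to note: the hub now forwards every message matching orders.*.created to the leafnode, including messages for customers no local client handles. Consolidation pays off when local clients collectively consume most of that traffic anyway; otherwise it trades subscription count for bandwidth.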

Switch push consumers to pull consumers. Pull consumers don’t create deliver subject subscriptions on the leafnode. They fetch messages on demand, eliminating the subscription overhead:

Terminal window
# Convert a push consumer to pull
nats consumer add ORDERS pull-processor \
--pull \
--filter "orders.>" \
--ack explicit
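
On the application side, here is a minimal Go sketch of the pull-consumer fetch loop, assuming the nats.go client and the ORDERS stream and pull-processor consumer from the CLI example above:

package main

import (
    "fmt"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        panic(err)
    }
    defer nc.Drain()

    js, err := nc.JetStream()
    if err != nil {
        panic(err)
    }

    // Bind to the pull consumer created above; no deliver-subject
    // subscription is propagated across the leafnode connection.
    sub, err := js.PullSubscribe("orders.>", "pull-processor", nats.BindStream("ORDERS"))
    if err != nil {
        panic(err)
    }

    for {
        msgs, err := sub.Fetch(10, nats.MaxWait(2*time.Second))
        if err != nil {
            continue // a timeout with no pending messages is normal
        }
        for _, msg := range msgs {
            fmt.Println("processing", msg.Subject)
            msg.Ack()
        }
    }
}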

Long-term: architect for bounded subscription counts

Segment traffic across multiple leafnode connections. Instead of one leafnode connection carrying all subscriptions, use multiple leafnode connections with per-account isolation. Each connection carries only the subscriptions for its account:

# Leafnode server config
leafnodes {
  remotes [
    {
      url: "nats-leaf://hub:7422"
      account: "TELEMETRY"
    }
    {
      url: "nats-leaf://hub:7422"
      account: "ORDERS"
    }
  ]
}
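
Each remote binds a different local account, so the aggregate subscription list is split across connections and each connect-time handshake replays only that account's share.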

Monitor subscription growth as part of deployment reviews. Before deploying new services behind a leafnode, estimate the additional subscription count. Add it to your deployment checklist.

Set up alerts with Synadia Insights. Insights monitors leafnode subscription counts automatically and alerts when they approach dangerous thresholds, giving you time to act before the connection loop starts.

Frequently asked questions

What subscription count causes problems?

There’s no single number — it depends on hub server CPU speed, network latency, and what else the hub is doing during connection establishment. In practice, problems typically start appearing at 20,000–50,000 subscriptions on a single leafnode connection. Some deployments hit the timeout at lower counts if the hub is already under load.

Can I increase the hub’s processing speed instead of reducing subscriptions?

Running the hub on faster hardware helps but doesn’t solve the fundamental scaling issue. Subscription processing during connection establishment is largely single-threaded per connection. A faster CPU buys you a higher threshold but doesn’t eliminate it. Reducing subscription count is the sustainable fix.

Do queue group subscriptions count differently?

Each queue group subscription counts as one subscription from the leafnode’s perspective, regardless of how many local clients are in the queue group. If you have 100 clients in the same queue group, the leafnode sends one subscription (with the queue group name) rather than 100. This makes queue groups an effective way to reduce leafnode subscription count.
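
As an illustration, here is a minimal Go sketch (the subject and group names are hypothetical). Every worker process running this code joins the same queue group, yet the leafnode propagates just one subscription upstream:

package main

import (
    "fmt"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        panic(err)
    }
    defer nc.Drain()

    // All workers share the queue group "order-workers". From the hub's
    // perspective this surfaces as a single subscription on the leafnode
    // connection, no matter how many worker processes are running.
    _, err = nc.QueueSubscribe("orders.created", "order-workers", func(msg *nats.Msg) {
        fmt.Println("this worker handled:", msg.Subject)
    })
    if err != nil {
        panic(err)
    }

    select {} // keep the worker alive
}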

Will enabling leafnode compression help?

Compression (LEAF_001) reduces the bandwidth used by message payloads but doesn’t significantly help with the subscription processing timeout. The bottleneck during connection establishment is the hub’s processing time per subscription, not the time to transmit the subscription list over the wire. Compression helps with steady-state throughput, not connection setup.

How do I know if my leafnode is in a reconnection loop?

Check the leafnode server’s logs for rapid reconnection messages. You’ll see a pattern like: connect → subscribe → disconnect → reconnect, repeating every 2–5 seconds. On the hub side, you’ll see corresponding stale connection warnings. Monitoring the leafnode’s connection uptime (via /connz) will show very short connection durations.

Proactive monitoring for NATS leafnode subscription count high with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial