
NATS Subscription Churn: What It Means and How to Fix It

Severity: Info
Category: Errors
Applies to: System Improvement
Check ID: OPT_SYS_012
Detection threshold: Subscription insert and remove operations exceed the configured threshold (default: 10,000) per collection epoch

Subscription churn occurs when a NATS server processes an excessive number of subscription inserts and removes within a single collection epoch. High churn rates — above 10,000 operations per epoch by default — indicate that clients are rapidly subscribing and unsubscribing, wasting CPU on interest graph updates and propagating unnecessary subscription changes to cluster routes and gateways.

Why this matters

NATS maintains a trie-based interest graph that maps subjects to active subscribers. Every SUB and UNSUB operation modifies this data structure — inserting or removing entries, updating reference counts, and invalidating cached matches. The cost of a single subscription operation is negligible, but at 10,000+ operations per epoch, the aggregate CPU spent on interest graph maintenance becomes measurable and directly competes with message routing.
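To make the per-operation cost concrete, here is a toy sketch of a token trie. This is illustrative only, not the server's actual sublist code (which also handles the * and > wildcards and a match cache): every SUB walks and mutates the tree, and every UNSUB repeats that walk to undo it.

// Toy subject trie, for illustration only (not NATS's sublist implementation).
package main

import (
	"fmt"
	"strings"
)

// node maps one subject token to its children; subs counts subscribers
// whose subject terminates at this node.
type node struct {
	children map[string]*node
	subs     int
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert walks token by token, allocating nodes as needed: the work a
// server does for every SUB.
func (n *node) insert(subject string) {
	for _, tok := range strings.Split(subject, ".") {
		child, ok := n.children[tok]
		if !ok {
			child = newNode()
			n.children[tok] = child
		}
		n = child
	}
	n.subs++
}

// remove walks the same path to decrement the count: the work done for
// every UNSUB. A real implementation also prunes empty nodes and
// invalidates cached matches.
func (n *node) remove(subject string) {
	for _, tok := range strings.Split(subject, ".") {
		child, ok := n.children[tok]
		if !ok {
			return
		}
		n = child
	}
	if n.subs > 0 {
		n.subs--
	}
}

func main() {
	root := newNode()
	root.insert("orders.us.east") // SUB: allocates nodes, bumps the count
	root.remove("orders.us.east") // UNSUB: walks again to undo it
	fmt.Println("one churn cycle complete")
}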

The cost extends beyond the local server. In clustered deployments, subscription interest is propagated to all peers over route connections. Each SUB and UNSUB generates a protocol message that every other server in the cluster must process and apply to its own interest graph. In a 5-server cluster, 10,000 subscription operations on one server generate 40,000 protocol messages across routes. Gateway connections amplify this further — subscription changes propagate across super-clusters when using interest-only gateway mode.

Subscription churn also degrades the subscription cache hit rate. NATS caches subject-to-subscriber matches to avoid re-evaluating the trie on every publish. When subscriptions change frequently, cached matches are invalidated, forcing the server to re-walk the trie for affected subjects. A healthy server maintains a cache hit rate above 95%. Sustained subscription churn can drop this below 80%, adding latency to every publish operation on subjects whose cached matches were invalidated.

Common causes

  • Per-request subscribe/unsubscribe pattern. Applications that create a temporary subscription for each incoming request — subscribe, wait for reply, unsubscribe — generate two subscription operations per request. At 5,000 requests/second, that’s 10,000 subscription operations per second. This is the most common cause and usually indicates the application isn’t using the built-in request-reply pattern (nc.Request()), which uses a multiplexed inbox subscription internally.

  • Reconnect storms re-subscribing all subscriptions. When a client reconnects to a NATS server, the client library re-sends all active subscriptions. If a client has 500 subscriptions and reconnects 20 times during a network disruption, that’s 10,000 subscription inserts on the server — plus the corresponding removes when the old connections are cleaned up. A fleet-wide network blip affecting hundreds of clients amplifies this to hundreds of thousands of operations.

  • Dynamic topic subscriptions in response to user activity. Applications that subscribe to user-specific subjects when a user logs in and unsubscribe when they log out create churn proportional to session turnover. A web application with 1,000 concurrent users and an average session length of 10 minutes generates ~100 subscribe/unsubscribe pairs per minute. During peak hours or deployment rollouts, this rate can spike dramatically.

  • Ephemeral JetStream consumers creating deliver subscriptions. Each ephemeral push consumer creates a subscription on its deliver subject when it binds and removes it when it’s garbage-collected. If application code creates and destroys ephemeral consumers frequently (e.g., one per batch job), each lifecycle generates subscription churn on the underlying NATS server.

  • Client library auto-unsubscribe behavior. Some client libraries support auto-unsubscribe after N messages (sub.AutoUnsubscribe(n)). Each auto-unsubscribe generates an UNSUB protocol message. If many subscriptions auto-unsubscribe simultaneously — for example, a batch of one-shot request listeners all completing at once — the churn spikes.

How to diagnose

Check the subscription statistics endpoint

Query the server’s subscription stats to see insert and remove counts:

curl -s http://localhost:8222/subsz | jq '{num_subscriptions: .num_subscriptions, num_inserts: .num_inserts, num_removes: .num_removes, num_cache: .num_cache, cache_hit_rate: .cache_hit_rate}'

Compare num_inserts and num_removes over time. If both values are increasing at thousands per second, the server has active subscription churn.
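To turn those counters into rates rather than eyeballing successive snapshots, a small sampler can poll /subsz twice and compute per-second deltas. A minimal sketch, assuming the monitoring endpoint is reachable at localhost:8222 (the 10-second interval is arbitrary):

// churnrate: sample /subsz twice and report insert/remove rates.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

type subsz struct {
	NumInserts uint64 `json:"num_inserts"`
	NumRemoves uint64 `json:"num_removes"`
}

func fetch() (subsz, error) {
	var s subsz
	resp, err := http.Get("http://localhost:8222/subsz")
	if err != nil {
		return s, err
	}
	defer resp.Body.Close()
	err = json.NewDecoder(resp.Body).Decode(&s)
	return s, err
}

func main() {
	const interval = 10 * time.Second
	before, err := fetch()
	if err != nil {
		log.Fatal(err)
	}
	time.Sleep(interval)
	after, err := fetch()
	if err != nil {
		log.Fatal(err)
	}
	secs := interval.Seconds()
	fmt.Printf("inserts/sec: %.1f  removes/sec: %.1f\n",
		float64(after.NumInserts-before.NumInserts)/secs,
		float64(after.NumRemoves-before.NumRemoves)/secs)
}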

Check the cache hit rate

A healthy subscription cache hit rate is above 95%:

curl -s http://localhost:8222/subsz | jq '.cache_hit_rate'

A cache hit rate below 80% combined with high insert/remove counts confirms that churn is degrading routing performance.

Identify churning connections

Find connections with high subscription counts or rapid subscription activity:

nats server report connections --sort subs

Connections that show fluctuating subscription counts across consecutive reports are the likely sources. Also check for connections with unusually high in_msgs relative to their subscription count — this pattern often indicates per-request subscribe/unsubscribe behavior.

Check connection churn correlation

Subscription churn often correlates with connection churn. If clients are reconnecting frequently, each reconnection replays all subscriptions:

nats server request connections --sort idle

Look for connections with very short idle times (seconds), indicating recent reconnections.

Monitor route subscription propagation

In clustered deployments, check how much subscription traffic is flowing over routes:

curl -s http://localhost:8222/routez | jq '.routes[] | {remote_id: .remote_id, in_msgs: .in_msgs, out_msgs: .out_msgs, subscriptions: .subscriptions}'

High in_msgs/out_msgs on route connections with relatively few active subscriptions suggests churn propagation.

How to fix it

Immediate: determine the diagnostic path

Excessive subscription insert and remove operations can stem from two distinct causes that require different remediation:

  1. Single client responsible — likely a misbehaving application that subscribes/unsubscribes in a loop. Identify it via connection name or IP using nats server report connections --sort subs, and fix the client code (see below).
  2. Many clients responsible — likely a reconnection storm. Clients reconnecting simultaneously re-subscribe all at once. Check for a preceding network event or server restart that triggered mass reconnection (see the “Short-term” section below).

Fix per-request subscription patterns

Use the built-in request-reply pattern. Most NATS client libraries implement request-reply using a single multiplexed inbox subscription (_INBOX.>) that handles all reply routing internally — no per-request subscribe/unsubscribe:

// Go — WRONG: manual per-request subscription
reply := nats.NewInbox()
sub, _ := nc.SubscribeSync(reply)
nc.PublishRequest("orders.validate", reply, orderData)
msg, _ := sub.NextMsg(5 * time.Second)
sub.Unsubscribe() // Churn!

// Go — RIGHT: built-in request (multiplexed inbox, no churn)
msg, err := nc.Request("orders.validate", orderData, 5*time.Second)

# Python — WRONG: manual subscription per request
reply = nc.new_inbox()
sub = await nc.subscribe(reply)
await nc.publish("orders.validate", order_data, reply=reply)
msg = await sub.next_msg(timeout=5)
await sub.unsubscribe() # Churn!

# Python — RIGHT: built-in request
msg = await nc.request("orders.validate", order_data, timeout=5)

Short-term: reduce reconnection-driven churn

Implement exponential backoff on reconnect. Prevent all clients from reconnecting simultaneously after a network disruption:

// Go — configure reconnect backoff with jitter
nc, err := nats.Connect(url,
    nats.MaxReconnects(-1),                             // Unlimited reconnects
    nats.ReconnectWait(2*time.Second),                  // Base wait between attempts
    nats.ReconnectJitter(1*time.Second, 5*time.Second), // Random jitter (non-TLS, TLS)
    // Optional: take full control of the delay. When set, this callback
    // takes precedence over ReconnectWait and ReconnectJitter.
    nats.CustomReconnectDelay(func(attempts int) time.Duration {
        // Exponential backoff capped at 30s (requires "math" and "time")
        return time.Duration(math.Min(math.Pow(2, float64(attempts)), 30)) * time.Second
    }),
)

Consolidate subscriptions. If a client subscribes to 100 specific subjects like orders.us.east.1, orders.us.east.2, etc., consider a single wildcard subscription orders.us.east.* with client-side filtering, as sketched below. Replacing 100 subscriptions with one cuts that client's reconnect replay traffic by a factor of 100.
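A minimal sketch of the consolidated pattern, assuming a hypothetical layout where the last subject token is a shard number and each instance owns a subset of shards:

// One wildcard subscription with client-side filtering, replacing one
// subscription per shard. A reconnect replays a single SUB, not dozens.
package main

import (
	"log"
	"strings"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Shards this instance owns (hypothetical assignment).
	myShards := map[string]bool{"1": true, "2": true}

	_, err = nc.Subscribe("orders.us.east.*", func(m *nats.Msg) {
		tokens := strings.Split(m.Subject, ".")
		shard := tokens[len(tokens)-1]
		if !myShards[shard] {
			return // not ours: filter locally instead of subscribing per shard
		}
		log.Printf("processing %s", m.Subject)
	})
	if err != nil {
		log.Fatal(err)
	}
	select {} // block forever; a real service would wait on shutdown signals
}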

Long-term: design for subscription stability

Use durable JetStream consumers instead of ephemeral subscriptions. Durable consumers maintain their state across client disconnections. The subscription is re-bound on reconnect, but the consumer itself doesn’t need to be recreated:

// Go — durable pull consumer (survives client restarts)
js, _ := nc.JetStream()

sub, _ := js.PullSubscribe(
    "orders.>",
    "order-processor", // Durable name — persists server-side
    nats.BindStream("ORDERS"),
)

// On reconnect, the consumer already exists — no churn
msgs, _ := sub.Fetch(10, nats.MaxWait(5*time.Second))
// TypeScript — durable consumer
import { AckPolicy, connect } from "nats";

const nc = await connect({ servers: "nats://localhost:4222" });
const js = nc.jetstream();
const jsm = await nc.jetstreamManager();

// Create durable consumer once
await jsm.consumers.add("ORDERS", {
  durable_name: "order-processor",
  filter_subject: "orders.>",
  ack_policy: AckPolicy.Explicit,
});

// Bind to existing consumer on each connect — no subscription churn
const consumer = await js.consumers.get("ORDERS", "order-processor");
const messages = await consumer.consume();

Implement connection pooling. Instead of each goroutine or thread opening its own NATS connection with its own subscriptions, share a connection pool. Fewer connections mean fewer subscription replays on reconnect and fewer total subscriptions to maintain.
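A minimal sketch of the shared-connection pattern in Go. nats.Conn is safe for concurrent use, so a process normally needs only one; the sync.Once wrapper here is illustrative, not a library API:

package main

import (
	"log"
	"sync"

	"github.com/nats-io/nats.go"
)

var (
	once sync.Once
	conn *nats.Conn
)

// sharedConn returns the process-wide connection, dialing exactly once.
func sharedConn() *nats.Conn {
	once.Do(func() {
		var err error
		conn, err = nats.Connect(nats.DefaultURL)
		if err != nil {
			log.Fatal(err)
		}
	})
	return conn
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			nc := sharedConn() // all goroutines share one connection
			_ = nc.Publish("demo.ping", []byte("hi"))
		}()
	}
	wg.Wait()
}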

Monitor subscription churn as a deployment health signal. Treat sustained churn above the threshold as a code smell — it almost always indicates a subscription lifecycle pattern that should be refactored. Add churn metrics to your CI/CD validation for load tests.
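One way to wire this into CI, sketched as a Go test: snapshot /subsz before and after the load run and fail when the delta exceeds the check's threshold. The runLoadTest stub and the budget value are hypothetical placeholders:

package loadtest

import (
	"encoding/json"
	"net/http"
	"testing"
)

const churnBudget = 10000 // mirrors the check's default threshold

type subsz struct {
	NumInserts uint64 `json:"num_inserts"`
	NumRemoves uint64 `json:"num_removes"`
}

// snapshot reads the current cumulative insert/remove counters.
func snapshot(t *testing.T) subsz {
	t.Helper()
	resp, err := http.Get("http://localhost:8222/subsz")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	var s subsz
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		t.Fatal(err)
	}
	return s
}

func runLoadTest(t *testing.T) {
	// Hypothetical: drive the workload under test here.
}

func TestSubscriptionChurnBudget(t *testing.T) {
	before := snapshot(t)
	runLoadTest(t)
	after := snapshot(t)

	churn := (after.NumInserts - before.NumInserts) +
		(after.NumRemoves - before.NumRemoves)
	if churn > churnBudget {
		t.Fatalf("subscription churn %d exceeds budget %d", churn, churnBudget)
	}
}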

Frequently asked questions

What is a normal subscription churn rate?

In a healthy NATS deployment, subscription operations should be dominated by initial connection setup and rare reconnection events. A server handling 1,000 persistent connections with 10 subscriptions each should see roughly 10,000 inserts at startup and near-zero ongoing churn. If the server consistently processes thousands of insert/remove operations per collection epoch (typically 30-60 seconds), something is creating and destroying subscriptions in a loop.

Does subscription churn affect message delivery latency?

Yes. Each subscription change invalidates portions of the subscription routing cache. When the cache miss rate increases, the server must re-evaluate the subject trie for each publish to affected subjects, which adds microseconds to tens of microseconds per publish depending on trie complexity. At high publish rates, this adds measurable tail latency.

Is subscription churn the same as connection churn?

No, but they’re often correlated. Connection churn (CLUSTER_006) measures client connect/disconnect rate. Subscription churn measures subscribe/unsubscribe rate. A single connection with a per-request subscribe/unsubscribe pattern can generate massive subscription churn with zero connection churn. Conversely, a reconnect storm generates both connection churn and subscription churn (because reconnecting clients replay all their subscriptions).

How does subscription churn affect gateway interest-only mode?

In interest-only gateway mode, the local cluster tells remote clusters exactly which subjects have local subscribers. Subscription churn triggers interest updates across gateways — each new subscription may send an interest notification to every remote cluster, and each unsubscribe may send a no-interest notification. High churn in interest-only mode generates significant cross-cluster control traffic.

Can I see which subjects are churning?

Not directly from the /subsz endpoint, which only shows aggregate counts. To identify specific subjects, capture the subscription protocol traffic in the server log: enable trace logging temporarily and filter for the suspect connection's CID:

# Enable trace logging temporarily (very verbose; disable after diagnosis):
# set trace: true in the server config, then signal a reload
nats-server --signal reload=<pid>

Look for patterns of repeated SUB/UNSUB on the same subject or inbox prefix.

Proactive monitoring for NATS subscription churn with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.
