
NATS Subscription Fanout Anomaly: What It Means and How to Fix It

Severity: Info
Category: Consistency
Applies to: System Improvement
Check ID: OPT_SYS_011
Detection threshold: max fanout exceeds the configured multiplier of average fanout (default: 10x) and average fanout is greater than 1

A subscription fanout anomaly occurs when max fanout is disproportionately higher than average fanout on a NATS server. A max-to-average fanout ratio exceeding 10x (the default threshold) indicates one or more subjects with excessive subscribers acting as broadcast hotspots, multiplying CPU and memory cost per published message.

Why this matters

NATS delivers messages by iterating over every matching subscriber for a given subject. If a subject has 500 subscribers and the server average is 5, publishing a single message to that subject costs 100x more CPU than a typical publish — the server must serialize the message into 500 outbound buffers, one per subscriber connection. This cost is paid on every publish to that subject, making it a sustained CPU multiplier that scales linearly with publish rate.

The memory impact compounds the CPU cost. Each subscriber maintains a pending outbound buffer. For a subject with 500 subscribers receiving 1,000 msg/s at 1KB per message, the server allocates up to 500 pending buffers that collectively consume memory proportional to subscriber count × message rate × message size. Under load, this is precisely the pattern that triggers slow consumer disconnections (SERVER_004) — some of those 500 subscribers inevitably fall behind, and the server spends additional resources buffering messages for clients that are about to be disconnected anyway.
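
At those numbers, that single subject alone requires roughly 500 subscribers × 1,000 msg/s × 1 KB ≈ 500 MB/s of outbound writes, before any other traffic on the server is considered.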

The anomaly is often invisible during normal development and testing. A subject with 5 subscribers in staging behaves identically to one with 500 in production — the only difference is the per-message cost multiplier. Teams discover the problem when CPU spikes during traffic peaks, when slow consumer events appear on specific servers, or when one server in a cluster uses significantly more CPU than its peers (because clients subscribing to the hot subject happen to be concentrated on that server).

Common causes

  • Wildcard subscriptions matching too broadly. A subscriber on events.> receives every message published to any subject under the events. prefix. If 200 microservices each subscribe to events.> for their own logging or auditing, every publish under events. fans out to all 200. The intent is usually “each service gets its own events” — the implementation delivers all events to every service.

  • Missing queue groups for work distribution. When multiple instances of the same service all subscribe to the same subject without using a queue group, every instance receives every message. Three replicas of an order processor, each subscribed to orders.new, create 3x fanout. With a queue group, NATS delivers each message to exactly one instance — the intended behavior for work distribution.

  • Monitoring or audit subscriptions duplicated per instance. A sidecar or monitoring agent subscribing to > (all subjects) on every pod in a Kubernetes deployment creates fanout proportional to pod count. A 100-pod deployment generates 100x fanout on every subject, even though each monitor only needs to sample traffic.

  • Shared notification subjects without partitioning. A pattern where every connected user’s client subscribes to a broad prefix like notifications.user.> creates fanout proportional to the user base. If 10,000 users are online, any publish under that prefix (a system-wide announcement, for example) fans out to all 10,000 connections.

  • Cached subscription state after service restarts. If clients reconnect without cleaning up old subscriptions (or if the server retains subscription interest from routes), phantom fanout can accumulate. The effective subscriber count grows with each reconnection cycle.

How to diagnose

Check server subscription statistics

Query the server’s subscription routing information:

Terminal window
curl -s http://localhost:8222/subsz?subs=1 | jq '{num_subscriptions: .num_subscriptions, num_cache: .num_cache, num_inserts: .num_inserts, num_matches: .num_matches, cache_hit_rate: .cache_hit_rate}'
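
The same endpoint also reports the sublist’s maximum and average fanout directly, which is what this check compares. A quick way to compute the ratio (a sketch; field names assume a current nats-server monitoring API):

Terminal window
curl -s 'http://localhost:8222/subsz' | jq '{max_fanout, avg_fanout, ratio: (if .avg_fanout > 0 then (.max_fanout / .avg_fanout) else null end)}'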

Identify the high-fanout subjects

List detailed subscription information to find subjects with anomalous subscriber counts:

Terminal window
curl -s 'http://localhost:8222/subsz?subs=1' | jq '.subscriptions_list | group_by(.subject) | map({subject: .[0].subject, subscribers: length}) | sort_by(-.subscribers) | .[0:10]'

This returns the top 10 subjects by subscriber count. Compare the highest count to the average to confirm the anomaly.

Check connection-level subscription counts

Find clients with excessive subscriptions that may be contributing to fanout:

Terminal window
nats server report connections --sort subs

Connections with hundreds or thousands of subscriptions are likely using broad wildcard patterns.
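
The raw monitoring endpoint exposes the same view if you prefer JSON; connz can sort by subscription count and include each connection’s subjects (a sketch, assuming the default monitoring port):

Terminal window
curl -s 'http://localhost:8222/connz?sort=subs&subs=1' | jq '.connections[0:5] | map({cid, name, subscriptions, sample_subjects: .subscriptions_list[0:5]})'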

Measure CPU impact

Compare CPU usage across cluster servers to identify if the fanout is concentrated:

Terminal window
nats server list

If one server has significantly higher CPU than its peers, check whether the high-fanout subject’s subscribers are concentrated on that server.

Test actual fanout for a specific subject

Publish a test message and observe delivery:

Terminal window
# In separate terminals, subscribe to see who gets the message
nats sub "events.test" --count 1
# Publish a test message
nats pub "events.test" "fanout-test"

Each active subscriber prints the test message; the number of subscribers that receive it is the actual fanout for that subject.
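
Alternatively, the subsz endpoint accepts a test subject and returns only the subscriptions whose interest would match it, which reveals the fanout without publishing anything (a sketch):

Terminal window
curl -s 'http://localhost:8222/subsz?subs=1&test=events.test' | jq '.subscriptions_list | length'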

How to fix it

Immediate: identify and reduce the largest fanout

Start with the subjects that drive the ratio: a large max-to-average fanout ratio means one or a few subjects have far more subscribers than the rest of the system. Identify those subjects using the diagnosis steps above, then apply whichever fix below matches how their subscribers actually use the messages.

Add queue groups to work-distribution subscribers. If multiple instances of the same service subscribe to the same subject for processing (not for broadcast), add a queue group:

// Go — queue subscription for work distribution
// Before: every instance gets every message
// sub, _ := nc.Subscribe("orders.new", handler)

// After: NATS delivers each message to one instance in the group
sub, _ := nc.QueueSubscribe("orders.new", "order-processors", func(msg *nats.Msg) {
    processOrder(msg.Data)
})

# Python — queue subscription
# Before: nc.subscribe("orders.new", cb=handler)

# After: one delivery per message across the group
await nc.subscribe("orders.new", queue="order-processors", cb=handler)

Remove duplicate monitoring subscriptions. If monitoring sidecars don’t need every message, sample instead:

Terminal window
# Instead of subscribing to everything
# nats sub ">"
# Subscribe to a specific monitoring subject
nats sub '$SYS.SERVER.*.STATSZ'

Short-term: narrow wildcard subscriptions

Replace broad wildcards with specific subjects. Audit subscribers using > or multi-level wildcards and narrow them to the subjects they actually need:

// Before: receives ALL events across all services
// nc.Subscribe("events.>", handler)

// After: receives only order events
sub, _ := nc.Subscribe("events.orders.>", func(msg *nats.Msg) {
    handleOrderEvent(msg.Data)
})

Partition broadcast subjects. If a subject genuinely needs broadcast semantics to many subscribers, partition by a key to distribute the fanout:

// Before: one subject, 10,000 subscribers
// nc.Publish("notifications.all", data)

// After: partition by user region, 10 partitions × 1,000 subscribers each
region := getUserRegion(userID)
nc.Publish(fmt.Sprintf("notifications.%s", region), data)
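
On the subscriber side, each client then listens only to its own partition. A minimal sketch, reusing the getUserRegion helper from the example above (showNotification is an illustrative placeholder):

// Each client subscribes to its own region partition only
region := getUserRegion(userID)
sub, _ := nc.Subscribe(fmt.Sprintf("notifications.%s", region), func(msg *nats.Msg) {
    showNotification(msg.Data)
})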

Long-term: redesign subject hierarchy for bounded fanout

Establish fanout budgets. Define maximum expected fanout per subject tier in your naming convention. Example:

Subject pattern      | Expected fanout   | Mechanism
orders.*             | 1 (queue group)   | Work distribution
events.*.broadcast   | 10-50             | Known broadcast
$SYS.>               | 1-3               | Monitoring only
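
A lightweight way to audit these budgets is to group the live subscription list by subject and flag anything above a chosen limit (a sketch; the 50-subscriber cutoff is an example value, not part of the check):

Terminal window
curl -s 'http://localhost:8222/subsz?subs=1' | jq '.subscriptions_list | group_by(.subject) | map({subject: .[0].subject, subscribers: length}) | map(select(.subscribers > 50))'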

Use JetStream for high-fanout data flows. Instead of core NATS pub/sub with hundreds of subscribers, publish to a JetStream stream and let each consumer group process independently. The stream absorbs the write once; consumers read at their own pace without multiplying server-side delivery cost:

// Publish once to JetStream
js, _ := nc.JetStream()
js.Publish("events.orders.created", orderData)

// Each service creates its own consumer — no fanout multiplication
sub, _ := js.PullSubscribe("events.orders.created",
    "analytics-consumer",
    nats.BindStream("EVENTS"),
)

# Python — JetStream consumer per service
js = nc.jetstream()
await js.publish("events.orders.created", order_data)

# Each service has its own pull consumer
psub = await js.pull_subscribe(
    "events.orders.created",
    durable="analytics-consumer",
    stream="EVENTS",
)
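
Continuing the Go example above, each consumer then fetches at its own pace. A minimal sketch of the read loop (error handling simplified; assumes the nats and time packages are imported):

// Fetch and process messages in batches from the pull consumer
for {
    msgs, err := sub.Fetch(100, nats.MaxWait(2*time.Second))
    if err != nil {
        continue // includes nats.ErrTimeout when no messages are pending
    }
    for _, m := range msgs {
        handleOrderEvent(m.Data)
        m.Ack()
    }
}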

Frequently asked questions

What fanout ratio is considered normal?

It depends on your architecture. An average fanout of 1-3 is typical for microservice deployments using queue groups. A max fanout of 10-20 is reasonable for broadcast subjects like configuration updates or health checks. The check fires when the max-to-average ratio exceeds 10x — meaning one subject has dramatically more subscribers than the rest of the system. If your average is 2 and your max is 25, the ratio is 12.5x, which triggers the check even though 25 subscribers isn’t inherently problematic.

Does subscription fanout affect JetStream streams?

Not directly. JetStream stream writes are handled by the Raft group, not the subscription routing engine. However, if JetStream consumers use push delivery (deliver subject), each push consumer counts as a subscriber to its deliver subject. A stream with 50 push consumers creates fanout on the deliver subjects. Pull consumers avoid this because the client initiates the fetch.

How does fanout interact with cluster routes?

In a NATS cluster, subscription interest is propagated across routes. If 100 clients on Server A subscribe to events.> and a message is published on Server B, Server B sends one copy across the route to Server A, which then fans out locally to 100 clients. The route itself only carries one copy — fanout is always local to the server where subscribers are connected. This means fanout cost is concentrated, not distributed.

Can I limit the maximum number of subscribers on a subject?

NATS does not support per-subject subscriber limits. You can limit total subscriptions per account (via account limits) or per connection, but there’s no mechanism to say “subject X allows at most N subscribers.” The architectural solutions — queue groups, subject partitioning, JetStream consumers — are more effective than artificial limits would be.
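
If you do want a coarse guardrail, the server supports a per-connection cap (a minimal sketch of the config key; this limits subscriptions per client connection, not per subject):

# nats-server.conf: refuse more than 1,000 subscriptions per client connection
max_subscriptions: 1000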

Will reducing fanout break existing subscribers?

Adding queue groups changes delivery semantics: instead of every instance receiving every message, each message goes to one instance. This is correct for work distribution but breaks broadcast use cases. Before adding a queue group, verify that the subscribers are processing messages (not just observing them). For monitoring and audit subscribers that genuinely need every message, keep them as plain subscriptions but narrow their wildcard scope.

Proactive monitoring for NATS subscription fanout anomaly with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial