
NATS High HA Assets: What It Means and How to Fix It

Severity: Warning
Category: Saturation
Applies to: Cluster
Check ID: CLUSTER_003
Detection threshold: ha_assets >= 1,000 per server

A high HA asset count means a NATS server hosts 1,000 or more highly available (replicated) JetStream assets (streams and consumers). Each replicated asset requires its own Raft consensus group, and the cumulative overhead of too many Raft groups degrades cluster performance, increases leader election frequency, and slows meta cluster snapshots.

Why this matters

Every R3 stream creates one Raft group. Every R3 consumer on that stream creates another Raft group. A single R3 stream with 10 R3 consumers means 11 Raft groups — on each of the three servers hosting replicas. Multiply that across dozens or hundreds of streams, and a server can easily accumulate thousands of Raft groups, each independently running heartbeat timers, leader elections, and log replication.

The overhead is not linear — it’s worse than linear. Each Raft group sends periodic heartbeats to its peers. With 1,000 groups, that’s 1,000 sets of heartbeat timers firing independently, generating internal message traffic between servers. Under normal conditions this is manageable, but when the cluster experiences a disruption — a server restart, a network blip, a garbage collection pause — hundreds of Raft groups simultaneously trigger leader elections. The resulting burst of election traffic can saturate internal communication channels, causing elections to fail and retry, creating a cascade of instability that takes minutes to settle.

Meta cluster snapshots are the other bottleneck. The meta leader periodically snapshots the entire JetStream asset catalog. With thousands of HA assets, snapshot serialization takes longer, consuming CPU and I/O on the meta leader. If snapshots take too long (see META_004), new servers joining the cluster or catching up after a restart wait longer to become fully operational. In severe cases, the snapshot duration exceeds the Raft election timeout, triggering unnecessary meta leader elections.

Common causes

  • Many small streams with R3 replication. Organizations often default to R3 for all streams “just in case.” When the stream count grows into the hundreds, the Raft group overhead becomes significant — even if most streams are small, low-throughput, or rarely accessed. The replication overhead is per-stream, not per-byte.

  • High consumer count per stream. Each R3 consumer creates its own Raft group, so a stream with 50 durable R3 consumers carries 50 additional Raft groups. This is common in event-sourcing patterns where many independent services each maintain their own consumer on a shared stream.

  • Ephemeral consumers created as durable R3. Applications that should use ephemeral consumers (or R1 durable consumers) instead create R3 durable consumers. Each one persists with its Raft group even when the subscribing application disconnects, and the Raft groups accumulate over time.

  • No consolidation strategy. Without organizational guidelines, teams create one stream per data type, per service, per environment. The result is hundreds of single-subject streams, each carrying its own Raft group overhead, that could have been consolidated into multi-subject streams using subject hierarchies.

  • Organic growth without cleanup. Streams and consumers are created as needs arise but never removed when no longer needed. Over months, the HA asset count grows monotonically as abandoned streams and consumers accumulate.

How to diagnose

Check the HA asset count per server

Terminal window
nats server report jetstream

Look at the HA Assets column. Any server at or above 1,000 warrants investigation. Compare the count across servers — if all servers are high, the issue is total asset volume. If only some are high, placement is also uneven.
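
To watch this number continuously, you can poll each server's HTTP monitoring endpoint. A minimal Go sketch, assuming the monitoring port is enabled (the default -m 8222) and that the /jsz response exposes the ha_assets field at the top level:

// Sketch: read one server's HA asset count from its /jsz monitoring endpoint
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:8222/jsz") // assumes monitoring enabled on 8222
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var stats struct {
		HAAssets int `json:"ha_assets"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		panic(err)
	}
	fmt.Printf("ha_assets=%d (warning threshold: 1000)\n", stats.HAAssets)
}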

Break down streams by replica count

Terminal window
nats stream report

This shows every stream with its replica count and leader placement. Count how many R3 (or R5) streams exist. Look for streams that might not need replication — low-throughput, non-critical, or rebuildable data.
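
The same audit can be scripted with the nats.go client. A minimal sketch, assuming an established connection nc and the legacy JetStreamContext API:

// Sketch: tally streams by replica count to find downgrade candidates
js, err := nc.JetStream()
if err != nil {
	log.Fatal(err)
}
replicaCounts := map[int]int{}
for info := range js.StreamsInfo() {
	replicaCounts[info.Config.Replicas]++
}
fmt.Printf("streams by replica count: %v\n", replicaCounts) // e.g. map[1:12 3:480]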

Count consumers per stream

For streams with high consumer counts:

Terminal window
nats consumer report <stream-name>

This shows all consumers on the stream, their replica counts, and their activity status. Look for consumers that are idle, fully drained, or no longer needed.
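
This check can also be scripted. A sketch using the same js context as above; the stream name "ORDERS" is a placeholder and the seven-day idle window is an arbitrary choice:

// Sketch: flag consumers that have not delivered anything recently
for ci := range js.ConsumersInfo("ORDERS") {
	if ci.Delivered.Last == nil || time.Since(*ci.Delivered.Last) > 7*24*time.Hour {
		fmt.Printf("possibly idle: %s (pending=%d, replicas=%d)\n",
			ci.Name, ci.NumPending, ci.Config.Replicas)
	}
}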

Check Raft group health indicators

Terminal window
nats server report jetstream --raft

If Raft apply lag or pending values are elevated across many groups, the server is struggling to keep up with the consensus overhead.

Estimate total Raft group count

Each R3 stream = 1 Raft group. Each R3 consumer = 1 Raft group. Total per server ≈ (R3 streams hosted × 1) + (R3 consumers hosted × 1). A server participating in 500 R3 streams with an average of 3 R3 consumers each is managing 2,000 Raft groups.
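
The same estimate can be computed from live data. A sketch, assuming consumers inherit their stream's replica count (the default); note this counts cluster-wide groups, and each group occupies every server hosting one of its replicas:

// Sketch: estimate the cluster-wide Raft group count
groups := 0
for info := range js.StreamsInfo() {
	if info.Config.Replicas > 1 {
		groups += 1 + info.State.Consumers // one group for the stream, one per consumer
	}
}
fmt.Printf("~%d Raft groups cluster-wide\n", groups)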

How to fix it

Immediate: reduce the most impactful Raft groups

Identify streams that can be downgraded from R3 to R1 immediately — non-critical, rebuildable, or low-value data:

Terminal window
# Change a stream from R3 to R1
nats stream edit <stream-name> --replicas 1

This dissolves the stream's Raft group entirely: the two peer servers drop their replicas, and the remaining copy is served without consensus overhead. Start with inactive or low-throughput streams to reduce the count quickly.
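
The downgrade can also be done from code. A sketch of the Go equivalent, using the same js context as above; UpdateStream takes the full configuration, so fetch it, change only Replicas, and submit it back:

// Sketch: downgrade a stream from R3 to R1 programmatically
info, err := js.StreamInfo("cache-events") // placeholder stream name
if err != nil {
	log.Fatal(err)
}
cfg := info.Config
cfg.Replicas = 1
if _, err := js.UpdateStream(&cfg); err != nil {
	log.Fatal(err)
}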

Remove consumers that are no longer active:

Terminal window
# List consumers and their activity
nats consumer report <stream-name>
# Delete unused consumers
nats consumer rm <stream-name> <consumer-name>
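
Combined with the idle-consumer sketch above, this cleanup can be automated. A hedged example; the 30-day window is arbitrary, and you should review the candidate list before deleting anything in production:

// Sketch: remove consumers idle for more than 30 days
for ci := range js.ConsumersInfo("ORDERS") { // placeholder stream name
	if ci.Delivered.Last != nil && time.Since(*ci.Delivered.Last) > 30*24*time.Hour {
		if err := js.DeleteConsumer("ORDERS", ci.Name); err != nil {
			log.Printf("delete %s: %v", ci.Name, err)
		}
	}
}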

Short-term: redistribute, consolidate, and right-size

Redistribute streams and consumers across more servers. If the cluster has servers with significantly fewer HA assets than the flagged server, move stream replicas onto them with placement tags. This shrinks each server’s Raft group footprint without changing replica counts or consumer behavior:

Terminal window
# Tag servers (on each server's config), then re-target the stream
nats stream edit <stream-name> --tag jetstream-pool-b

For R1 streams, scale to R3 temporarily so a replica lands on the underloaded server, then scale back to R1 after a step-down so the new leader stays there. This is most effective when the imbalance is concentrated on a few large streams; for many small streams, consolidation is faster.

Merge small single-subject streams. If you have streams like orders.created, orders.updated, orders.deleted each as separate R3 streams, consolidate them into a single orders stream with subject filtering:

Terminal window
# Create consolidated stream
nats stream add orders --subjects "orders.>" --replicas 3 --retention limits
# Consumers can filter by subject
nats consumer add orders order-created-processor --filter "orders.created" --pull

Three Raft groups become one (plus consumer groups).

Use R1 for ephemeral and rebuildable data. Streams storing cache data, derived views, or data that can be rebuilt from source don’t need replication:

// Go: create an R1 stream for non-critical data
js, err := nc.JetStream()
if err != nil {
	log.Fatal(err)
}
_, err = js.AddStream(&nats.StreamConfig{
	Name:     "cache-events",
	Subjects: []string{"cache.>"},
	Replicas: 1,              // single replica: no Raft group
	MaxAge:   24 * time.Hour, // age out cache data automatically
})
if err != nil {
	log.Fatal(err)
}

Convert unnecessary durable consumers to ephemeral. If a consumer doesn’t need to survive application restarts, use ephemeral consumers that clean up automatically:

// Ephemeral pull consumer: no Raft group; deleted automatically after 5 idle minutes
sub, err := js.PullSubscribe("orders.>", "", nats.InactiveThreshold(5*time.Minute))
if err != nil {
	log.Fatal(err)
}
defer sub.Drain() // on shutdown, the server removes the consumer once the threshold passes

Long-term: establish governance

Set organizational guidelines for replica counts. Define when R3 is justified (financial transactions, audit logs, data that cannot be rebuilt) versus when R1 is appropriate (caches, ephemeral data, streams backed by external systems). Make R1 the default; require justification for R3.

Monitor Raft group growth. Track the HA asset count over time. Alert before it reaches the threshold.

Implement stream lifecycle management. Require streams and consumers to have retention limits (max_age, max_bytes) and review unused assets periodically. Synadia Insights’ inactive stream (OPT_IDLE_002) and inactive consumer (OPT_IDLE_003) checks automate this review.

Budget Raft groups per cluster. As a guideline, keep the total HA asset count below 1,000 per server for clusters running on typical hardware. High-performance deployments can handle more, but the overhead should be monitored. If you need thousands of streams, consider whether some workloads can use R1 or separate clusters.

Frequently asked questions

How many Raft groups is too many for a NATS server?

The default threshold is 1,000 HA assets per server, but the actual limit depends on hardware. Each Raft group consumes CPU for heartbeats and elections, memory for log buffers, and network bandwidth for replication. On a 4-core server with moderate network bandwidth, 500–1,000 groups is manageable. Beyond that, you’ll see increasing election latency and snapshot duration. Monitor meta snapshot time (META_004) and Raft apply lag (OPT_SYS_007) as indicators of Raft overhead.

Do R1 streams create Raft groups?

No. R1 streams and consumers do not use Raft — there’s no consensus needed with a single replica. Switching from R3 to R1 eliminates the Raft group entirely. This is why R1 is strongly recommended for any stream that doesn’t require high availability.

What happens during a mass leader election with many Raft groups?

When a server restarts or a network partition heals, every Raft group that had a member on the affected server initiates an election. With hundreds of groups, the election messages compete for the same internal communication channels. Elections may fail due to timeouts, retry with randomized backoff, and create a “thundering herd” effect. The cluster typically stabilizes within 30–60 seconds, but during that window, affected streams and consumers may be unavailable for writes.

Can I reduce replica count on a live stream without downtime?

Yes. nats stream edit <name> --replicas 1 takes effect immediately. The stream remains available throughout the change. The extra replicas are removed, and the Raft group is dissolved (for R3→R1) or reduced. Consumers on the stream continue operating without interruption.

Should I consolidate streams or reduce replica counts first?

Reduce replica counts first — it’s faster, lower risk, and has the most immediate impact on Raft group count. Consolidating streams requires migrating data and updating consumer configurations, which is more involved. Start by auditing which R3 streams can safely become R1, then plan stream consolidation as a longer-term project.

Proactive monitoring for NATS high HA assets with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial