
NATS Meta State Growth: What It Means and How to Fix It

Severity: Warning
Category: Saturation
Applies to: Meta Cluster
Check ID: META_005
Detection threshold: total JetStream asset replicas exceed the configured threshold (default: 5,000)

Meta state growth means the total number of JetStream asset replicas (streams and consumers combined) tracked by the meta cluster has exceeded the configured threshold. The meta group maintains the placement and configuration state for every replicated stream and consumer. As this state grows, meta snapshots take longer, leader elections slow down, API operations degrade, and memory usage climbs on every JetStream-enabled server.

Why this matters

The meta cluster is the brain of JetStream. It tracks where every stream and consumer replica lives, what their configurations are, and which servers are responsible for them. Every replicated stream is a Raft group. Every replicated consumer is a Raft group. The meta leader coordinates across all of them. When the total count grows into the thousands, the coordination overhead becomes a scaling bottleneck.

The most visible symptom is meta snapshot latency. The meta leader periodically snapshots its state to disk so that followers can catch up efficiently after restarts. With 1,000 Raft groups, snapshots take milliseconds. With 10,000, they can take seconds. With 20,000+, snapshot operations can exceed timeout thresholds (META_004), delay leader elections, and cause the meta leader to miss Raft heartbeats — which triggers the very leader elections the snapshot was trying to support.

Memory is the other pressure point. Every JetStream server holds a copy of the meta group state in memory. More streams and consumers means more state, and that memory is not available for message caching or connection handling. In clusters that also serve high-throughput core NATS traffic, meta state memory competes directly with the working set for message routing. The growth is gradual — a few new streams per week is invisible, but after a year of unchecked growth, the cumulative state can strain servers that were originally provisioned for a much smaller workload.

Common causes

  • Unbounded stream creation without cleanup. Applications or automation that create streams for temporary purposes — batch jobs, test suites, one-off data imports — but never delete them afterward. Each stream persists in the meta state indefinitely, even if it’s empty and receiving no traffic.

  • Per-entity stream pattern. A design where each user, device, or tenant gets its own stream: USER-12345, DEVICE-67890, etc. This pattern linearly scales the number of Raft groups with the number of entities. At 1,000 entities with R3 streams, that’s 3,000 stream replicas — before counting consumers.

  • Consumer proliferation on shared streams. A single stream with dozens of consumers, each for a different service or processing stage. While the stream is shared, each consumer is its own Raft group. Multiply by the number of streams and replica factor, and consumer replicas can outnumber stream replicas by an order of magnitude.

  • Test and staging resources left behind. Development and staging environments that share a production cluster (or were never cleaned up) accumulate streams and consumers that serve no purpose but contribute to meta state size. CI/CD pipelines that create test streams without teardown are a common source.

  • High replica counts on low-priority streams. Streams configured with R3 or R5 replication when R1 would suffice. Each additional replica is another Raft group the meta leader must coordinate. Over-replication across many streams compounds the effect.
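To see how quickly these patterns compound, the Raft-group arithmetic can be sketched in a few lines. The entity and consumer counts are illustrative, taken from the per-entity example above:

```python
def raft_group_count(streams: int, stream_replicas: int,
                     consumers_per_stream: int, consumer_replicas: int) -> int:
    """Total Raft groups = stream replicas + consumer replicas."""
    return (streams * stream_replicas
            + streams * consumers_per_stream * consumer_replicas)

# Per-entity pattern: 1,000 entities with R3 streams is 3,000 stream replicas.
print(raft_group_count(1_000, 3, 0, 0))   # 3000

# Add 3 durable consumers per stream (consumers default to the stream's
# replica factor) and the total is well past the 5,000 default threshold.
print(raft_group_count(1_000, 3, 3, 3))   # 12000
```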

How to diagnose

Check total asset counts

Get the current stream and consumer counts across the cluster:

nats server report jetstream

This report shows the total number of streams and consumers per server, including replicas. The check fires when the total replica count across all assets exceeds the threshold (default: 5,000).

For a more detailed breakdown, query the meta leader directly:

nats server req jetstream --leader

Identify the largest contributors

List all streams across all accounts to find which accounts and streams are driving growth:

nats stream ls -a

For consumer counts per stream:

nats consumer ls <stream_name>

Look for streams with high consumer counts or accounts with many streams. The goal is to find the top contributors to total Raft group count — these are your optimization targets.
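That triage can be done offline against the JSON the CLI emits. A minimal sketch, assuming stream info objects shaped like the `nats stream ls -a -j` output used elsewhere on this page (`config.num_replicas` plus a `state.consumer_count` field; verify the shape against your CLI version):

```python
import json

# Hypothetical sample in the assumed shape: one object per stream with its
# replica factor and bound consumer count.
streams = json.loads("""[
  {"config": {"name": "ORDERS",     "num_replicas": 3}, "state": {"consumer_count": 40}},
  {"config": {"name": "USER-12345", "num_replicas": 3}, "state": {"consumer_count": 1}},
  {"config": {"name": "AUDIT-LOG",  "num_replicas": 1}, "state": {"consumer_count": 2}}
]""")

def raft_groups(s: dict) -> int:
    # Consumers default to the stream's replica factor.
    r = s["config"]["num_replicas"]
    return r + s["state"]["consumer_count"] * r

# Rank streams by their contribution to total Raft group count.
for s in sorted(streams, key=raft_groups, reverse=True):
    print(f'{s["config"]["name"]}: {raft_groups(s)} Raft groups')
```

A stream with many consumers (like ORDERS here) typically dominates the total even when the stream count looks modest.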

Check snapshot performance

If meta state is large enough to affect snapshots, the META_004 check may also be firing. Verify snapshot duration:

nats server report jetstream

The meta group section shows snapshot-related metrics. If snapshot duration is climbing over time, meta state growth is the likely cause.

Calculate the total Raft group count

Total Raft groups = sum of all stream replicas + sum of all consumer replicas across the cluster. For a rough estimate:

# Sum replica factors across all streams (stream Raft groups)
nats stream ls -a -j | jq '[.[].config.num_replicas] | add'

# Count consumers across all streams
nats stream ls -a -j | jq -r '.[].config.name' | while read -r s; do
  nats consumer ls "$s" -j 2>/dev/null | jq 'length'
done | paste -sd+ - | bc

The first command already sums replica factors, so it gives the stream Raft group count directly; multiply the consumer counts by each stream's replica factor (consumers default to the stream's replica count) to get the true total.

How to fix it

Immediate: audit and clean up

Delete unused streams. Identify streams with no recent messages and no active consumers. These are safe candidates for removal:

# Find streams with no recent activity
nats stream ls -a

Look for streams where last seq hasn’t changed in weeks or months. Verify no consumers are bound before deleting:

nats stream rm <stream_name>

Delete orphaned consumers. Consumers bound to streams but with no active subscribers waste Raft groups:

nats consumer ls <stream_name>

Look for consumers with zero pending messages and no recent delivery activity. If the subscribing application no longer exists, delete the consumer:

nats consumer rm <stream_name> <consumer_name>
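The selection rule behind this audit can be expressed as pure logic. A sketch of one reasonable policy; the 30-day idle window is an illustrative threshold, not a NATS default:

```python
from datetime import datetime, timedelta, timezone

def is_cleanup_candidate(last_write: datetime, consumer_count: int,
                         now: datetime, idle_days: int = 30) -> bool:
    """A stream is a deletion candidate when nothing has been written
    for idle_days and no consumers are bound to it."""
    return consumer_count == 0 and (now - last_write) > timedelta(days=idle_days)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(is_cleanup_candidate(datetime(2024, 1, 1, tzinfo=timezone.utc), 0, now))   # True
print(is_cleanup_candidate(datetime(2024, 5, 28, tzinfo=timezone.utc), 0, now))  # False: recent writes
print(is_cleanup_candidate(datetime(2024, 1, 1, tzinfo=timezone.utc), 2, now))   # False: consumers bound
```

Feed it the last-message timestamp and consumer count from stream info, and only delete streams that pass both tests.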

Short-term: consolidate and reduce replicas

Replace per-entity streams with subject partitioning. Instead of creating a stream per user or device, use a single stream with subjects:

// Instead of creating one stream per user:
// js.AddStream("USER-12345", ...)
// js.AddStream("USER-67890", ...)

// Use a single stream with a subject hierarchy:
cfg := jetstream.StreamConfig{
    Name:     "USERS",
    Subjects: []string{"users.>"},
    Replicas: 3,
}
stream, err := js.CreateOrUpdateStream(ctx, cfg)

// Filter by user at the consumer level:
cons, err := js.CreateOrUpdateConsumer(ctx, "USERS", jetstream.ConsumerConfig{
    Durable:       "user-12345-processor",
    FilterSubject: "users.12345.>",
    AckPolicy:     jetstream.AckExplicitPolicy,
})
# Python — single stream with subject filtering
await js.add_stream(name="USERS", subjects=["users.>"], num_replicas=3)
sub = await js.subscribe("users.12345.>", durable="user-12345-processor")

This replaces N stream Raft groups with one — a dramatic reduction for entity-per-stream patterns.

Reduce replica counts on non-critical streams. Audit streams currently configured with R3 or R5 and determine whether R1 is sufficient:

nats stream edit <stream_name> --replicas 1

Every reduction from R3 to R1 eliminates 2 Raft groups per stream. Across hundreds of streams, this materially reduces meta state size.
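The savings are easy to quantify. A quick sketch, with an illustrative stream count:

```python
def raft_groups_saved(streams: int, old_r: int, new_r: int) -> int:
    """Raft groups eliminated by lowering the replica factor on `streams` streams.
    Consumers that inherit the stream's replica factor shrink as well, so the
    real savings are usually larger than this stream-only figure."""
    return streams * (old_r - new_r)

# 300 low-priority streams reduced from R3 to R1:
print(raft_groups_saved(300, 3, 1))  # 600
```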

Long-term: prevent unbounded growth

Set account-level limits for streams and consumers. Prevent any single account from creating unlimited JetStream resources:

{
  "jetstream": {
    "max_streams": 50,
    "max_consumers": 200,
    "max_mem": "1GB",
    "max_disk": "10GB"
  }
}

These limits cap the number of Raft groups any account can contribute to the meta state. Without them, a single runaway account can grow the meta state indefinitely.
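In a file-based nats-server configuration, these limits nest under the account they should constrain. A minimal sketch, assuming a nats-server conf file (the account name is illustrative, the values are examples, and note the server conf syntax uses `max_file` for disk limits):

```
accounts {
  TENANT_A {
    jetstream {
      max_streams:   50
      max_consumers: 200
      max_mem:       1GB
      max_file:      10GB
    }
  }
}
```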

Implement regular housekeeping. Schedule periodic audits — weekly or monthly — to identify and clean up unused streams and consumers. Synadia Insights automates this through checks like Inactive Stream (OPT_IDLE_002), Inactive Consumer (OPT_IDLE_003), and Drained Consumer (OPT_IDLE_004), which surface cleanup candidates without manual inventory.

Monitor growth trends. Track total Raft group count over time. A cluster growing by 100 groups per week will hit the threshold within a year even if each individual addition seems harmless. Trend monitoring catches the problem early, when cleanup is easy.

Frequently asked questions

How many Raft groups is too many?

The default threshold of 5,000 total replicas is a practical guideline based on the point where meta operations start to show measurable latency impact. Clusters on fast hardware (NVMe, dedicated CPU) can handle more. Clusters on shared or slower infrastructure may show degradation earlier. The key metric to watch is meta snapshot duration — if it’s climbing, you’re approaching your cluster’s limit regardless of the absolute count.

Do R1 streams contribute to meta state?

Yes. Every stream is tracked in the meta state regardless of replica count. An R1 stream is one Raft group; an R3 stream is three. R1 streams are “cheaper” in terms of Raft groups but still contribute to meta state. The primary benefit of reducing replica counts is fewer Raft groups to coordinate, which reduces meta leader overhead.

Can I move streams between accounts to rebalance?

Not directly — NATS doesn’t support moving a stream between accounts. You would need to create a new stream in the target account, mirror or copy the data, redirect publishers and consumers, then delete the original. For rebalancing meta state, it’s more practical to consolidate streams within accounts (combining per-entity streams) or reduce replica counts.

Does meta state growth affect message throughput?

Not directly — message publishing and consumption operate through stream Raft groups, which are independent of the meta group. However, if meta state growth causes the meta leader to become slow or unstable (triggering META_003 or META_004), the resulting leader elections and API latency can indirectly affect applications that create or modify consumers as part of their message processing flow.

How do I track meta state size over time?

The /jsz endpoint reports total streams and consumers. Monitor these values over time in your observability stack. In Prometheus, tracking nats_jsz_streams and nats_jsz_consumers with their replica factors gives you the total Raft group count. Synadia Insights evaluates this automatically and alerts when the growth trend approaches the threshold.

Proactive monitoring for NATS meta state growth with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial. Cancel anytime.