
NATS Stream Replica Count Imbalance: What It Means and How to Fix It

Severity: Info
Category: Saturation
Applies to: Balance
Check ID: OPT_BALANCE_004
Detection threshold: Replicas > 1.5× cluster average (minimum 3 servers, minimum 10 replicas)

A stream replica count imbalance occurs when one server in a NATS cluster hosts significantly more stream replicas than its peers — more than 1.5× the cluster average, with at least 10 replicas and a minimum of 3 servers in the cluster. This uneven distribution creates disproportionate storage I/O, memory pressure, and Raft overhead on the overloaded server.
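The detection rule above can be expressed as a small function. This is an illustrative sketch of the threshold logic, not the check's actual implementation; the function name, server names, and counts are hypothetical.

```go
package main

import "fmt"

// replicaImbalance reports which servers exceed 1.5× the cluster-average
// replica count. It mirrors the check's stated conditions: at least 3
// servers in the cluster and at least 10 replicas on the offending server.
func replicaImbalance(counts map[string]int) []string {
	if len(counts) < 3 {
		return nil // too few servers for the check to apply
	}
	total := 0
	for _, c := range counts {
		total += c
	}
	avg := float64(total) / float64(len(counts))
	var over []string
	for server, c := range counts {
		if c >= 10 && float64(c) > 1.5*avg {
			over = append(over, server)
		}
	}
	return over
}

func main() {
	// Hypothetical cluster: average is 20, so the threshold is 30.
	counts := map[string]int{"n1": 40, "n2": 12, "n3": 8}
	fmt.Println(replicaImbalance(counts)) // prints [n1]
}
```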

Why this matters

Every stream replica a server hosts costs resources. Each replica maintains its own Raft group for consensus, consuming CPU for heartbeats and log replication. Each replica stores a copy of the stream data, consuming disk I/O and storage capacity. Each replica holds message index structures in memory. When one server hosts far more replicas than its peers, these costs accumulate and create a performance gap within the cluster.

The operational impact is subtle but real. The overloaded server takes longer to process Raft proposals because it’s managing more groups. It generates more disk I/O because it’s writing more streams. During a leader election storm — triggered by a network blip or server restart — the server with the most replicas participates in the most elections simultaneously, consuming CPU at exactly the moment the cluster needs fast convergence. The result: slower recovery times and increased risk of cascading Raft timeouts.

Replica imbalance also undermines the fault tolerance model. If one server holds a disproportionate share of replicas, losing that server affects more streams than losing any other server would. The blast radius of a single node failure is no longer uniform across the cluster, which undercuts exactly the uniform fault tolerance that R3 replication is designed to provide.

Common causes

  • No placement constraints on stream creation. When streams are created without placement tags, the meta leader assigns replicas based on current load at creation time. Over time, as streams are created and deleted at different rates, the distribution drifts. Servers that were lightly loaded when streams were created accumulate replicas.

  • Uneven server addition to the cluster. When new servers join the cluster, existing streams don’t automatically redistribute. New servers start with zero replicas and only receive replicas from newly created streams. Older servers keep their existing load.

  • Streams created in bursts. If many streams are created in quick succession (e.g., during a migration or provisioning run), the meta leader may favor the same set of servers before load information propagates, concentrating replicas.

  • Failed replica removals. When a server is replaced or decommissioned, its replicas should be removed and re-created on other servers. If this process fails partway through, some streams keep replicas assigned to the replaced server while the rest are re-created on the remaining servers, concentrating load unevenly.

  • Mixed cluster sizing. In a heterogeneous cluster where some servers have more resources, operators may intentionally direct more streams to larger servers. Over time, the intent gets lost but the imbalance remains, and the larger servers become bottlenecks despite their additional capacity.

How to diagnose

Check replica distribution across servers

Terminal window
nats server report jetstream

This shows per-server JetStream statistics including the number of streams and consumers hosted. Compare the stream counts across servers — a server at or above 1.5× the cluster average is imbalanced.

List streams with their cluster placement

Terminal window
nats stream report

The output includes cluster placement information per stream, showing which servers host replicas. Identify streams that cluster on the overloaded server.

Inspect a specific stream’s replica placement

Terminal window
nats stream info <stream_name>

The Cluster section shows the leader and all replicas with their current state (current, lagging, offline). Check which server appears most frequently as a replica host across your streams.

Quantify the imbalance

Terminal window
# Count replicas per server across all streams
nats stream report --json | jq '[.[] | .cluster.replicas[]?.name] | group_by(.) | map({server: .[0], count: length}) | sort_by(.count) | reverse'

This aggregates replica counts per server, making the imbalance immediately visible.

Check for corresponding storage skew

Terminal window
nats server report jetstream

If replica count imbalance exists, storage skew (OPT_BALANCE_005) usually follows. Servers with more replicas typically use more disk space. Check the File and Memory columns for correlation.

How to fix it

Immediate: redistribute leaders away from the overloaded server

While this doesn’t fix replica placement, it reduces the CPU burden on the overloaded server by moving leadership (and therefore write I/O) to other replicas:

Terminal window
# Step down leaders on the overloaded server
nats stream cluster step-down <stream_name>

Repeat for streams where the overloaded server is the leader. This is fast and non-disruptive — clients experience a brief leader election (typically < 1 second).

Short-term: manually rebalance replicas

For streams on the overloaded server, force replica migration by editing the stream’s placement:

// Go — update stream placement to target specific servers
js, err := nc.JetStream()
if err != nil {
	log.Fatal(err)
}

cfg := &nats.StreamConfig{
	Name:     "ORDERS",
	Subjects: []string{"orders.>"},
	Replicas: 3,
	Placement: &nats.Placement{
		Tags: []string{"az:us-east-1a"},
	},
}
if _, err := js.UpdateStream(cfg); err != nil {
	log.Fatal(err)
}

Alternatively, for streams without specific placement requirements, you can remove and re-add a replica to force redistribution:

Terminal window
# Scale down then back up to trigger re-placement
nats stream edit <stream_name> --replicas 1
# Wait for convergence
nats stream edit <stream_name> --replicas 3

Caution: Scaling to R1 temporarily removes fault tolerance. Only do this for non-critical streams, and one stream at a time.

Long-term: use placement tags systematically

Establish a placement tag strategy that ensures even distribution at creation time:

# Server configuration — tag each server with its identity
server_tags: ["group:A"]

Create streams with a placement constraint. Note that a stream's placement tags are ANDed — replicas are placed only on servers that carry all of the listed tags — so select a pool with a single shared tag rather than listing one tag per group:

Terminal window
nats stream add EVENTS \
--subjects "events.>" \
--replicas 3 \
--tag group:A
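If the goal is to spread each stream's replicas across distinct tag values (for example, one replica per zone or group), that is configured on the server side rather than per stream: the jetstream block supports a unique_tag prefix that forces each replica of a stream onto a server with a different value for that tag. A sketch, to be checked against your server version:

```
# Server configuration sketch — one file per server, each with its own az tag.
server_tags: ["az:us-east-1a"]

jetstream {
  # Replicas of any one stream must land on servers with distinct az: values.
  unique_tag: "az:"
}
```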

Build stream creation into your provisioning pipeline so every new stream gets explicit placement. Audit placement quarterly using nats stream report and redistribute as needed.

Frequently asked questions

Does NATS automatically rebalance replicas when a new server joins?

No. Existing stream replicas stay where they are. New servers only receive replicas from newly created streams or explicit replica migrations. If you add a server to reduce load, you need to manually move some replicas to it by editing stream placement or cycling replica counts.

How many replicas per server is too many?

There’s no hard limit, but Synadia recommends monitoring when any server exceeds 1,000 HA assets (see CLUSTER_003). At that point, Raft overhead — heartbeats, elections, log replication — becomes a measurable CPU and network cost. The imbalance check fires at 1.5× the cluster average, which catches relative imbalance regardless of absolute count.

Will scaling a stream down to R1 and back to R3 lose data?

No. Scaling to R1 removes the extra replicas but keeps the leader and all its data intact. Scaling back to R3 creates new replicas that sync from the leader. However, during the time the stream is at R1, you have no fault tolerance — if the remaining server goes down, the stream is unavailable. Perform this operation during maintenance windows and one stream at a time.

What’s the relationship between replica imbalance and leader imbalance?

They’re related but distinct. Replica imbalance (this check) means one server hosts more stream copies. Leader imbalance (OPT_BALANCE_001) means one server leads more Raft groups. A server can host many replicas but lead few of them, or vice versa. Both are worth monitoring — replica imbalance affects storage and Raft participation, while leader imbalance affects write throughput and election load.
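The distinction can be made concrete with a small tally. A hypothetical sketch: given each stream's leader and full host list, count leaderships and replica placements separately, since a server can rank high on one and low on the other.

```go
package main

import "fmt"

// placement records where one stream lives: which server leads it and
// every server that hosts a replica (the leader included).
type placement struct {
	leader string
	hosts  []string
}

// tally returns per-server replica counts and leader counts. Tracking
// them separately is what distinguishes this check (OPT_BALANCE_004)
// from leader imbalance (OPT_BALANCE_001).
func tally(streams []placement) (replicas, leaders map[string]int) {
	replicas, leaders = map[string]int{}, map[string]int{}
	for _, p := range streams {
		leaders[p.leader]++
		for _, h := range p.hosts {
			replicas[h]++
		}
	}
	return replicas, leaders
}

func main() {
	// Hypothetical: n1 hosts a replica of every stream but leads none.
	streams := []placement{
		{leader: "n2", hosts: []string{"n1", "n2", "n3"}},
		{leader: "n3", hosts: []string{"n1", "n2", "n3"}},
	}
	replicas, leaders := tally(streams)
	fmt.Println(replicas["n1"], leaders["n1"]) // prints 2 0
}
```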

Can placement tags prevent imbalance?

Placement tags help but don’t guarantee balance on their own. Tags constrain which servers are eligible for a stream’s replicas, but the meta leader still chooses among eligible servers. If all your streams use the same tag set, the meta leader distributes among those servers — but drift can still occur over time as streams are created and deleted. Combine tags with periodic auditing.

Proactive monitoring for NATS stream replica count imbalance with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.
