A stream replica count imbalance occurs when one server in a NATS cluster hosts significantly more stream replicas than its peers — more than 1.5× the cluster average, with at least 10 replicas and a minimum of 3 servers in the cluster. This uneven distribution creates disproportionate storage I/O, memory pressure, and Raft overhead on the overloaded server.
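As a rough illustration of those thresholds, the sketch below applies them to a map of per-server replica counts. The function and the sample counts are illustrative only, not the check's actual implementation.

```go
package main

import "fmt"

// replicaImbalance reports servers whose replica count exceeds 1.5x the
// cluster average. The check only applies when the cluster has at least
// 3 servers and the flagged server hosts at least 10 replicas.
func replicaImbalance(counts map[string]int) []string {
	if len(counts) < 3 {
		return nil
	}
	total := 0
	for _, c := range counts {
		total += c
	}
	avg := float64(total) / float64(len(counts))

	var overloaded []string
	for server, c := range counts {
		if c >= 10 && float64(c) > 1.5*avg {
			overloaded = append(overloaded, server)
		}
	}
	return overloaded
}

func main() {
	counts := map[string]int{"n1": 30, "n2": 12, "n3": 12}
	fmt.Println(replicaImbalance(counts)) // [n1]: 30 > 1.5 x 18
}
```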
Every stream replica a server hosts costs resources. Each replica maintains its own Raft group for consensus, consuming CPU for heartbeats and log replication. Each replica stores a copy of the stream data, consuming disk I/O and storage capacity. Each replica holds message index structures in memory. When one server hosts far more replicas than its peers, these costs accumulate and create a performance gap within the cluster.
The operational impact is subtle but real. The overloaded server takes longer to process Raft proposals because it’s managing more groups. It generates more disk I/O because it’s writing more streams. During a leader election storm — triggered by a network blip or server restart — the server with the most replicas participates in the most elections simultaneously, consuming CPU at exactly the moment the cluster needs fast convergence. The result: slower recovery times and increased risk of cascading Raft timeouts.
Replica imbalance also undermines the fault tolerance model. If one server holds a disproportionate share of replicas, losing that server affects more streams than losing any other server. The blast radius of a single node failure is no longer uniform across the cluster — which is exactly the assumption that R3 replication is designed to provide.
Replica imbalance typically develops through a few common patterns.

No placement constraints on stream creation. When streams are created without placement tags, the meta leader assigns replicas based on current load at creation time. Over time, as streams are created and deleted at different rates, the distribution drifts, and servers that were lightly loaded when streams were created accumulate replicas.
Uneven server addition to the cluster. When new servers join the cluster, existing streams don’t automatically redistribute. New servers start with zero replicas and only receive replicas from newly created streams. Older servers keep their existing load.
Streams created in bursts. If many streams are created in quick succession (e.g., during a migration or provisioning run), the meta leader may favor the same set of servers before load information propagates, concentrating replicas.
Failed replica removals. When a server is replaced or decommissioned, its replicas should be removed and re-created on other servers. If this process fails partway through, some replicas remain on the replacement server while new ones accumulate on remaining servers.
Mixed cluster sizing. In a heterogeneous cluster where some servers have more resources, operators may intentionally direct more streams to larger servers. Over time, the intent gets lost but the imbalance remains, and the larger servers become bottlenecks despite their additional capacity.
To diagnose the imbalance, start with the per-server view:

```bash
nats server report jetstream
```

This shows per-server JetStream statistics, including the number of streams and consumers each server hosts. Compare the stream counts across servers: a server at 1.5× or more of the cluster average is imbalanced.
```bash
nats stream report
```

The output includes cluster placement information per stream, showing which servers host replicas. Identify streams that cluster on the overloaded server.
```bash
nats stream info <stream_name>
```

The Cluster section shows the leader and all replicas with their current state (current, lagging, offline). Check which server appears most frequently as a replica host across your streams.
```bash
# Count replicas per server across all streams
nats stream report --json | jq '[.[] | .cluster.replicas[]?.name] | group_by(.) | map({server: .[0], count: length}) | sort_by(.count) | reverse'
```

This aggregates replica counts per server, making the imbalance immediately visible.
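If you want the same aggregation programmatically, for example from a periodic audit job, a sketch along these lines using the nats.go JetStream API produces equivalent per-server counts. The connection details are placeholders, and unlike the jq pipeline above this version also counts each stream's leader as a hosted replica, so totals may differ slightly.

```go
package main

import (
	"fmt"
	"log"
	"sort"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // adjust URL/credentials for your cluster
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Count how many stream replicas each server hosts (leader included).
	counts := map[string]int{}
	for si := range js.StreamsInfo() {
		if si.Cluster == nil {
			continue
		}
		counts[si.Cluster.Leader]++
		for _, r := range si.Cluster.Replicas {
			counts[r.Name]++
		}
	}

	// Print servers sorted by replica count, highest first.
	servers := make([]string, 0, len(counts))
	for s := range counts {
		servers = append(servers, s)
	}
	sort.Slice(servers, func(i, j int) bool { return counts[servers[i]] > counts[servers[j]] })
	for _, s := range servers {
		fmt.Printf("%-20s %d\n", s, counts[s])
	}
}
```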
```bash
nats server report jetstream
```

If replica count imbalance exists, storage skew (OPT_BALANCE_005) usually follows: servers with more replicas typically use more disk space. Check the File and Memory columns for correlation.
A quick mitigation is to step down stream leaders hosted on the overloaded server. While this doesn’t fix replica placement, it reduces the CPU burden on that server by moving leadership (and therefore write I/O) to other replicas:
```bash
# Step down the leader on the overloaded server
nats stream cluster step-down <stream_name>
```

Repeat for each stream where the overloaded server is the leader. This is fast and non-disruptive: clients experience a brief leader election (typically under 1 second).
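When many streams are led by the same server, the step-downs can be scripted. The sketch below is one way to do it with nats.go, issuing requests on the JetStream API subject $JS.API.STREAM.LEADER.STEPDOWN.<stream> (the call behind the CLI command). The target server name is a placeholder, and the one-second pause between requests is just a conservative default so elections don’t pile up.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	target := "nats-server-1" // hypothetical name of the overloaded server

	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	for si := range js.StreamsInfo() {
		if si.Cluster == nil || si.Cluster.Leader != target {
			continue
		}
		// Ask the stream's Raft group to elect a new leader.
		subj := fmt.Sprintf("$JS.API.STREAM.LEADER.STEPDOWN.%s", si.Config.Name)
		if _, err := nc.Request(subj, nil, 5*time.Second); err != nil {
			log.Printf("step-down %s: %v", si.Config.Name, err)
			continue
		}
		log.Printf("stepped down leader of %s", si.Config.Name)
		time.Sleep(time.Second) // spread out the resulting elections
	}
}
```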
For streams on the overloaded server, force replica migration by editing the stream’s placement:
```go
// Go: update stream placement to target specific servers
js, err := nc.JetStream()
if err != nil {
	log.Fatal(err)
}

cfg := &nats.StreamConfig{
	Name:     "ORDERS",
	Subjects: []string{"orders.>"},
	Replicas: 3,
	Placement: &nats.Placement{
		Tags: []string{"az:us-east-1a"},
	},
}
if _, err := js.UpdateStream(cfg); err != nil {
	log.Fatal(err)
}
```

Alternatively, for streams without specific placement requirements, you can scale the replica count down and back up to force redistribution:
```bash
# Scale down then back up to trigger re-placement
nats stream edit <stream_name> --replicas 1
# Wait for convergence
nats stream edit <stream_name> --replicas 3
```

Caution: scaling to R1 temporarily removes fault tolerance. Only do this for non-critical streams, and one stream at a time.
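The same cycle can be scripted with a convergence check in between, so the stream is not left at R1 longer than necessary. The sketch below is a minimal version of that flow with nats.go; the stream name is reused from the earlier example, and the notion of convergence used here (all reported peers current) is a simplification.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// scaleReplicas updates a stream's replica count and polls until every
// reported peer is current. A simplified sketch: production use should
// add an overall timeout and handle lagging or offline peers explicitly.
func scaleReplicas(js nats.JetStreamContext, stream string, replicas int) error {
	si, err := js.StreamInfo(stream)
	if err != nil {
		return err
	}
	cfg := si.Config
	cfg.Replicas = replicas
	if _, err := js.UpdateStream(&cfg); err != nil {
		return err
	}
	for {
		si, err := js.StreamInfo(stream)
		if err != nil {
			return err
		}
		current := true
		for _, r := range si.Cluster.Replicas {
			if !r.Current {
				current = false
			}
		}
		// The leader plus len(Replicas) followers should equal the target.
		if current && len(si.Cluster.Replicas) == replicas-1 {
			return nil
		}
		time.Sleep(2 * time.Second)
	}
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Scale down to R1, then back up to R3 to trigger re-placement.
	if err := scaleReplicas(js, "ORDERS", 1); err != nil {
		log.Fatal(err)
	}
	if err := scaleReplicas(js, "ORDERS", 3); err != nil {
		log.Fatal(err)
	}
}
```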
To keep the imbalance from recurring, establish a placement tag strategy that ensures even distribution at creation time:
```
# Server configuration: tag each server with its identity
server_tags: ["group:A"]
```

Create streams with placement constraints that spread replicas across groups:
```bash
nats stream add EVENTS \
  --subjects "events.>" \
  --replicas 3 \
  --tag group:A \
  --tag group:B \
  --tag group:C
```

Build stream creation into your provisioning pipeline so every new stream gets explicit placement. Audit placement quarterly using `nats stream report` and redistribute as needed.
Adding a new server to the cluster does not automatically rebalance existing replicas: they stay where they are. New servers only receive replicas from newly created streams or explicit replica migrations. If you add a server to reduce load, you need to manually move some replicas to it by editing stream placement or cycling replica counts.
There’s no hard limit on how many replicas a single server can host, but Synadia recommends monitoring when any server exceeds 1,000 HA assets (see CLUSTER_003). At that point, Raft overhead (heartbeats, elections, log replication) becomes a measurable CPU and network cost. The imbalance check fires at 1.5× the cluster average, which catches relative imbalance regardless of absolute count.
Scaling a stream down to R1 does not lose data. It removes the extra replicas but keeps the leader and all its data intact, and scaling back to R3 creates new replicas that sync from the leader. However, while the stream is at R1 you have no fault tolerance: if the remaining server goes down, the stream is unavailable. Perform this operation during maintenance windows and one stream at a time.
Replica imbalance and leader imbalance are related but distinct. Replica imbalance (this check) means one server hosts more stream copies; leader imbalance (OPT_BALANCE_001) means one server leads more Raft groups. A server can host many replicas but lead few of them, or vice versa. Both are worth monitoring: replica imbalance affects storage and Raft participation, while leader imbalance affects write throughput and election load.
Placement tags help but don’t guarantee balance on their own. Tags constrain which servers are eligible for a stream’s replicas, but the meta leader still chooses among eligible servers. If all your streams use the same tag set, the meta leader distributes among those servers — but drift can still occur over time as streams are created and deleted. Combine tags with periodic auditing.