Even cluster size means the JetStream meta cluster has an even number of peer servers. Raft consensus requires a strict majority (quorum = floor(N/2) + 1) to elect a leader and commit operations. Going from an odd size to the next even size raises the quorum without raising the number of tolerated failures: 3 nodes tolerate 1 failure, and 4 nodes still tolerate only 1 failure. The extra server adds Raft overhead without improving resilience.
Raft quorum is floor(N/2) + 1. For a 3-node cluster, quorum is 2, so the cluster tolerates 1 failure. For a 4-node cluster, quorum is 3, so the cluster still tolerates only 1 failure. Even and odd clusters use the same quorum formula, but an even-sized cluster tolerates no more failures than the odd cluster one node smaller. That fourth server participates in every Raft operation (heartbeats, log replication, snapshot distribution) but doesn't buy you an additional tolerated failure. You're paying for the infrastructure, the network bandwidth, and the coordination overhead of an extra peer with zero improvement in fault tolerance.
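To make the arithmetic concrete, here is a small Python sketch of the quorum formula described above (illustration only, not code from the NATS codebase):

```python
# Raft quorum math: quorum = floor(N/2) + 1,
# tolerated failures = N - quorum.
def quorum(n: int) -> int:
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    return n - quorum(n)

for n in range(2, 8):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Running this shows the pattern: fault tolerance only improves at each odd size (3 tolerates 1, 5 tolerates 2, 7 tolerates 3), while each even size matches the odd size below it.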
The risk goes beyond wasted resources. In a network partition scenario, an even-numbered cluster can split exactly in half. A 4-node cluster that splits 2-2 has no majority on either side — neither partition can elect a leader, and JetStream API operations stall cluster-wide. A 3-node cluster can’t split evenly: one side always has 2 nodes and can maintain quorum. A 5-node cluster that splits 3-2 or 2-3 always has one side with a majority.
This isn’t a theoretical concern. Network partitions happen — switch failures, cable cuts, availability zone outages, cloud provider networking issues. When they do, the difference between an odd and even cluster size determines whether your JetStream cluster continues operating or halts completely. The fix is trivial (add or remove one node), but the consequences of not fixing it can be severe.
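The split-brain risk above can also be checked mechanically. This Python sketch (an illustration, not NATS code) tests whether any side of a partition retains a Raft majority:

```python
def quorum(n: int) -> int:
    return n // 2 + 1

def some_side_has_quorum(total: int, side_a: int) -> bool:
    # After a partition into side_a and side_b servers,
    # return True if at least one side can still elect a leader.
    side_b = total - side_a
    q = quorum(total)
    return side_a >= q or side_b >= q

# 4-node cluster split 2-2: neither side reaches the quorum of 3.
print(some_side_has_quorum(4, 2))   # cluster-wide stall
# 5-node cluster: every possible split leaves one side with >= 3.
print(all(some_side_has_quorum(5, a) for a in range(6)))
```

The even cluster has exactly one split (down the middle) where no side can make progress; an odd cluster has none.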
Accidental over-provisioning. A 3-node cluster is expanded to 4 nodes because “more is better” without understanding Raft quorum math. The operator adds a server expecting improved resilience, not realizing the benefit only comes at the next odd number (5).
Failed decommission leaving an extra node. A server was supposed to be removed from the cluster (hardware refresh, migration to new infrastructure) but the decommission was never completed. The old server is still a meta group peer, making the cluster even-sized.
Availability zone expansion without planning. Adding a new availability zone to a 3-node cluster by placing one server in each of 4 zones. The operator achieves geographic distribution but creates an even-numbered cluster that’s more vulnerable to partitions, not less.
Kubernetes scaling to even replica count. A JetStream StatefulSet configured with 4 replicas — perhaps because the Kubernetes default or team convention is even numbers. The deployment works but creates a suboptimal cluster topology.
Server rejoining after prolonged outage. A server that was offline for an extended period (and may have been mentally “written off”) comes back online and rejoins the meta group, incrementing the peer count to an even number.
List all JetStream-enabled servers and count them:
```
nats server list
```
The output shows each server, its JetStream status, and cluster membership. Count the servers with JetStream enabled; that count is the meta cluster size. Alternatively:
```
nats server report jetstream
```
The meta group section shows the peer count directly.
The meta group may not include all JetStream-enabled servers (for example, if a server just joined and hasn’t been added to the group yet). Check actual meta group peers:
```
nats server report jetstream
```
The meta group table lists each peer, their role (leader or follower), and their status (current or lagging). The number of rows is the meta cluster size.
If the even count is due to a decommissioned server that’s still in the peer list, it may show as offline:
```
nats server report jetstream
```
Look for peers with offline status. These may be servers that are no longer running but haven't been formally removed from the meta group.
Consider adding a server to make the meta cluster an odd size. If you’re at 2 nodes and need fault tolerance, or at 4 nodes and want to tolerate 2 failures, add a server to reach the next odd number:
Deploy and configure the new server:
```
# nats-server.conf for the new node
server_name: nats-5

jetstream {
  store_dir: /data/nats/jetstream
  max_mem: 4GB
  max_file: 100GB
}

cluster {
  name: production
  listen: 0.0.0.0:6222
  routes: [
    nats-route://nats-1:6222
    nats-route://nats-2:6222
    nats-route://nats-3:6222
    nats-route://nats-4:6222
  ]
}
```
Start the server. It automatically joins the cluster and the meta group. Verify:
```
nats server report jetstream
```
The meta group should now show the new odd number of peers.
If the extra server is unnecessary or was accidentally left in the cluster, remove it:
Put the server in lame duck mode first. This gracefully drains connections and allows Raft groups to migrate leadership before the server leaves:
```
nats-server --signal ldm=<server_name>  # SIGUSR2 puts the server in lame-duck mode
```
Wait for the server to complete its lame duck period, then shut it down. The meta group automatically adjusts its peer count. Verify:
```
nats server report jetstream
```
If the server is already offline and you need to remove it from the meta group, the remaining peers will eventually remove it after the configured peer timeout.
The right choice depends on your fault tolerance needs:
| Current size | Add → | Tolerates | Remove → | Tolerates |
|---|---|---|---|---|
| 2 | 3 | 1 failure | 1 | 0 failures (no HA) |
| 4 | 5 | 2 failures | 3 | 1 failure |
| 6 | 7 | 3 failures | 5 | 2 failures |
For most production deployments, 3 or 5 nodes is the target: 3 nodes for standard availability (tolerates 1 failure), or 5 nodes for high availability across failure domains (tolerates 2 failures, so you can lose an entire availability zone and continue).
The recommended layout for a 5-node cluster across 3 availability zones:
```
Zone A: nats-1, nats-2 (2 servers)
Zone B: nats-3, nats-4 (2 servers)
Zone C: nats-5 (1 server)
```
Losing any single zone leaves at least 3 servers, which meets the 5-node cluster's quorum of 3. This is the standard pattern for cross-AZ JetStream deployments.
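As a quick sanity check of the 2-2-1 layout, this Python sketch (hypothetical server names matching the example above) verifies that losing any one zone still leaves a quorum:

```python
# 5-node cluster spread 2-2-1 across three availability zones.
zones = {
    "A": ["nats-1", "nats-2"],
    "B": ["nats-3", "nats-4"],
    "C": ["nats-5"],
}

total = sum(len(servers) for servers in zones.values())
q = total // 2 + 1  # quorum = floor(N/2) + 1 = 3 for 5 nodes

for lost_zone, servers in zones.items():
    remaining = total - len(servers)
    status = "OK" if remaining >= q else "STALLED"
    print(f"Lose zone {lost_zone}: {remaining} servers remain (quorum {q}) -> {status}")
```

A 2-2 layout across only two zones would fail this check: losing the wrong zone leaves 2 of 4 servers, below the quorum of 3.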
Because quorum for 4 nodes is 3 (floor(4/2) + 1). If 2 nodes fail, only 2 remain — less than quorum. The cluster cannot elect a leader and all JetStream API operations stall. A 5-node cluster has quorum of 3, so losing 2 nodes leaves 3 — exactly quorum, enough to continue operating. The jump from “tolerates 1 failure” to “tolerates 2 failures” only happens at 5 nodes.
No. The meta cluster only includes JetStream-enabled servers. You can run a mix of JetStream-enabled and standard NATS servers in the same cluster. Standard NATS servers handle messaging and routing but don’t participate in Raft and don’t count toward the meta cluster size. This is a valid approach: run 3 JetStream nodes for Raft quorum plus additional standard NATS nodes for message routing capacity.
A 2-node cluster has quorum of 2, meaning it tolerates zero failures — if either node goes down, quorum is lost and JetStream operations stall. This is strictly worse than a single node for availability (a single node at least operates until it fails; a 2-node cluster fails when either node fails). Use 2-node JetStream clusters only for development or testing, never for production.
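The "strictly worse than a single node" claim can be quantified. Assuming each node is independently up with some probability (the 99% figure here is an arbitrary illustration, not a NATS measurement):

```python
# Per-node uptime probability (illustrative assumption).
p = 0.99

one_node = p                # available whenever its single node is up
two_node = p * p            # quorum of 2 requires BOTH nodes up
three_node = p**3 + 3 * p**2 * (1 - p)  # quorum of 2 of 3: all up, or exactly one down

print(f"1 node:  {one_node:.4%} available")
print(f"2 nodes: {two_node:.4%} available")
print(f"3 nodes: {three_node:.4%} available")
```

At 99% per-node uptime, the 2-node cluster is available about 98.01% of the time, below the single node's 99%, while the 3-node cluster climbs to roughly 99.97%.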
It can. If you’re upgrading a 3-node cluster by temporarily adding a 4th node before removing the old one, the cluster briefly has 4 peers. If a collection epoch falls within this window, the check fires. This is a false positive in the context of a planned upgrade — the alert resolves once the old server is decommissioned and the cluster returns to 3 peers.
Synadia Insights allows configuring check thresholds. You can temporarily adjust the check to account for planned topology changes. However, the better approach is to complete the expansion quickly — add the new server and remove the old one in the same maintenance window so the cluster spends minimal time at an even count.