Even cluster size means the JetStream meta cluster has an even number of peer servers. Raft consensus requires a strict majority (quorum = floor(N/2) + 1) to elect a leader and commit operations. Going from an odd size to the next even size raises the quorum without raising the number of tolerated failures: 3 nodes tolerate 1 failure, and 4 nodes still tolerate only 1 failure. The extra server adds Raft overhead without improving resilience.
Raft quorum is floor(N/2) + 1. For a 3-node cluster, quorum is 2, so the cluster tolerates 1 failure. For a 4-node cluster, quorum is 3, so the cluster still tolerates only 1 failure. Even and odd clusters use the same quorum formula, but an even-sized cluster tolerates no more failures than the odd cluster one node smaller. That fourth server participates in every Raft operation (heartbeats, log replication, snapshot distribution) but doesn't buy you an additional tolerated failure. You're paying for the infrastructure, the network bandwidth, and the coordination overhead of an extra peer with zero improvement in fault tolerance.
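To make the arithmetic concrete, here is a small Python sketch of the quorum formula described above (illustration only, not code from the NATS codebase):

```python
# Raft quorum math: quorum = floor(N/2) + 1,
# tolerated failures = N - quorum.
def quorum(n: int) -> int:
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    return n - quorum(n)

for n in range(2, 8):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Running this shows the pattern: fault tolerance only improves at each odd size (3 tolerates 1, 5 tolerates 2, 7 tolerates 3), while each even size matches the odd size below it.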
The risk goes beyond wasted resources. In a network partition scenario, an even-numbered cluster can split exactly in half. A 4-node cluster that splits 2-2 has no majority on either side — neither partition can elect a leader, and JetStream API operations stall cluster-wide. A 3-node cluster can’t split evenly: one side always has 2 nodes and can maintain quorum. A 5-node cluster that splits 3-2 or 2-3 always has one side with a majority.
This isn’t a theoretical concern. Network partitions happen — switch failures, cable cuts, availability zone outages, cloud provider networking issues. When they do, the difference between an odd and even cluster size determines whether your JetStream cluster continues operating or halts completely. The fix is trivial (add or remove one node), but the consequences of not fixing it can be severe.
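The split-brain risk above can also be checked mechanically. This Python sketch (an illustration, not NATS code) tests whether any side of a partition retains a Raft majority:

```python
def quorum(n: int) -> int:
    return n // 2 + 1

def some_side_has_quorum(total: int, side_a: int) -> bool:
    # After a partition into side_a and side_b servers,
    # return True if at least one side can still elect a leader.
    side_b = total - side_a
    q = quorum(total)
    return side_a >= q or side_b >= q

# 4-node cluster split 2-2: neither side reaches the quorum of 3.
print(some_side_has_quorum(4, 2))   # cluster-wide stall
# 5-node cluster: every possible split leaves one side with >= 3.
print(all(some_side_has_quorum(5, a) for a in range(6)))
```

The even cluster has exactly one split (down the middle) where no side can make progress; an odd cluster has none.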
Accidental over-provisioning. A 3-node cluster is expanded to 4 nodes because “more is better” without understanding Raft quorum math. The operator adds a server expecting improved resilience, not realizing the benefit only comes at the next odd number (5).
Failed decommission leaving an extra node. A server was supposed to be removed from the cluster (hardware refresh, migration to new infrastructure) but the decommission was never completed. The old server is still a meta group peer, making the cluster even-sized.
Availability zone expansion without planning. Adding a new availability zone to a 3-node cluster by placing one server in each of 4 zones. The operator achieves geographic distribution but creates an even-numbered cluster that’s more vulnerable to partitions, not less.
Kubernetes scaling to even replica count. A JetStream StatefulSet configured with 4 replicas — perhaps because the Kubernetes default or team convention is even numbers. The deployment works but creates a suboptimal cluster topology.
Server rejoining after prolonged outage. A server that was offline for an extended period (and may have been mentally “written off”) comes back online and rejoins the meta group, incrementing the peer count to an even number.
List all JetStream-enabled servers and count them:
```
nats server list
```
The output shows each server, its JetStream status, and cluster membership. Count the servers with JetStream enabled; that count is the meta cluster size. Alternatively:
```
nats server report jetstream
```
The meta group section shows the peer count directly.
The meta group may not include all JetStream-enabled servers (for example, if a server just joined and hasn’t been added to the group yet). Check actual meta group peers:
```
nats server report jetstream
```
The meta group table lists each peer, their role (leader or follower), and their status (current or lagging). The number of rows is the meta cluster size.
If the even count is due to a decommissioned server that’s still in the peer list, it may show as offline:
```
nats server report jetstream
```
Look for peers with offline status. These may be servers that are no longer running but haven't been formally removed from the meta group.
Consider adding a server to make the meta cluster an odd size. If you’re at 2 nodes and need fault tolerance, or at 4 nodes and want to tolerate 2 failures, add a server to reach the next odd number:
Deploy and configure the new server:
```
# nats-server.conf for the new node
server_name: nats-5

jetstream {
  store_dir: /data/nats/jetstream
  max_mem: 4GB
  max_file: 100GB
}

cluster {
  name: production
  listen: 0.0.0.0:6222
  routes: [
    nats-route://nats-1:6222
    nats-route://nats-2:6222
    nats-route://nats-3:6222
    nats-route://nats-4:6222
  ]
}
```
Start the server. It automatically joins the cluster and the meta group. Verify:
```
nats server report jetstream
```
The meta group should now show the new odd number of peers.
If the extra server is unnecessary or was accidentally left in the cluster, remove it:
Put the server in lame duck mode first. This gracefully drains connections and allows Raft groups to migrate leadership before the server leaves:
```
nats-server --signal ldm=<server_name>  # SIGUSR2 puts the server in lame-duck mode
```
Wait for the server to complete its lame duck period, then shut it down. The meta group automatically adjusts its peer count. Verify:
```
nats server report jetstream
```
If the server is already offline and you need to remove it from the meta group, the remaining peers will eventually remove it after the configured peer timeout.
The right choice depends on your fault tolerance needs:
| Current size | Add → | Tolerates | Remove → | Tolerates |
|---|---|---|---|---|
| 2 | 3 | 1 failure | 1 | 0 failures (no HA) |
| 4 | 5 | 2 failures | 3 | 1 failure |
| 6 | 7 | 3 failures | 5 | 2 failures |
For most production deployments, 3 or 5 nodes is the target: 3 nodes for standard availability (tolerates 1 failure), or 5 nodes for high availability across failure domains (tolerates 2 failures, so you can lose an entire availability zone and continue).
The recommended layout for a 5-node cluster across 3 availability zones:
```
Zone A: nats-1, nats-2 (2 servers)
Zone B: nats-3, nats-4 (2 servers)
Zone C: nats-5 (1 server)
```
Losing any single zone leaves at least 3 servers, which meets the 5-node cluster's quorum of 3. This is the standard pattern for cross-AZ JetStream deployments.
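As a quick sanity check of the 2-2-1 layout, this Python sketch (hypothetical server names matching the example above) verifies that losing any one zone still leaves a quorum:

```python
# 5-node cluster spread 2-2-1 across three availability zones.
zones = {
    "A": ["nats-1", "nats-2"],
    "B": ["nats-3", "nats-4"],
    "C": ["nats-5"],
}

total = sum(len(servers) for servers in zones.values())
q = total // 2 + 1  # quorum = floor(N/2) + 1 = 3 for 5 nodes

for lost_zone, servers in zones.items():
    remaining = total - len(servers)
    status = "OK" if remaining >= q else "STALLED"
    print(f"Lose zone {lost_zone}: {remaining} servers remain (quorum {q}) -> {status}")
```

A 2-2 layout across only two zones would fail this check: losing the wrong zone leaves 2 of 4 servers, below the quorum of 3.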
Because quorum for 4 nodes is 3 (floor(4/2) + 1). If 2 nodes fail, only 2 remain — less than quorum. The cluster cannot elect a leader and all JetStream API operations stall. A 5-node cluster has quorum of 3, so losing 2 nodes leaves 3 — exactly quorum, enough to continue operating. The jump from “tolerates 1 failure” to “tolerates 2 failures” only happens at 5 nodes.
No. The meta cluster only includes JetStream-enabled servers. You can run a mix of JetStream-enabled and standard NATS servers in the same cluster. Standard NATS servers handle messaging and routing but don’t participate in Raft and don’t count toward the meta cluster size. This is a valid approach: run 3 JetStream nodes for Raft quorum plus additional standard NATS nodes for message routing capacity.
A 2-node cluster has quorum of 2, meaning it tolerates zero failures — if either node goes down, quorum is lost and JetStream operations stall. This is strictly worse than a single node for availability (a single node at least operates until it fails; a 2-node cluster fails when either node fails). Use 2-node JetStream clusters only for development or testing, never for production.
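The "strictly worse than a single node" claim can be quantified. Assuming each node is independently up with some probability (the 99% figure here is an arbitrary illustration, not a NATS measurement):

```python
# Per-node uptime probability (illustrative assumption).
p = 0.99

one_node = p                # available whenever its single node is up
two_node = p * p            # quorum of 2 requires BOTH nodes up
three_node = p**3 + 3 * p**2 * (1 - p)  # quorum of 2 of 3: all up, or exactly one down

print(f"1 node:  {one_node:.4%} available")
print(f"2 nodes: {two_node:.4%} available")
print(f"3 nodes: {three_node:.4%} available")
```

At 99% per-node uptime, the 2-node cluster is available about 98.01% of the time, below the single node's 99%, while the 3-node cluster climbs to roughly 99.97%.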
It can. If you’re upgrading a 3-node cluster by temporarily adding a 4th node before removing the old one, the cluster briefly has 4 peers. If a collection epoch falls within this window, the check fires. This is a false positive in the context of a planned upgrade — the alert resolves once the old server is decommissioned and the cluster returns to 3 peers.
Synadia Insights allows configuring check thresholds. You can temporarily adjust the check to account for planned topology changes. However, the better approach is to complete the expansion quickly — add the new server and remove the old one in the same maintenance window so the cluster spends minimal time at an even count.