A meta cluster size decrease means the NATS meta group — the Raft cluster that manages JetStream stream and consumer assignments — has fewer peers than it did at the previous collection interval. A peer was either intentionally removed via nats server cluster peer-remove or lost due to a server crash, network partition, or infrastructure failure. Either way, the cluster’s fault tolerance has been reduced.
The meta cluster is the control plane for all of JetStream. It decides which servers host which stream replicas, processes stream and consumer creation requests, and coordinates leader elections. The meta cluster’s health directly determines whether JetStream operations succeed or fail.
Fault tolerance shrinks immediately. A 5-peer meta cluster tolerates 2 simultaneous failures. Lose one peer and you’re down to tolerating 1. Lose another and quorum is at risk. Every peer lost without replacement narrows the safety margin. If you’re already running a 3-peer cluster (the minimum for HA), losing one peer means a single additional failure will cause a complete JetStream outage.
Quorum risk increases. The Raft consensus algorithm requires a strict majority of peers to be available. A 5-peer cluster needs 3; a 3-peer cluster needs 2. When the cluster size decreases, the quorum requirement doesn’t change until the peer is formally removed. If the peer crashed but wasn’t removed from Raft, the cluster still counts it for quorum purposes — meaning a “ghost” peer is consuming a quorum slot while contributing nothing.
Stream placement becomes constrained. JetStream distributes stream replicas across available peers. Fewer peers means fewer placement options, which leads to higher per-server storage and CPU load. Streams configured with R3 replication on a cluster that just dropped to 2 peers cannot place new replicas at all.
An unplanned decrease signals infrastructure failure. If nobody ran peer-remove, the decrease indicates a server died, lost network connectivity, or was terminated by infrastructure automation. The underlying cause needs investigation — it could recur, taking another peer with it.
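To make the quorum arithmetic above concrete, here is a minimal Go sketch (illustrative only, not code from the NATS server) of the Raft majority rule:

```go
package main

import "fmt"

// quorum returns the strict majority a Raft group of size n needs
// to elect a leader and commit entries.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many peers can fail before quorum is lost.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		fmt.Printf("%d peers: quorum=%d, tolerates %d failure(s)\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Note that 4 peers tolerate the same single failure as 3, which is why the guidance later on this page recommends odd cluster sizes.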
Planned server decommission. An operator intentionally removed a peer using nats server cluster peer-remove as part of a maintenance or scaling operation. This is expected and healthy — but only if the remaining cluster has an odd number of peers and sufficient capacity.
Server crash or OOM kill. The NATS server process terminated unexpectedly due to a crash, out-of-memory kill, or unhandled panic. The meta cluster detects the peer as unresponsive after the Raft election timeout and eventually removes it from the peer list.
Infrastructure termination. Cloud autoscalers, spot instance reclamation, or Kubernetes pod eviction terminated the server’s host without a graceful NATS shutdown. The server disappears from the cluster without draining its JetStream assets.
Network partition. A network failure isolates one or more peers from the rest of the cluster. The reachable peers see the partitioned peer as lost. If the partition persists long enough, the peer may be removed from the Raft group.
Disk failure or corruption. The server’s JetStream storage becomes unavailable or corrupted. The server may shut itself down or become unable to participate in Raft consensus, effectively removing itself from the meta cluster.
Stale routes after a topology change. If the cluster.routes list on each server points to a peer that no longer exists (decommissioned VM, retired DNS name) and the surviving peers can’t gossip past it, the meta group may lose track of expected membership during the next election. NATS does not have a --cluster-size or peer_expect setting; cluster size is dictated by the meta group’s actual member count.
Start by checking the meta cluster state with the NATS CLI:

```
# Show meta cluster peers, leaders, and lag
nats server report jetstream
```
```
# Detailed peer view (online/offline, current/lagging, leader)
nats server list
```

Look for the total peer count and compare it to your expected cluster size. Note any peers marked as offline or with high lag.
Check the NATS server logs on the meta leader for peer removal events:
```
# Search logs for peer removal
grep -i "peer remove\|removed peer\|meta cluster" /var/log/nats/nats-server.log
```

An intentional removal via nats server cluster peer-remove will show a clear administrative action. An unplanned loss will show timeout-based detection:
```
[WRN] JetStream meta peer "server-3" is not current (offline 45s)
[INF] JetStream meta group peer removed: "server-3"
```

Next, confirm whether the missing server is still reachable:

```
# List all known servers
nats server list

# Ping a specific server
nats server ping server-3

# Check server info
nats server info server-3
```

If the server doesn’t respond, investigate at the infrastructure level — check VM/container status, network reachability, and system logs.
```
# Check if the cluster has an odd number of peers
nats server report jetstream --json | jq '.meta.cluster_size'
```

If the cluster is now at an even number of peers, you’ve lost optimal quorum efficiency. A 4-peer cluster requires 3 peers for quorum, so it tolerates only 1 failure, the same as a 3-peer cluster; the fourth peer adds no fault tolerance.
You can also inspect peer counts and stream placement programmatically with the Go client:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	// Fetch account info which includes meta cluster details
	info, err := js.AccountInfo(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("JetStream Peers: %d\n", info.Tiers[""].Peers)

	// List all streams to check placement health
	lister := js.ListStreams(context.Background())
	for si := range lister.Info() {
		if si.Cluster != nil {
			fmt.Printf("Stream %s: leader=%s replicas=%d\n",
				si.Config.Name, si.Cluster.Leader, len(si.Cluster.Replicas))
		}
	}
	if err := lister.Err(); err != nil {
		log.Fatal(err)
	}
}
```

Determine if quorum is intact. If the meta cluster still has a majority of its original peers online, JetStream operations continue normally. If quorum is lost, see META_001 (Meta Quorum Lost) for emergency recovery.
Check stream replica health. A lost peer may have been hosting stream replicas. Verify that all streams still have their configured replication factor:
```
nats stream report
```

Streams showing fewer replicas than configured will need attention once the cluster is stabilized.
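If you prefer to check replica health programmatically, the following Go sketch builds on the client example above. It assumes Cluster.Replicas lists the follower peers only, so the leader counts as one additional copy; verify that accounting against your nats stream report output:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	lister := js.ListStreams(context.Background())
	for si := range lister.Info() {
		if si.Cluster == nil {
			continue // single-server setups have no cluster info
		}
		// Assumption: Cluster.Replicas excludes the leader, so the
		// leader accounts for one more copy of the data.
		got := len(si.Cluster.Replicas) + 1
		if want := si.Config.Replicas; got < want {
			fmt.Printf("UNDER-REPLICATED %s: have %d of %d replicas\n",
				si.Config.Name, got, want)
		}
	}
	if err := lister.Err(); err != nil {
		log.Fatal(err)
	}
}
```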
If the peer loss was unplanned — bring the server back. Fix the underlying issue (restart the process, restore the VM, fix the disk) and bring the server back online. The meta cluster will recognize the returning peer and sync it:
```
# Verify the server rejoined
nats server list
nats server report jetstream
```

If the peer cannot be recovered — formally remove it. Don’t leave a dead peer in the Raft group. Remove it so quorum calculations reflect reality:
```
nats server cluster peer-remove server-3
```

Add a replacement peer. Start a new NATS server with the same cluster configuration. It will join the meta cluster automatically:
```
nats-server -c /etc/nats/nats-server.conf
```

Verify it joined:
```
nats server report jetstream
```

Ensure odd cluster sizes. Always run 3, 5, or 7 meta cluster peers. Even numbers waste a server without improving fault tolerance.
Keep cluster routes consistent across all servers. NATS does not expose a peer_expect/--cluster-size setting — cluster membership is whatever the meta group sees alive. Make sure every server lists the same cluster.routes, and update those routes whenever a peer is replaced or renamed:
```
jetstream {
  store_dir: /data/jetstream
}

cluster {
  name: my-cluster
  routes: [
    nats-route://server-1:6222
    nats-route://server-2:6222
    nats-route://server-3:6222
  ]
}
```

Implement health monitoring. Use Synadia Insights to automatically detect meta cluster size changes across all your NATS deployments. Alert on any decrease so the team can respond before a second failure puts quorum at risk.
Use graceful shutdown procedures. Before decommissioning a server, drain its JetStream assets and formally remove it from the cluster. This ensures streams are migrated before the peer disappears.
It depends on the new size. Going from 5 to 4 peers means you still tolerate 1 failure (quorum needs 3 of 4). Going from 3 to 2 means zero tolerance for additional failures — if one more peer goes down, JetStream stops. Treat any decrease as urgent and restore the expected cluster size as soon as possible.
The meta cluster will detect under-replicated streams and attempt to place new replicas on remaining peers — if there are enough peers available and they have sufficient storage capacity. Streams configured with R3 replication need at least 3 peers; if you’re down to 2, the stream continues operating with reduced redundancy but cannot fully re-replicate until a third peer is available.
Check nats server report jetstream. A temporarily unreachable peer still appears in the peer list but shows as offline with increasing lag. A formally removed peer (via peer-remove) disappears from the list entirely. If a peer disappeared from the list without an explicit removal, the meta Raft group determined it was gone long enough to evict.
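To script this check, you can ask every reachable server to identify itself over the system-account discovery subject. A hedged Go sketch, assuming sys.creds holds system-account credentials (a hypothetical path for this example):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// The $SYS discovery API requires a system-account connection;
	// sys.creds is an assumed credentials file for this sketch.
	nc, err := nats.Connect(nats.DefaultURL, nats.UserCredentials("sys.creds"))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	inbox := nats.NewInbox()
	sub, err := nc.SubscribeSync(inbox)
	if err != nil {
		log.Fatal(err)
	}

	// Every reachable server answers a ping on $SYS.REQ.SERVER.PING.
	if err := nc.PublishRequest("$SYS.REQ.SERVER.PING", inbox, nil); err != nil {
		log.Fatal(err)
	}

	// Collect replies until a quiet period; a server that never answers
	// is either down or no longer part of the cluster.
	count := 0
	for {
		msg, err := sub.NextMsg(2 * time.Second)
		if err != nil {
			break // timeout: assume all replies received
		}
		count++
		fmt.Printf("reply %d: %s\n", count, string(msg.Data))
	}
	fmt.Printf("%d server(s) responded; compare against your expected peer count\n", count)
}
```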
Yes. A flapping peer causes repeated leader elections and disrupts meta cluster stability. Remove it, fix the underlying issue (usually disk I/O, resource constraints, or network instability), and then rejoin it as a clean peer. A stable cluster with fewer peers is better than an unstable one with the “right” number.
The detection time depends on Raft configuration, but typically ranges from 2 to 10 seconds after the last successful heartbeat. The peer is marked as not current first, and if it remains unreachable through several election cycles, it may be formally removed. Synadia Insights detects the cluster size change at the next collection epoch.