A meta cluster size decrease means the NATS meta group — the Raft cluster that manages JetStream stream and consumer assignments — has fewer peers than it did at the previous collection interval. A peer was either intentionally removed via nats server cluster peer-remove or lost due to a server crash, network partition, or infrastructure failure. Either way, the cluster’s fault tolerance has been reduced.
The meta cluster is the control plane for all of JetStream. It decides which servers host which stream replicas, processes stream and consumer creation requests, and coordinates leader elections. The meta cluster’s health directly determines whether JetStream operations succeed or fail.
Fault tolerance shrinks immediately. A 5-peer meta cluster tolerates 2 simultaneous failures. Lose one peer and you’re down to tolerating 1. Lose another and quorum is at risk. Every peer lost without replacement narrows the safety margin. If you’re already running a 3-peer cluster (the minimum for HA), losing one peer means a single additional failure will cause a complete JetStream outage.
Quorum risk increases. The Raft consensus algorithm requires a strict majority of peers to be available. A 5-peer cluster needs 3; a 3-peer cluster needs 2. When the cluster size decreases, the quorum requirement doesn’t change until the peer is formally removed. If the peer crashed but wasn’t removed from Raft, the cluster still counts it for quorum purposes — meaning a “ghost” peer is consuming a quorum slot while contributing nothing.
Stream placement becomes constrained. JetStream distributes stream replicas across available peers. Fewer peers means fewer placement options, which leads to higher per-server storage and CPU load. Streams configured with R3 replication on a cluster that just dropped to 2 peers cannot place new replicas at all.
An unplanned decrease signals infrastructure failure. If nobody ran peer-remove, the decrease indicates a server died, lost network connectivity, or was terminated by infrastructure automation. The underlying cause needs investigation — it could recur, taking another peer with it.
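To make the quorum arithmetic above concrete, here is a minimal Go sketch (illustrative only, not code from the NATS server) of the Raft majority rule:

```go
package main

import "fmt"

// quorum returns the strict majority a Raft group of size n needs
// to elect a leader and commit entries.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many peers can fail before quorum is lost.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		fmt.Printf("%d peers: quorum=%d, tolerates %d failure(s)\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Note that 4 peers tolerate the same single failure as 3, which is why the guidance later on this page recommends odd cluster sizes.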
Planned server decommission. An operator intentionally removed a peer using nats server cluster peer-remove as part of a maintenance or scaling operation. This is expected and healthy — but only if the remaining cluster has an odd number of peers and sufficient capacity.
Server crash or OOM kill. The NATS server process terminated unexpectedly due to a crash, out-of-memory kill, or unhandled panic. The meta cluster detects the peer as unresponsive after the Raft election timeout and eventually removes it from the peer list.
Infrastructure termination. Cloud autoscalers, spot instance reclamation, or Kubernetes pod eviction terminated the server’s host without a graceful NATS shutdown. The server disappears from the cluster without draining its JetStream assets.
Network partition. A network failure isolates one or more peers from the rest of the cluster. The reachable peers see the partitioned peer as lost. If the partition persists long enough, the peer may be removed from the Raft group.
Disk failure or corruption. The server’s JetStream storage becomes unavailable or corrupted. The server may shut itself down or become unable to participate in Raft consensus, effectively removing itself from the meta cluster.
Stale routes after a topology change. If the cluster.routes list on each server points to a peer that no longer exists (decommissioned VM, retired DNS name) and the surviving peers can’t gossip past it, the meta group may lose track of expected membership during the next election. NATS does not have a --cluster-size or peer_expect setting; cluster size is dictated by the meta group’s actual member count.
Start by checking the meta cluster state with the NATS CLI:

```
# Show meta cluster peers, leaders, and lag
nats server report jetstream
```
```
# Detailed peer view (online/offline, current/lagging, leader)
nats server list
```

Look for the total peer count and compare it to your expected cluster size. Note any peers marked as offline or with high lag.
Check the NATS server logs on the meta leader for peer removal events:
```
# Search logs for peer removal
grep -i "peer remove\|removed peer\|meta cluster" /var/log/nats/nats-server.log
```

An intentional removal via nats server cluster peer-remove will show a clear administrative action. An unplanned loss will show timeout-based detection:
```
[WRN] JetStream meta peer "server-3" is not current (offline 45s)
[INF] JetStream meta group peer removed: "server-3"
```

Next, confirm whether the missing server is still reachable:

```
# List all known servers
nats server list

# Ping a specific server
nats server ping server-3

# Check server info
nats server info server-3
```

If the server doesn’t respond, investigate at the infrastructure level — check VM/container status, network reachability, and system logs.
```
# Check if the cluster has an odd number of peers
nats server report jetstream --json | jq '.meta.cluster_size'
```

If the cluster is now at an even number of peers, you’ve lost optimal quorum efficiency. A 4-peer cluster requires 3 peers for quorum, so it tolerates only 1 failure, the same as a 3-peer cluster; the fourth peer adds no fault tolerance.
You can also inspect peer counts and stream placement programmatically with the Go client:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	// Fetch account info which includes meta cluster details
	info, err := js.AccountInfo(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("JetStream Peers: %d\n", info.Tiers[""].Peers)

	// List all streams to check placement health
	lister := js.ListStreams(context.Background())
	for si := range lister.Info() {
		if si.Cluster != nil {
			fmt.Printf("Stream %s: leader=%s replicas=%d\n",
				si.Config.Name, si.Cluster.Leader, len(si.Cluster.Replicas))
		}
	}
	if err := lister.Err(); err != nil {
		log.Fatal(err)
	}
}
```

Determine if quorum is intact. If the meta cluster still has a majority of its original peers online, JetStream operations continue normally. If quorum is lost, see META_001 (Meta Quorum Lost) for emergency recovery.
Check stream replica health. A lost peer may have been hosting stream replicas. Verify that all streams still have their configured replication factor:
```
nats stream report
```

Streams showing fewer replicas than configured will need attention once the cluster is stabilized.
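If you prefer to check replica health programmatically, the following Go sketch builds on the client example above. It assumes Cluster.Replicas lists the follower peers only, so the leader counts as one additional copy; verify that accounting against your nats stream report output:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	lister := js.ListStreams(context.Background())
	for si := range lister.Info() {
		if si.Cluster == nil {
			continue // single-server setups have no cluster info
		}
		// Assumption: Cluster.Replicas excludes the leader, so the
		// leader accounts for one more copy of the data.
		got := len(si.Cluster.Replicas) + 1
		if want := si.Config.Replicas; got < want {
			fmt.Printf("UNDER-REPLICATED %s: have %d of %d replicas\n",
				si.Config.Name, got, want)
		}
	}
	if err := lister.Err(); err != nil {
		log.Fatal(err)
	}
}
```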
If the peer loss was unplanned — bring the server back. Fix the underlying issue (restart the process, restore the VM, fix the disk) and bring the server back online. The meta cluster will recognize the returning peer and sync it:
```
# Verify the server rejoined
nats server list
nats server report jetstream
```

If the peer cannot be recovered — formally remove it. Don’t leave a dead peer in the Raft group. Remove it so quorum calculations reflect reality:
```
nats server cluster peer-remove server-3
```

Add a replacement peer. Start a new NATS server with the same cluster configuration. It will join the meta cluster automatically:
```
nats-server -c /etc/nats/nats-server.conf
```

Verify it joined:
```
nats server report jetstream
```

Ensure odd cluster sizes. Always run 3, 5, or 7 meta cluster peers. Even numbers waste a server without improving fault tolerance.
Keep cluster routes consistent across all servers. NATS does not expose a peer_expect/--cluster-size setting — cluster membership is whatever the meta group sees alive. Make sure every server lists the same cluster.routes, and update those routes whenever a peer is replaced or renamed:
```
jetstream {
  store_dir: /data/jetstream
}

cluster {
  name: my-cluster
  routes: [
    nats-route://server-1:6222
    nats-route://server-2:6222
    nats-route://server-3:6222
  ]
}
```

Implement health monitoring. Use Synadia Insights to automatically detect meta cluster size changes across all your NATS deployments. Alert on any decrease so the team can respond before a second failure puts quorum at risk.
Use graceful shutdown procedures. Before decommissioning a server, drain its JetStream assets and formally remove it from the cluster. This ensures streams are migrated before the peer disappears.
It depends on the new size. Going from 5 to 4 peers means you still tolerate 1 failure (quorum needs 3 of 4). Going from 3 to 2 means zero tolerance for additional failures — if one more peer goes down, JetStream stops. Treat any decrease as urgent and restore the expected cluster size as soon as possible.
The meta cluster will detect under-replicated streams and attempt to place new replicas on remaining peers — if there are enough peers available and they have sufficient storage capacity. Streams configured with R3 replication need at least 3 peers; if you’re down to 2, the stream continues operating with reduced redundancy but cannot fully re-replicate until a third peer is available.
Check nats server report jetstream. A temporarily unreachable peer still appears in the peer list but shows as offline with increasing lag. A formally removed peer (via peer-remove) disappears from the list entirely. If a peer disappeared from the list without an explicit removal, the meta Raft group determined it was gone long enough to evict.
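To script this check, you can ask every reachable server to identify itself over the system-account discovery subject. A hedged Go sketch, assuming sys.creds holds system-account credentials (a hypothetical path for this example):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// The $SYS discovery API requires a system-account connection;
	// sys.creds is an assumed credentials file for this sketch.
	nc, err := nats.Connect(nats.DefaultURL, nats.UserCredentials("sys.creds"))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	inbox := nats.NewInbox()
	sub, err := nc.SubscribeSync(inbox)
	if err != nil {
		log.Fatal(err)
	}

	// Every reachable server answers a ping on $SYS.REQ.SERVER.PING.
	if err := nc.PublishRequest("$SYS.REQ.SERVER.PING", inbox, nil); err != nil {
		log.Fatal(err)
	}

	// Collect replies until a quiet period; a server that never answers
	// is either down or no longer part of the cluster.
	count := 0
	for {
		msg, err := sub.NextMsg(2 * time.Second)
		if err != nil {
			break // timeout: assume all replies received
		}
		count++
		fmt.Printf("reply %d: %s\n", count, string(msg.Data))
	}
	fmt.Printf("%d server(s) responded; compare against your expected peer count\n", count)
}
```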
Yes. A flapping peer causes repeated leader elections and disrupts meta cluster stability. Remove it, fix the underlying issue (usually disk I/O, resource constraints, or network instability), and then rejoin it as a clean peer. A stable cluster with fewer peers is better than an unstable one with the “right” number.
The detection time depends on Raft configuration, but typically ranges from 2 to 10 seconds after the last successful heartbeat. The peer is marked as not current first, and if it remains unreachable through several election cycles, it may be formally removed. Synadia Insights detects the cluster size change at the next collection epoch.