Every replicated JetStream stream has an expected number of peers — the leader plus its replicas. When the actual peer count reported by the Raft group does not match the operator-defined expectation set via the io.nats.monitor.peer-expect metadata tag, this check fires. A peer-count mismatch means either a replica is missing (the stream is under-replicated and at risk of data loss) or an extra peer has appeared (indicating a cluster state anomaly).
JetStream uses Raft consensus for replicated streams. A stream configured with num_replicas: 3 should have exactly three peers in its Raft group — one leader and two followers. The peer count directly determines the stream’s fault tolerance. With three peers, the stream can survive one server failure without data loss or write disruption. With two peers (one missing), a single additional failure causes quorum loss — the stream becomes read-only, and no new messages can be published.
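To ground the terms, here is a minimal sketch of creating such a stream with the Go client, setting the replica count and the monitoring tag together. The stream name ORDERS, its subject, and the connection URL are illustrative, and stream metadata requires a 2.10-era server and client.

```go
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Three replicas: one leader plus two followers in the stream's Raft group.
    // The metadata tag records the operator's expectation for monitoring.
    _, err = js.AddStream(&nats.StreamConfig{
        Name:     "ORDERS",
        Subjects: []string{"orders.>"},
        Replicas: 3,
        Metadata: map[string]string{
            "io.nats.monitor.peer-expect": "3",
        },
    })
    if err != nil {
        log.Fatal(err)
    }
}
```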
A peer-count mismatch is one of the earliest signals that a stream’s high-availability guarantee is degraded. Unlike quorum loss (which is immediately visible because writes fail), a missing replica often goes unnoticed. The stream continues to accept writes and serve reads. Operators assume three-way replication is protecting their data, while in reality the stream is running with reduced redundancy. The problem only becomes apparent when a second failure occurs — at which point it’s too late.
Extra peers are rarer but equally concerning. They typically indicate a Raft group state anomaly, possibly caused by a server rejoining the cluster after an extended outage with stale state, or a bug in peer management. Extra peers can cause increased network traffic (more replication targets) and confusion in leader election behavior.
The peer-expect check makes the gap between “configured replicas” and “actual peers” explicitly visible. It is especially valuable in large clusters where dozens or hundreds of streams make manual verification impractical.
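A periodic sweep can make that verification automatic. The sketch below is one way to do it with the nats.go client (not the check's actual implementation); it assumes an established JetStream context plus the log and strconv imports, walks every stream visible to the connection, and logs any stream whose live peer count disagrees with its io.nats.monitor.peer-expect tag.

```go
// checkPeerExpect compares each stream's live Raft peer count against the
// operator-defined io.nats.monitor.peer-expect metadata tag.
func checkPeerExpect(js nats.JetStreamContext) {
    for info := range js.Streams() {
        expect, ok := info.Config.Metadata["io.nats.monitor.peer-expect"]
        if !ok {
            continue // stream is not enrolled in this check
        }
        expected, err := strconv.Atoi(expect)
        if err != nil {
            log.Printf("%s: unparsable peer-expect value %q", info.Config.Name, expect)
            continue
        }

        actual := 1 // the leader is not listed in Replicas
        if info.Cluster != nil {
            actual += len(info.Cluster.Replicas)
        }
        if actual != expected {
            log.Printf("ALERT: %s has %d peers, expected %d",
                info.Config.Name, actual, expected)
        }
    }
}
```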
Server permanently removed from cluster. A server that hosted a replica was decommissioned or permanently failed without the stream being reconfigured. The Raft group still expects the old peer but it never returns. The stream is under-replicated indefinitely.
Server offline or partitioned. A server hosting a replica is temporarily unreachable — hardware failure, network partition, or scheduled maintenance. While the server is down, the peer count drops. If maintenance windows are long, this can persist for hours or days.
Raft group not rebalanced after topology change. After adding or removing servers from the cluster, JetStream streams need to be redistributed. If nats server report jetstream shows unevenly placed replicas, some streams may have peers on servers that are no longer appropriate.
Server rejoined with stale state. A server that was offline for an extended period rejoins the cluster. If its JetStream data was not cleaned up, it may attempt to rejoin Raft groups with stale state, temporarily creating extra peers before the group stabilizes.
Stream created with fewer replicas than intended. The stream was created with num_replicas: 1 (no replication) but the operator expected replication. This is a configuration error rather than a runtime issue, but the peer-expect check catches it if the metadata tag is set to the intended value.
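That last cause can be caught from configuration alone, before any runtime comparison: if a stream's configured replica count already disagrees with its peer-expect tag, it can never reach the expected peer count. A small sketch, assuming an established JetStream context and the nats.go client:

```go
// flagMisconfigured reports streams whose configured num_replicas already
// disagrees with their io.nats.monitor.peer-expect metadata tag.
func flagMisconfigured(js nats.JetStreamContext) {
    for info := range js.Streams() {
        expect, ok := info.Config.Metadata["io.nats.monitor.peer-expect"]
        if !ok {
            continue
        }
        if strconv.Itoa(info.Config.Replicas) != expect {
            log.Printf("CONFIG: %s has num_replicas=%d but peer-expect=%s",
                info.Config.Name, info.Config.Replicas, expect)
        }
    }
}
```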
Start by counting the stream's current peers:

```
nats stream info MY_STREAM --json | jq '.cluster.replicas | length + 1'
```

The +1 accounts for the leader, which is not included in the replicas array. Compare this total against your io.nats.monitor.peer-expect threshold.
Then inspect the replica details:

```
nats stream info MY_STREAM
```

The cluster section shows the leader and each replica with its current state, lag, and last active time. Look for replicas marked as offline or with unusually high lag.
Check the expected value recorded in the stream's metadata:

```
nats stream info MY_STREAM --json | jq '.config.metadata["io.nats.monitor.peer-expect"]'
```

Confirm all servers in the cluster are online and part of the JetStream meta group:
```
nats server report jetstream
```

Look for servers that are missing, marked as offline, or not participating in JetStream. Cross-reference with the stream's replica placement.
```
# List all servers in the cluster
nats server ls

# Compare with the stream's peers
nats stream info MY_STREAM --json | jq '.cluster | {leader: .leader, replicas: [.replicas[].name]}'
```

The missing peer is a server that should host a replica but does not appear in the stream's peer list.
The same check can be scripted with the Go client:

```go
js, _ := nc.JetStream()
info, _ := js.StreamInfo("MY_STREAM")

expectedPeers := 3 // your expected peer count
actualPeers := 1   // leader
if info.Cluster != nil {
    actualPeers += len(info.Cluster.Replicas)
}

if actualPeers != expectedPeers {
    log.Printf("ALERT: stream has %d peers, expected %d",
        actualPeers, expectedPeers)
    if info.Cluster != nil {
        log.Printf("  leader: %s", info.Cluster.Leader)
        for _, r := range info.Cluster.Replicas {
            log.Printf("  replica: %s (offline=%v, lag=%d)",
                r.Name, r.Offline, r.Lag)
        }
    }
}
```

Option 1: Wait for the server to return. If the missing peer is on a server undergoing planned maintenance, the replica will rejoin the Raft group automatically when the server comes back online. The replica will catch up from the leader.
Option 2: Force peer removal and re-replication. If the server is permanently gone, remove the stale peer from the Raft group so JetStream can place a new replica on a healthy server:
```
nats stream cluster peer-remove MY_STREAM OFFLINE_SERVER_NAME
```

JetStream will automatically select a new server for the replacement replica and begin replication from the leader.
Extra peers typically resolve themselves as the Raft group stabilizes. If the extra peer persists, remove it:
```
nats stream cluster peer-remove MY_STREAM EXTRA_SERVER_NAME
```

Define the expected peer count in the stream's metadata so monitoring catches deviations automatically:
```
nats stream edit MY_STREAM \
  --metadata "io.nats.monitor.peer-expect=3"
```

Monitor cluster health continuously. Use nats server report jetstream in your monitoring stack to detect server failures that affect stream replication.
Implement automated peer recovery. When a server is permanently removed from the cluster, automatically run peer-remove for all streams that had replicas on that server. This prevents streams from remaining under-replicated indefinitely.
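One way to sketch that automation with the Go client is below. It assumes an established connection and JetStream context plus the encoding/json, log, and time imports, and it calls the raw JetStream admin API directly; the subject $JS.API.STREAM.PEER.REMOVE.<stream> and the {"peer": ...} payload reflect my understanding of the current API and should be verified against your server version (the caller also needs permission to use the JetStream admin API).

```go
// removePeerEverywhere asks JetStream to drop a decommissioned server from
// every stream that still lists it as a replica, so new replicas can be placed.
func removePeerEverywhere(nc *nats.Conn, js nats.JetStreamContext, deadServer string) {
    for info := range js.Streams() {
        if info.Cluster == nil {
            continue
        }
        for _, r := range info.Cluster.Replicas {
            if r.Name != deadServer {
                continue
            }
            // Equivalent of: nats stream cluster peer-remove <stream> <server>
            req, _ := json.Marshal(map[string]string{"peer": deadServer})
            subj := "$JS.API.STREAM.PEER.REMOVE." + info.Config.Name
            resp, err := nc.Request(subj, req, 5*time.Second)
            if err != nil {
                log.Printf("%s: peer-remove request failed: %v", info.Config.Name, err)
                continue
            }
            log.Printf("%s: peer-remove response: %s", info.Config.Name, resp.Data)
        }
    }
}
```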
Use placement tags. Configure stream placement tags to ensure replicas are distributed across failure domains (racks, availability zones, regions). This makes it easier to reason about which servers should host which stream peers:
```
nats stream edit MY_STREAM --tag "az:us-east-1a" --tag "az:us-east-1b"
```

num_replicas is the stream's configuration parameter that tells JetStream how many copies of the data to maintain. peer-expect is an operator-defined monitoring threshold set via metadata. They are usually the same value, but peer-expect exists as a separate check because the actual peer count can diverge from num_replicas during failures, maintenance, or cluster transitions. The check compares actual peers against the operator's explicit expectation, which may differ from the configured value in transitional states.
A missing peer does not take the stream down immediately. As long as a quorum of peers is available (a majority of the configured replica count), the stream continues to accept writes and replicate data. However, an under-replicated stream has reduced fault tolerance. If enough additional peers fail to break quorum, the stream becomes read-only and new writes are rejected. The risk is proportional to how far the actual peer count is below the expected count.
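To make the arithmetic concrete: quorum is a strict majority of the configured replica count, and the number of additional failures a stream can absorb is its current peer count minus that majority. A tiny illustration (the function names exist only for this example):

```go
// majority returns the quorum size for a given configured replica count.
func majority(configuredReplicas int) int {
    return configuredReplicas/2 + 1
}

// headroom returns how many more peers can be lost before writes stop.
func headroom(configuredReplicas, actualPeers int) int {
    return actualPeers - majority(configuredReplicas)
}

// R3 with all peers present: headroom(3, 3) == 1 -> one more failure is survivable.
// R3 with one peer missing:  headroom(3, 2) == 0 -> the next failure breaks quorum.
```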
When you run peer-remove, JetStream immediately begins selecting a new server and replicating data. The time to full replication depends on the stream size and available network bandwidth. For large streams (hundreds of GBs), replication can take hours. During this window, the new replica is present in the peer list but has high lag.
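To know when the replacement replica has finished catching up, you can poll the stream info and watch each replica's Current flag and Lag. A sketch with the nats.go client; the polling interval and the assumption of an established JetStream context (plus the log and time imports) are illustrative:

```go
// waitCaughtUp blocks until every replica of the stream reports Current,
// i.e. the replacement replica has drained its lag behind the leader.
func waitCaughtUp(js nats.JetStreamContext, stream string) {
    for {
        info, err := js.StreamInfo(stream)
        if err != nil {
            log.Fatal(err)
        }
        caughtUp := true
        if info.Cluster != nil {
            for _, r := range info.Cluster.Replicas {
                if !r.Current {
                    caughtUp = false
                    log.Printf("replica %s still catching up, lag=%d", r.Name, r.Lag)
                }
            }
        }
        if caughtUp {
            log.Printf("all replicas of %s are current", stream)
            return
        }
        time.Sleep(30 * time.Second)
    }
}
```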
In most cases, peer-expect should simply match num_replicas: set io.nats.monitor.peer-expect to the same value. The only exception is during planned transitions. For example, if you are migrating a stream from R3 to R5, you might temporarily set peer-expect to a value that accounts for the in-progress change. Reset it to match the final configuration once the migration is complete.
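Where stream configuration is managed from code rather than the CLI, keeping the tag aligned with num_replicas is a small config update. A sketch using the Go client, with ORDERS as an illustrative stream name and an established JetStream context assumed; UpdateStream replaces the whole configuration, so start from the current one:

```go
info, err := js.StreamInfo("ORDERS")
if err != nil {
    log.Fatal(err)
}

cfg := info.Config
if cfg.Metadata == nil {
    cfg.Metadata = map[string]string{}
}
// Keep the monitoring expectation in lockstep with the configured replicas.
cfg.Metadata["io.nats.monitor.peer-expect"] = strconv.Itoa(cfg.Replicas)

if _, err := js.UpdateStream(&cfg); err != nil {
    log.Fatal(err)
}
```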
The check is also meaningful for unreplicated (R1) streams. Setting peer-expect=1 verifies that the stream's single peer (the leader) is present. If the stream's server goes offline, the peer count drops to zero. While this is equivalent to the stream being completely unavailable, the peer-expect check provides a consistent monitoring interface across all stream configurations.