A Raft group peer count mismatch occurs when a NATS JetStream stream or consumer Raft group reports more active peers than the num_replicas value in its configuration. The extra peer is typically a ghost — a node that was supposed to be removed during a cluster operation but whose removal didn’t fully propagate. The group continues to function, but the stale peer consumes resources, complicates leader elections, and can mask real quorum problems.
Every Raft group in JetStream maintains a peer set that determines quorum. For an R3 stream, quorum requires agreement from 2 of 3 peers. If a fourth ghost peer lingers in the group, the quorum calculation shifts — the group now needs 3 of 4 peers to agree, making it harder to achieve consensus and more fragile during node outages.
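To make the quorum arithmetic concrete, here is a minimal standalone sketch of the majority calculation Raft uses (plain Go, no NATS dependencies):

```go
package main

import "fmt"

// quorum returns the minimum number of peers that must agree:
// a strict majority of the peer set.
func quorum(peers int) int {
	return peers/2 + 1
}

func main() {
	fmt.Println(quorum(3)) // 2: a healthy R3 group tolerates one node outage
	fmt.Println(quorum(4)) // 3: with a ghost that never acks, one real outage stalls commits
}
```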
The ghost peer creates several operational risks. First, the leader continues sending append entries to the stale peer, wasting network bandwidth and CPU on messages that will never be acknowledged. In clusters with hundreds of streams, the aggregate overhead of ghost peers across many Raft groups becomes measurable. Second, the stale peer inflates the reported replica count, making capacity planning inaccurate — operators think the stream has more redundancy than it actually does, or they see an unexpected peer count and waste time investigating a non-existent node. Third, during rolling upgrades or maintenance windows, the ghost peer can prevent clean leader elections if the group momentarily loses quorum while waiting for a response from a node that no longer participates.
The mismatch is also a symptom of incomplete operational procedures. If one peer-remove left a ghost behind, others likely did too. Finding and fixing these mismatches now prevents a slow accumulation of Raft group inconsistencies that become much harder to untangle later.
**Incomplete peer-remove during scaling down.** An operator reduced `num_replicas` from 3 to 1, or used `nats stream cluster peer-remove` to evict a node, but the removal didn't fully propagate. The removed peer's entry persists in the Raft group's peer set even though the node is no longer replicating data. This is the most common cause.

**Node replacement without clean removal.** A server was decommissioned and replaced with a new node. The new node joined the Raft group, but the old node's peer entry was never explicitly removed. The group now has N+1 peers: the N configured replicas plus the ghost of the old node.

**Raft state divergence after network partition.** A network partition during a peer-remove operation can leave the group in a split state where some members processed the removal and others didn't. When the partition heals, the group may settle with the stale peer still in the peer set if the removal command wasn't retried.

**Leadership transfer during replica count change.** If the Raft group leader changes mid-operation while `num_replicas` is being decreased, the new leader may not complete the peer removal that the old leader initiated. The configuration update succeeds (`num_replicas` shows the new value), but the actual peer set retains the extra member.

**Manual Raft group manipulation.** Direct manipulation of JetStream metadata or Raft state files, typically during disaster recovery, can introduce peer set inconsistencies if not done carefully. The metadata says R3, but the Raft group's internal peer list has 4 or 5 entries.
List all streams and compare the configured replica count to the actual peer count:
```bash
nats stream report
```

Look for streams where the Replicas column shows more peers than expected. For a detailed view of a specific stream's Raft group:
```bash
nats stream info ORDERS --json | jq '{
  config_replicas: .config.num_replicas,
  cluster_name:    .cluster.name,
  leader:          .cluster.leader,
  peers:           [.cluster.replicas[].name],
  peer_count:      (.cluster.replicas | length) + 1
}'
```

If `peer_count` exceeds `config_replicas`, the group has a mismatch. The `+ 1` accounts for the leader, which is reported separately from the `replicas` list.
Consumer Raft groups inherit the stream’s replica count but can independently develop mismatches:
```bash
nats consumer info ORDERS my-consumer --json | jq '{
  cluster_leader: .cluster.leader,
  peers:          [.cluster.replicas[].name],
  peer_count:     (.cluster.replicas | length) + 1
}'
```

Compare the Raft group's peer list against the currently known cluster members:
```bash
# List active servers
nats server list

# Compare against the stream's peer set
nats stream info ORDERS
```

A peer that appears in the stream's replica list but not in `nats server list` is the ghost. If the ghost peer's name matches a decommissioned or replaced server, that confirms the cause.
To scan every stream programmatically with the Go client:

```go
import (
	"fmt"

	"github.com/nats-io/nats.go"
)

func checkPeerMismatches(js nats.JetStreamContext) error {
	for name := range js.StreamNames() {
		info, err := js.StreamInfo(name)
		if err != nil {
			return err
		}
		if info.Cluster == nil {
			continue // not clustered, nothing to check
		}
		expected := info.Config.Replicas
		actual := len(info.Cluster.Replicas) + 1 // +1 for the leader
		if actual > expected {
			fmt.Printf("MISMATCH: stream=%s expected=%d actual=%d extra=%d\n",
				name, expected, actual, actual-expected)
			for _, r := range info.Cluster.Replicas {
				fmt.Printf("  peer=%s current=%v offline=%v lag=%d\n",
					r.Name, r.Current, r.Offline, r.Lag)
			}
		}
	}
	return nil
}
```

The same check with the Python client:

```python
import asyncio

import nats


async def check_peer_mismatches():
    nc = await nats.connect()
    js = nc.jetstream()

    for info in await js.streams_info():
        if info.cluster is None:
            continue  # not clustered, nothing to check
        name = info.config.name
        expected = info.config.num_replicas
        replicas = info.cluster.replicas or []
        actual = len(replicas) + 1  # +1 for the leader
        if actual > expected:
            print(f"MISMATCH: stream={name} expected={expected} actual={actual}")
            for r in replicas:
                print(f"  peer={r.name} current={r.current} offline={r.offline} lag={r.lag}")

    await nc.close()


asyncio.run(check_peer_mismatches())
```
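To single out which peer is the ghost in code, the same stream info can be diffed against a set of servers you know to be live. This is a minimal sketch: `findGhostPeers` is a hypothetical helper, and `expectedServers` is assumed to be built from the output of `nats server list`.

```go
import (
	"github.com/nats-io/nats.go"
)

// findGhostPeers returns peers that appear in a stream's Raft group but
// are absent from the set of known-live servers.
func findGhostPeers(js nats.JetStreamContext, stream string, expectedServers map[string]bool) ([]string, error) {
	info, err := js.StreamInfo(stream)
	if err != nil {
		return nil, err
	}
	if info.Cluster == nil {
		return nil, nil // not clustered, nothing to check
	}
	var ghosts []string
	for _, r := range info.Cluster.Replicas {
		if !expectedServers[r.Name] {
			ghosts = append(ghosts, r.Name)
		}
	}
	return ghosts, nil
}
```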
Once you've identified the ghost peer, remove it from the stream's Raft group:

```bash
nats stream cluster peer-remove ORDERS ghost-server-name
```

For consumer Raft groups, your CLI version may not expose a direct peer-remove command, in which case you can delete and recreate the consumer:
```bash
# Export consumer config
nats consumer info ORDERS my-consumer --json > consumer-backup.json

# Delete and recreate
nats consumer rm ORDERS my-consumer -f
nats consumer add ORDERS --config consumer-backup.json
```
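The same delete-and-recreate step can be scripted with the Go client. This is a sketch, with `recreateConsumer` as a hypothetical helper; note that recreating a durable consumer discards its delivery state, so redeliveries are possible.

```go
import (
	"github.com/nats-io/nats.go"
)

// recreateConsumer deletes a consumer and recreates it with the same
// configuration, forcing a fresh Raft group without the ghost peer.
// Caution: the consumer's acknowledgment state is lost on recreation.
func recreateConsumer(js nats.JetStreamContext, stream, consumer string) error {
	info, err := js.ConsumerInfo(stream, consumer)
	if err != nil {
		return err
	}
	cfg := info.Config // snapshot the existing configuration
	if err := js.DeleteConsumer(stream, consumer); err != nil {
		return err
	}
	_, err = js.AddConsumer(stream, &cfg)
	return err
}
```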
After removing the extra peer, confirm the peer count matches the configured replicas:

```bash
nats stream info ORDERS
```

The replica list should now show exactly `num_replicas` - 1 followers plus 1 leader.
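If you're verifying from code rather than the CLI, a small polling helper can wait for the peer set to settle, since the removal takes a moment to propagate. A sketch, assuming the same Go client as above; the one-second poll interval and `maxWait` are arbitrary choices:

```go
import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

// waitForPeerCount polls until the stream's peer count (followers plus
// leader) matches its configured replica count, or gives up after maxWait.
func waitForPeerCount(js nats.JetStreamContext, stream string, maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	for time.Now().Before(deadline) {
		info, err := js.StreamInfo(stream)
		if err != nil {
			return err
		}
		if info.Cluster != nil && len(info.Cluster.Replicas)+1 == info.Config.Replicas {
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("peer count for %q did not converge within %v", stream, maxWait)
}
```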
**Always verify peer removal completed.** After any peer-remove operation, check that the peer count decreased:
```bash
nats stream cluster peer-remove ORDERS old-node

# Wait a few seconds for propagation
nats stream info ORDERS
```

**Script node decommissioning.** When removing a server from the cluster, iterate over all streams and consumers hosted on that node and explicitly remove it from each Raft group before shutting down the server:
```bash
# Find all streams with replicas on the departing node
nats stream report --json | jq -r '.[] | select(.cluster.replicas[]?.name == "old-node") | .stream'
```
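If you want the removal step itself in code, the CLI command corresponds to a JetStream admin API request. The sketch below assumes the `$JS.API.STREAM.PEER.REMOVE.<stream>` subject and `{"peer": "..."}` payload used by current nats-server releases; verify against your server version, and prefer the CLI where it is available.

```go
import (
	"encoding/json"
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

// removePeerFromStream asks the JetStream API to drop a peer from a
// stream's Raft group. It sends the same request that
// `nats stream cluster peer-remove` issues under the hood.
func removePeerFromStream(nc *nats.Conn, stream, peer string) error {
	req, err := json.Marshal(map[string]string{"peer": peer})
	if err != nil {
		return err
	}
	subject := fmt.Sprintf("$JS.API.STREAM.PEER.REMOVE.%s", stream)
	msg, err := nc.Request(subject, req, 5*time.Second)
	if err != nil {
		return err
	}
	var resp struct {
		Error *struct {
			Description string `json:"description"`
		} `json:"error"`
		Success bool `json:"success"`
	}
	if err := json.Unmarshal(msg.Data, &resp); err != nil {
		return err
	}
	if resp.Error != nil {
		return fmt.Errorf("peer-remove failed: %s", resp.Error.Description)
	}
	return nil
}
```

Loop this over the stream names produced by the jq query above before shutting the node down.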
**Monitor for mismatches continuously.** Synadia Insights evaluates OPT_SYS_026 automatically across your deployment, flagging any Raft group where the observed peer count doesn't match the configured replica count, before ghost peers accumulate and cause operational surprises.

In rare cases, the peer count mismatch exists because `num_replicas` was decreased in configuration but the intent is to keep all current replicas. If so, update the replica count to match reality:
```bash
nats stream edit ORDERS --replicas 3
```

Only do this if you genuinely want the additional replica. In most cases, the ghost peer should be removed instead.
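Programmatically, the same reconciliation is a stream update with a corrected replica count. A sketch with the same Go client as above; `adoptCurrentReplicaCount` is a hypothetical helper name:

```go
import (
	"github.com/nats-io/nats.go"
)

// adoptCurrentReplicaCount raises num_replicas to match the peers that
// are actually in the group, instead of evicting the extra peer.
func adoptCurrentReplicaCount(js nats.JetStreamContext, stream string) error {
	info, err := js.StreamInfo(stream)
	if err != nil {
		return err
	}
	if info.Cluster == nil {
		return nil // not clustered, nothing to reconcile
	}
	cfg := info.Config
	cfg.Replicas = len(info.Cluster.Replicas) + 1 // followers + leader
	_, err = js.UpdateStream(&cfg)
	return err
}
```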
**Does a peer count mismatch risk data loss?** No. The extra peer doesn't corrupt data; the Raft protocol still functions correctly with an extra member. The risks are operational: degraded quorum math, wasted replication overhead, and misleading cluster topology. However, if the ghost peer pushes the group to an even number of voters, the group becomes less partition-tolerant, which indirectly increases the risk of unavailability (though not data loss) during network events.
**Can the JetStream meta-group develop the same mismatch?** Yes. The JetStream meta-group is itself a Raft group and can develop the same mismatch if a server is removed from the cluster without a clean meta-group peer removal. Check the meta-group with `nats server report jetstream` and compare the listed nodes against your expected cluster membership.
**How do I check every stream and consumer at once?** Use the programmatic approach shown in the diagnosis section. For production environments, Synadia Insights runs OPT_SYS_026 continuously across all streams and consumers, generating alerts when any Raft group has more peers than its configured replica count.
**Will a ghost peer eventually age out on its own?** No. Raft peer sets are explicit: a peer remains in the group until it is actively removed via a configuration change or a peer-remove operation. The ghost peer will persist indefinitely, even if the corresponding server no longer exists in the cluster.
**Is it safe to remove a peer while the stream is serving traffic?** Yes. Peer removal is a Raft membership change that the leader coordinates without disrupting message flow. The leader proposes the membership change, the group commits it, and the removed peer stops receiving append entries. Active publishers and consumers are not affected.