A JetStream stream loses quorum when enough of its Raft replicas go offline that the remaining members cannot form a majority. Without quorum, the stream cannot elect a leader — all publishes, consumes, and API operations against that stream stall completely until enough replicas return.
Quorum loss is a full outage for the affected stream. No messages can be published, no consumers can make progress, and no configuration changes can be applied. Unlike a single replica failure (where the remaining majority continues operating), quorum loss means the stream is completely frozen.
The blast radius depends on what the stream carries. A stream backing an order processing pipeline stalls all orders. A stream aggregating sensor telemetry creates a growing data gap. Any request-reply pattern that publishes to the affected stream will time out, cascading failures to upstream services.
Recovery is not automatic. Even after the missing servers come back online, Raft needs time to re-establish quorum, elect a leader, and replay any uncommitted log entries. During this window — which can range from seconds to minutes depending on state size — the stream remains unavailable. If the servers are permanently lost (disk failure, decommissioned without replacement), manual intervention is required to remove the dead peers and allow the remaining members to form a new quorum.
For R3 streams (the most common replicated configuration), losing any 2 of 3 servers causes quorum loss. For R5, losing 3 of 5. The math is simple: quorum requires floor(N/2) + 1 members. But in practice, correlated failures — a shared rack losing power, a Kubernetes node pool scaling down, a network partition isolating a minority — can take out multiple replicas simultaneously.
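If it helps to see that arithmetic spelled out, here is a small sketch (plain Go, no NATS dependency; the replica counts shown are just the common configurations):

```go
package main

import "fmt"

func main() {
	// Quorum needs a strict majority: floor(N/2) + 1 replicas online.
	// Anything beyond that is the number of failures the group tolerates.
	for _, n := range []int{1, 2, 3, 4, 5} {
		quorum := n/2 + 1
		fmt.Printf("R%d: quorum = %d, tolerates %d failure(s)\n", n, quorum, n-quorum)
	}
}
```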
Server outage. The most straightforward cause: servers hosting stream replicas crash, lose power, or are shut down for maintenance. If enough replicas are on affected servers, quorum is lost.
Network partition. A network split isolates replicas from each other. Even if all servers are running, they can’t communicate to maintain Raft consensus. The partition side with fewer replicas loses quorum.
Kubernetes pod eviction or node drain. In Kubernetes deployments, node drains, pod evictions, or PodDisruptionBudget misconfiguration can simultaneously terminate multiple NATS server pods that host replicas of the same stream.
Correlated infrastructure failure. Multiple servers in the same availability zone, rack, or cloud region go down together. If stream replicas aren’t spread across failure domains, a single infrastructure event takes out quorum.
Disk exhaustion. A server runs out of disk space and the NATS process exits or becomes unable to write Raft logs. If this happens on enough replica servers, quorum is lost.
Rolling upgrade gone wrong. During a rolling server upgrade, if servers are restarted too quickly without waiting for replicas to catch up, multiple replicas can be effectively offline at the same time.
To diagnose, start with a cluster-wide stream report:

```
nats stream report
```

Streams with quorum loss will show fewer online replicas than the configured replica count. For a detailed view of a specific stream:
```
nats stream info <stream_name>
```

The cluster section shows each replica’s state. Look for replicas marked as offline or with no current entry.
Then check which servers are reachable:

```
nats server list
```

Cross-reference offline servers with the stream’s replica placement. To see which streams are assigned to which servers:
```
nats server report jetstream
```

To inspect the Raft state directly:

```
nats server report jetstream --raft
```

This shows the Raft state for each stream group, including leader, term, committed index, and applied index. Streams without a leader have lost quorum.
If servers appear online but streams still lack quorum, suspect a network partition:
```
# Check route connectivity between servers
nats server list

# Check server logs for route disconnection events
# Look for: "[WRN] Route connection closed"
```
To check for disk exhaustion:

```
nats server report jetstream
```

Check the Storage column. Servers at or near their storage limit may have stopped accepting Raft writes.
Bring offline servers back online. This is the fastest path to recovery. Restart crashed servers, reconnect network-partitioned nodes, or resume drained Kubernetes pods:
```
# Check server status
nats server ping

# If using systemd
sudo systemctl start nats-server

# If using Kubernetes, check pod status
kubectl get pods -l app=nats -o wide
```

Once enough replicas reconnect to form a majority, Raft will automatically elect a leader and the stream resumes operation. No manual intervention is needed if the data is intact.
Monitor recovery progress:
```
# Watch the stream until a leader is elected
nats stream info <stream_name> --json | jq '.cluster'
```
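If you would rather script the wait than watch the CLI, here is a minimal sketch using the nats.go client; the stream name ORDERS, the two-second poll interval, and the default connection URL are placeholder assumptions:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Poll the stream's cluster info until a leader is reported.
	for {
		si, err := js.StreamInfo("ORDERS")
		switch {
		case err != nil:
			// The info request itself can fail or time out while quorum is missing.
			log.Printf("stream info failed (possibly still no quorum): %v", err)
		case si.Cluster != nil && si.Cluster.Leader != "":
			fmt.Printf("leader elected: %s\n", si.Cluster.Leader)
			return
		}
		time.Sleep(2 * time.Second)
	}
}
```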
If a server is permanently gone (disk destroyed, node decommissioned), remove the failed peer via nats stream cluster peer-remove to lower the quorum requirement. This allows the remaining peers to elect a leader:

```
# Remove a dead peer from a stream's Raft group
nats stream cluster peer-remove <stream_name> <peer_name>
```

After removing the dead peer from an R3 stream, you’re down to R2 — which still has quorum (2 of 2) but has zero fault tolerance. The remaining peers can now elect a leader and resume operations. Add a new replica as soon as possible to restore full redundancy:
```
# The server will automatically find a suitable server for the new replica
# based on the stream's placement constraints
nats stream edit <stream_name> --replicas 3
```
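The same change can be made from application code. A sketch reusing the js context from the earlier snippet (the stream name is again a placeholder); it copies the existing configuration so that only the replica count changes:

```go
// Sketch: restore an R3 configuration after a dead peer was removed.
si, err := js.StreamInfo("ORDERS")
if err != nil {
	log.Fatal(err)
}
cfg := si.Config // start from the stream's current configuration
cfg.Replicas = 3 // raise the replica count back to three
if _, err := js.UpdateStream(&cfg); err != nil {
	log.Fatal(err)
}
```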
Force a leader election if the stream is stuck despite having quorum:

```
nats stream cluster step-down <stream_name>
```

Spread replicas across failure domains. Use placement tags to ensure stream replicas land on servers in different racks, availability zones, or regions:
```
{
  "placement": {
    "tags": ["az:us-east-1a"]
  }
}
```

Configure each server with a unique tag representing its failure domain, then set stream placement constraints to require diversity.
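If streams are created from application code rather than the CLI, the same placement block can be set at creation time. A sketch with nats.go, again reusing the js context from above; the stream name, subject, and tag value are illustrative only:

```go
// Sketch: create a replicated stream whose replicas must land on servers
// carrying the given placement tag.
if _, err := js.AddStream(&nats.StreamConfig{
	Name:     "SENSORS",
	Subjects: []string{"sensors.>"},
	Replicas: 3,
	Placement: &nats.Placement{
		Tags: []string{"az:us-east-1a"},
	},
}); err != nil {
	log.Fatal(err)
}
```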
Use odd replica counts. R3 and R5 are standard. R2 provides no fault tolerance improvement over R1 (losing either replica loses quorum). R4 tolerates the same number of failures as R3 but uses more resources.
Implement PodDisruptionBudgets in Kubernetes. Prevent Kubernetes from draining too many NATS pods simultaneously:
```
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nats-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nats
```

Monitor replica health proactively. Don’t wait for quorum loss — alert on the first offline replica (META_001) or on replica lag (JETSTREAM_001). These are early warnings that give you time to act before the stream goes down.
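As a rough illustration of the signals such a check inspects, the sketch below (reusing the js context from earlier) flags offline or lagging replicas for a single stream; the stream name and the lag threshold are arbitrary examples, not the checks’ actual logic:

```go
// Sketch: surface early-warning signals for one stream's replicas.
si, err := js.StreamInfo("ORDERS")
if err != nil {
	log.Fatal(err)
}
for _, peer := range si.Cluster.Replicas {
	switch {
	case peer.Offline:
		fmt.Printf("ALERT: replica %s is offline\n", peer.Name)
	case !peer.Current || peer.Lag > 1000:
		fmt.Printf("WARN: replica %s is behind (current=%v, lag=%d)\n",
			peer.Name, peer.Current, peer.Lag)
	}
}
```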
For R3 (three replicas), you can lose 1. For R5, you can lose 2. The formula is: a group of N replicas tolerates floor((N-1)/2) failures. This is why odd replica counts are recommended — R4 tolerates the same failures as R3 (1) but costs an extra replica.
Yes, if the data is intact. When enough replicas reconnect to form a majority, Raft automatically elects a leader and the stream resumes. The recovering replicas will catch up by replaying any log entries they missed. No manual intervention is needed unless a server is permanently lost.
Not directly — Raft requires a true majority. But you can remove dead peers with nats stream cluster peer-remove to reduce the total replica count, which changes what constitutes a majority. For example, removing one dead peer from an R3 stream leaves R2, where both remaining replicas form quorum.
They fail. Publishers receive an error (or timeout, depending on the client configuration). No messages are silently dropped — the publish explicitly fails. Applications should handle this with retry logic or by buffering locally until the stream recovers. JetStream publish acknowledgments make this deterministic: if you didn’t get an ack, the message wasn’t stored.
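A sketch of that retry pattern with the nats.go client; the subject, attempt count, and backoff values are arbitrary choices:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// publishWithRetry retries a JetStream publish with exponential backoff until
// the stream acknowledges the write or the attempts are exhausted.
func publishWithRetry(js nats.JetStreamContext, subject string, data []byte) error {
	backoff := 250 * time.Millisecond
	for attempt := 1; attempt <= 5; attempt++ {
		_, err := js.Publish(subject, data)
		if err == nil {
			return nil // the stream stored the message and acked it
		}
		// Per the ack semantics described above, no ack means the message was
		// not stored, so it is safe to retry.
		log.Printf("publish attempt %d failed: %v", attempt, err)
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("publish to %q failed after retries", subject)
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	if err := publishWithRetry(js, "orders.created", []byte(`{"id":42}`)); err != nil {
		log.Fatal(err)
	}
}
```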
Use placement tags to spread replicas across independent failure domains (different racks, availability zones, or regions). Configure your NATS servers with tags that represent their physical or logical failure domain, then set placement constraints on streams that require tag diversity across replicas.