A JetStream stream loses quorum when enough of its Raft replicas go offline that the remaining members cannot form a majority. Without quorum, the stream cannot elect a leader — all publishes, consumes, and API operations against that stream stall completely until enough replicas return.
Quorum loss is a full outage for the affected stream. No messages can be published, no consumers can make progress, and no configuration changes can be applied. Unlike a single replica failure (where the remaining majority continues operating), quorum loss means the stream is completely frozen.
The blast radius depends on what the stream carries. A stream backing an order processing pipeline stalls all orders. A stream aggregating sensor telemetry creates a growing data gap. Any request-reply pattern that publishes to the affected stream will time out, cascading failures to upstream services.
Recovery is not automatic. Even after the missing servers come back online, Raft needs time to re-establish quorum, elect a leader, and replay any uncommitted log entries. During this window — which can range from seconds to minutes depending on state size — the stream remains unavailable. If the servers are permanently lost (disk failure, decommissioned without replacement), manual intervention is required to remove the dead peers and allow the remaining members to form a new quorum.
For R3 streams (the most common replicated configuration), losing any 2 of 3 servers causes quorum loss. For R5, losing 3 of 5. The math is simple: quorum requires floor(N/2) + 1 members. But in practice, correlated failures — a shared rack losing power, a Kubernetes node pool scaling down, a network partition isolating a minority — can take out multiple replicas simultaneously.
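If it helps to see that arithmetic spelled out, here is a small sketch (plain Go, no NATS dependency; the replica counts shown are just the common configurations):

```go
package main

import "fmt"

func main() {
	// Quorum needs a strict majority: floor(N/2) + 1 replicas online.
	// Anything beyond that is the number of failures the group tolerates.
	for _, n := range []int{1, 2, 3, 4, 5} {
		quorum := n/2 + 1
		fmt.Printf("R%d: quorum = %d, tolerates %d failure(s)\n", n, quorum, n-quorum)
	}
}
```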
Server outage. The most straightforward cause: servers hosting stream replicas crash, lose power, or are shut down for maintenance. If enough replicas are on affected servers, quorum is lost.
Network partition. A network split isolates replicas from each other. Even if all servers are running, they can’t communicate to maintain Raft consensus. The partition side with fewer replicas loses quorum.
Kubernetes pod eviction or node drain. In Kubernetes deployments, node drains, pod evictions, or PodDisruptionBudget misconfiguration can simultaneously terminate multiple NATS server pods that host replicas of the same stream.
Correlated infrastructure failure. Multiple servers in the same availability zone, rack, or cloud region go down together. If stream replicas aren’t spread across failure domains, a single infrastructure event takes out quorum.
Disk exhaustion. A server runs out of disk space and the NATS process exits or becomes unable to write Raft logs. If this happens on enough replica servers, quorum is lost.
Rolling upgrade gone wrong. During a rolling server upgrade, if servers are restarted too quickly without waiting for replicas to catch up, multiple replicas can be effectively offline at the same time.
To diagnose, start with a cluster-wide stream report:

```
nats stream report
```

Streams with quorum loss will show fewer online replicas than the configured replica count. For a detailed view of a specific stream:
```
nats stream info <stream_name>
```

The cluster section shows each replica’s state. Look for replicas marked as offline or with no current entry.
Then check which servers are reachable:

```
nats server list
```

Cross-reference offline servers with the stream’s replica placement. To see which streams are assigned to which servers:
```
nats server report jetstream
```

To inspect the Raft state directly:

```
nats server report jetstream --raft
```

This shows the Raft state for each stream group, including leader, term, committed index, and applied index. Streams without a leader have lost quorum.
If servers appear online but streams still lack quorum, suspect a network partition:
```
# Check route connectivity between servers
nats server list

# Check server logs for route disconnection events
# Look for: "[WRN] Route connection closed"
```
To check for disk exhaustion:

```
nats server report jetstream
```

Check the Storage column. Servers at or near their storage limit may have stopped accepting Raft writes.
Bring offline servers back online. This is the fastest path to recovery. Restart crashed servers, reconnect network-partitioned nodes, or resume drained Kubernetes pods:
```
# Check server status
nats server ping

# If using systemd
sudo systemctl start nats-server

# If using Kubernetes, check pod status
kubectl get pods -l app=nats -o wide
```

Once enough replicas reconnect to form a majority, Raft will automatically elect a leader and the stream resumes operation. No manual intervention is needed if the data is intact.
Monitor recovery progress:
```
# Watch the stream until a leader is elected
nats stream info <stream_name> --json | jq '.cluster'
```
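If you would rather script the wait than watch the CLI, here is a minimal sketch using the nats.go client; the stream name ORDERS, the two-second poll interval, and the default connection URL are placeholder assumptions:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Poll the stream's cluster info until a leader is reported.
	for {
		si, err := js.StreamInfo("ORDERS")
		switch {
		case err != nil:
			// The info request itself can fail or time out while quorum is missing.
			log.Printf("stream info failed (possibly still no quorum): %v", err)
		case si.Cluster != nil && si.Cluster.Leader != "":
			fmt.Printf("leader elected: %s\n", si.Cluster.Leader)
			return
		}
		time.Sleep(2 * time.Second)
	}
}
```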
If a server is permanently gone (disk destroyed, node decommissioned), remove the failed peer via nats stream cluster peer-remove to lower the quorum requirement. This allows the remaining peers to elect a leader:

```
# Remove a dead peer from a stream's Raft group
nats stream cluster peer-remove <stream_name> <peer_name>
```

After removing the dead peer from an R3 stream, you’re down to R2 — which still has quorum (2 of 2) but has zero fault tolerance. The remaining peers can now elect a leader and resume operations. Add a new replica as soon as possible to restore full redundancy:
```
# The server will automatically find a suitable server for the new replica
# based on the stream's placement constraints
nats stream edit <stream_name> --replicas 3
```
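The same change can be made from application code. A sketch reusing the js context from the earlier snippet (the stream name is again a placeholder); it copies the existing configuration so that only the replica count changes:

```go
// Sketch: restore an R3 configuration after a dead peer was removed.
si, err := js.StreamInfo("ORDERS")
if err != nil {
	log.Fatal(err)
}
cfg := si.Config // start from the stream's current configuration
cfg.Replicas = 3 // raise the replica count back to three
if _, err := js.UpdateStream(&cfg); err != nil {
	log.Fatal(err)
}
```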
Force a leader election if the stream is stuck despite having quorum:

```
nats stream cluster step-down <stream_name>
```

Spread replicas across failure domains. Use placement tags to ensure stream replicas land on servers in different racks, availability zones, or regions:
```
{
  "placement": {
    "tags": ["az:us-east-1a"]
  }
}
```

Configure each server with a unique tag representing its failure domain, then set stream placement constraints to require diversity.
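If streams are created from application code rather than the CLI, the same placement block can be set at creation time. A sketch with nats.go, again reusing the js context from above; the stream name, subject, and tag value are illustrative only:

```go
// Sketch: create a replicated stream whose replicas must land on servers
// carrying the given placement tag.
if _, err := js.AddStream(&nats.StreamConfig{
	Name:     "SENSORS",
	Subjects: []string{"sensors.>"},
	Replicas: 3,
	Placement: &nats.Placement{
		Tags: []string{"az:us-east-1a"},
	},
}); err != nil {
	log.Fatal(err)
}
```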
Use odd replica counts. R3 and R5 are standard. R2 provides no fault tolerance improvement over R1 (losing either replica loses quorum). R4 tolerates the same number of failures as R3 but uses more resources.
Implement PodDisruptionBudgets in Kubernetes. Prevent Kubernetes from draining too many NATS pods simultaneously:
```
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nats-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nats
```

Monitor replica health proactively. Don’t wait for quorum loss — alert on the first offline replica (META_001) or on replica lag (JETSTREAM_001). These are early warnings that give you time to act before the stream goes down.
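As a rough illustration of the signals such a check inspects, the sketch below (reusing the js context from earlier) flags offline or lagging replicas for a single stream; the stream name and the lag threshold are arbitrary examples, not the checks’ actual logic:

```go
// Sketch: surface early-warning signals for one stream's replicas.
si, err := js.StreamInfo("ORDERS")
if err != nil {
	log.Fatal(err)
}
for _, peer := range si.Cluster.Replicas {
	switch {
	case peer.Offline:
		fmt.Printf("ALERT: replica %s is offline\n", peer.Name)
	case !peer.Current || peer.Lag > 1000:
		fmt.Printf("WARN: replica %s is behind (current=%v, lag=%d)\n",
			peer.Name, peer.Current, peer.Lag)
	}
}
```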
For R3 (three replicas), you can lose 1. For R5, you can lose 2. The formula is: a group of N replicas tolerates floor((N-1)/2) failures. This is why odd replica counts are recommended — R4 tolerates the same failures as R3 (1) but costs an extra replica.
Yes, if the data is intact. When enough replicas reconnect to form a majority, Raft automatically elects a leader and the stream resumes. The recovering replicas will catch up by replaying any log entries they missed. No manual intervention is needed unless a server is permanently lost.
Not directly — Raft requires a true majority. But you can remove dead peers with nats stream cluster peer-remove to reduce the total replica count, which changes what constitutes a majority. For example, removing one dead peer from an R3 stream leaves R2, where both remaining replicas form quorum.
They fail. Publishers receive an error (or timeout, depending on the client configuration). No messages are silently dropped — the publish explicitly fails. Applications should handle this with retry logic or by buffering locally until the stream recovers. JetStream publish acknowledgments make this deterministic: if you didn’t get an ack, the message wasn’t stored.
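A sketch of that retry pattern with the nats.go client; the subject, attempt count, and backoff values are arbitrary choices:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// publishWithRetry retries a JetStream publish with exponential backoff until
// the stream acknowledges the write or the attempts are exhausted.
func publishWithRetry(js nats.JetStreamContext, subject string, data []byte) error {
	backoff := 250 * time.Millisecond
	for attempt := 1; attempt <= 5; attempt++ {
		_, err := js.Publish(subject, data)
		if err == nil {
			return nil // the stream stored the message and acked it
		}
		// Per the ack semantics described above, no ack means the message was
		// not stored, so it is safe to retry.
		log.Printf("publish attempt %d failed: %v", attempt, err)
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("publish to %q failed after retries", subject)
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	if err := publishWithRetry(js, "orders.created", []byte(`{"id":42}`)); err != nil {
		log.Fatal(err)
	}
}
```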
Use placement tags to spread replicas across independent failure domains (different racks, availability zones, or regions). Configure your NATS servers with tags that represent their physical or logical failure domain, then set placement constraints on streams that require tag diversity across replicas.