
NATS Consumer Quorum Lost: What It Means and How to Fix It

Severity: Critical
Category: Health
Applies to: Consumer
Check ID: CONSUMER_003
Detection threshold: offline replicas >= quorum needed (R > 1)

Consumer quorum lost means a replicated JetStream consumer has too many offline replicas to elect a leader. The consumer is completely stalled — no messages are delivered, no acknowledgments are processed, and downstream services receive nothing until quorum is restored.

Why this matters

Quorum is the minimum number of Raft group members that must be online for the group to function. For an R3 consumer, quorum is 2 — lose two of three replicas and the consumer is dead in the water. Unlike a single replica going offline (CONSUMER_001), where the consumer continues operating with reduced fault tolerance, quorum loss is a full stop.
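
The arithmetic behind those numbers is a strict majority: floor(R/2) + 1. A small illustrative Go snippet (not from the NATS codebase) makes the trade-off concrete:

// Illustrative sketch, not NATS source: quorum is a strict majority of R.
package main

import "fmt"

// quorum returns the minimum number of online replicas a Raft group of
// size r needs in order to elect a leader.
func quorum(r int) int {
    return r/2 + 1
}

func main() {
    for _, r := range []int{1, 3, 5} {
        fmt.Printf("R%d: quorum = %d, tolerates %d offline\n", r, quorum(r), r-quorum(r))
    }
    // R1: quorum = 1, tolerates 0 offline
    // R3: quorum = 2, tolerates 1 offline
    // R5: quorum = 3, tolerates 2 offline
}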

The consumer stops delivering messages immediately. There’s no graceful degradation — Raft requires a majority to elect a leader, and without a leader, no operations proceed. Messages continue accumulating in the underlying stream, but the consumer can’t read them. Ack pending timers are irrelevant because no delivery is happening. Downstream services see a complete message blackout, and depending on the consumer’s role, this can cascade into broader system failures: unfulfilled orders, stale caches, broken event-driven workflows.

The recovery pressure is intense. Every second the consumer is down, the gap between the stream’s head and the consumer’s last delivered position grows. When quorum is restored and a leader is elected, the consumer must catch up — potentially delivering a burst of backlogged messages that overwhelms downstream services. If the consumer was a push consumer, the subscribing application may not be ready for a sudden spike. If it was a pull consumer with a batch size, the catch-up is more controlled, but the accumulated lag still represents a period of stale data.

Common causes

  • Multiple servers down simultaneously. The most common cause. Two of three servers hosting consumer replicas are offline at the same time — due to a datacenter power event, coordinated restarts, or cascading failures triggered by resource exhaustion.

  • Aggressive rolling restart. Restarting servers too quickly without waiting for Raft groups to stabilize. If the first server hasn’t fully rejoined and re-synced before the second server is taken down, the consumer briefly has only one online replica — below quorum for R3.

  • Network partition splitting the group. A network failure isolates two replicas from each other and from the leader. Each partition has fewer than quorum members, so no partition can elect a leader. The consumer stalls even though all servers are running.

  • Disk failures on multiple nodes. JetStream consumer state is disk-backed. Simultaneous disk failures on two of three replica hosts prevent those replicas from participating in the Raft group.

  • Resource exhaustion cascading across servers. One server hits OOM and is killed. The remaining servers absorb its workload, pushing them toward their own resource limits. A second server hits OOM. Now the consumer’s lost quorum.

How to diagnose

Check consumer status

Terminal window
nats consumer info <stream-name> <consumer-name>

In the Cluster section, look at the replica list. If enough replicas show as offline to prevent quorum (2 of 3 for R3, 3 of 5 for R5), the consumer has lost quorum. The Leader field will be empty.
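
If you monitor this from code rather than the CLI, here is a minimal sketch using the nats.go client; the stream name "ORDERS" and consumer name "processor" are placeholders:

// Go: inspect a consumer's Raft group from code (illustrative sketch;
// "ORDERS" and "processor" are placeholder names)
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    ci, err := js.ConsumerInfo("ORDERS", "processor")
    if err != nil {
        log.Fatal(err)
    }

    // Cluster is nil for unclustered (R1) consumers
    if ci.Cluster == nil {
        return
    }
    offline := 0
    for _, p := range ci.Cluster.Replicas {
        if p.Offline {
            offline++
        }
    }
    if ci.Cluster.Leader == "" {
        log.Printf("no leader, %d replica(s) offline: quorum likely lost", offline)
    }
}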

Watch for quorum loss advisories

Terminal window
nats event --js-advisory

JetStream publishes a quorum loss advisory on:

$JS.EVENT.ADVISORY.CONSUMER.QUORUM_LOST.<STREAM>.<CONSUMER>

This fires the moment quorum is lost, giving you real-time notification before users report symptoms.
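
A small Go sketch that watches for these advisories using the subject above; how you fan the alert out to your paging system is up to you:

// Go: watch for consumer quorum loss advisories (illustrative sketch)
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    // Wildcard covers every stream/consumer pair; the payload is a JSON advisory
    _, err = nc.Subscribe("$JS.EVENT.ADVISORY.CONSUMER.QUORUM_LOST.>", func(m *nats.Msg) {
        log.Printf("quorum lost: %s %s", m.Subject, string(m.Data))
        // hand off to your alerting system here
    })
    if err != nil {
        log.Fatal(err)
    }

    select {} // block forever so advisories keep arriving
}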

Check which servers are down

Terminal window
nats server list

Cross-reference the offline replicas from consumer info with the server list. If servers are missing entirely, the cause is server-level. If servers are present but replicas are still offline, investigate per-server health (disk, memory, Raft state).

Assess the scope of impact

Terminal window
nats consumer report <stream-name>

Check whether other consumers on the same stream are also affected. If multiple consumers lost quorum, the root cause is likely shared infrastructure (same servers hosting multiple consumer groups).

Check stream health too

Terminal window
nats stream info <stream-name>

Consumer replicas live on the same servers as their stream’s replicas. If the stream has also lost quorum (JETSTREAM_008), the consumer can’t recover until the stream is healthy.
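
As a programmatic variant of the same check, a short sketch that reuses the JetStream context from the consumer example above ("ORDERS" is a placeholder):

// Go: a leaderless stream blocks consumer recovery (sketch; reuses js from above)
si, err := js.StreamInfo("ORDERS")
if err != nil {
    log.Fatal(err)
}
if si.Cluster != nil && si.Cluster.Leader == "" {
    log.Println("stream quorum also lost: restore the stream first")
}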

How to fix it

Immediate: restore quorum

Bring offline servers back online. This is the fastest path to recovery. Every second matters — the message backlog grows while the consumer is stalled:

Terminal window
# On each offline server
systemctl restart nats-server
# Verify servers rejoin the cluster
nats server list
# Confirm consumer quorum is restored
nats consumer info <stream-name> <consumer-name>

Once quorum is restored, the group elects a leader and the consumer resumes delivery. Expect a burst of backlogged messages.

If servers can’t be recovered, remove dead peers so the group can form quorum with remaining members:

Terminal window
nats stream cluster peer-remove <stream-name> <dead-peer> # consumer placement is inherited from the stream's Raft group

After removing enough dead peers for the remaining replicas to form quorum, the consumer elects a leader and resumes. The consumer will then re-replicate to new servers to restore the target replica count.

Short-term: manage the recovery burst

Throttle pull consumer fetch sizes during catch-up. After quorum is restored, the consumer has a backlog. For pull consumers, use smaller batch sizes initially to avoid overwhelming downstream services:

// Go: controlled catch-up with smaller batches
sub, _ := js.PullSubscribe("orders.>", "processor")
// Start with small batches during catch-up
msgs, _ := sub.Fetch(10, nats.MaxWait(5*time.Second))
# Python: pull messages in controlled batches
import nats

async def catch_up():
    nc = await nats.connect()
    js = nc.jetstream()
    sub = await js.pull_subscribe("orders.>", "processor")
    # Small batches during catch-up
    msgs = await sub.fetch(10, timeout=5)
    for msg in msgs:
        await process(msg)
        await msg.ack()

Monitor lag during recovery:

Terminal window
# Watch the consumer's pending count decrease
watch -n 5 nats consumer info <stream-name> <consumer-name>
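
To track catch-up from code instead, a rough sketch that polls the consumer until its backlog drains (assumes a JetStream context js and the usual nats.go and time imports, as in the diagnosis examples; names are placeholders):

// Go: poll pending counts while the consumer works through the backlog (sketch)
for {
    ci, err := js.ConsumerInfo("ORDERS", "processor")
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("unprocessed=%d ack_pending=%d", ci.NumPending, ci.NumAckPending)
    if ci.NumPending == 0 {
        break
    }
    time.Sleep(5 * time.Second)
}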

Long-term: prevent quorum loss

Use lame duck mode for all rolling restarts. Lame duck mode gives Raft groups time to elect new leaders before the server shuts down, preventing the “two replicas offline simultaneously” scenario:

Terminal window
# Signal lame duck before stopping
nats-server --signal ldm=<pid>
# Wait for all Raft groups to migrate leadership (30-60 seconds)
# Then stop the server

Stagger server restarts with sufficient stabilization time. Never restart the next server until all Raft groups on the previous server have fully rejoined and caught up. Monitor with:

Terminal window
nats server report jetstream

Wait until all groups show the restarted server as current before proceeding.

Use R5 for consumers that absolutely cannot tolerate downtime. R5 requires 3 of 5 replicas for quorum, meaning it survives two simultaneous failures:

Terminal window
nats stream edit <stream-name> --replicas=5

Spread replicas across failure domains. Use placement tags to ensure consumer replicas (which follow stream replica placement) are distributed across availability zones, racks, or power domains:

nats-server.conf
server_tags: ["az:us-east-1a"]

Set up quorum loss alerting. Alert the moment quorum is lost, not when users report symptoms.

Synadia Insights monitors consumer quorum automatically every collection epoch, correlating consumer health with server availability and providing actionable context for recovery.

Frequently asked questions

How is consumer quorum loss different from consumer replica offline?

Consumer Replica Offline (CONSUMER_001) means one replica is down but the consumer still functions — it’s a warning that fault tolerance is reduced. Consumer Quorum Lost (CONSUMER_003) means enough replicas are down that the consumer can’t operate at all. CONSUMER_001 is a degraded state; CONSUMER_003 is a stopped state. Fixing a CONSUMER_001 alert prevents it from becoming CONSUMER_003.

Do messages get lost when a consumer loses quorum?

No. Messages continue to be accepted by the stream (assuming the stream still has quorum). The consumer simply stops reading and delivering them. When quorum is restored, the consumer resumes from where it left off — the last acknowledged position. No messages are lost, but there is a delivery gap. Downstream services miss messages during the outage and receive them in a burst during catch-up.

Can I force a consumer to operate without quorum?

No. Raft consensus requires a majority by design — this is what prevents split-brain and data corruption. You can’t override it. The correct recovery path is either restoring enough replicas to form quorum, or removing dead peers so the surviving replicas constitute a quorum. If only one replica remains, removing the dead peers allows it to form a single-node quorum and resume.

How long can a consumer survive with one replica offline before it’s a problem?

There’s no time limit on operating with one replica offline; the consumer functions normally with 2 of 3 replicas because quorum is still met. The risk is purely about fault tolerance: one more failure causes quorum loss. The urgency depends on your environment’s failure probability. In a stable environment, hours or even days may be fine. During a maintenance window or known instability, minutes matter.

What happens to ack pending messages during quorum loss?

Nothing — they freeze in place. The ack pending timer can’t fire because the consumer can’t process any operations without a leader. When quorum is restored, ack pending timers resume from where they were. Messages that had been delivered but not acknowledged before quorum loss will eventually be redelivered if their ack deadline expires after recovery.

Proactive monitoring for NATS consumer quorum lost with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial