
NATS Consumer Replica Offline: What It Means and How to Fix It

Severity: Critical
Category: Health
Applies to: Consumer
Check ID: CONSUMER_001
Detection threshold: consumer replica reported as offline (is_offline = true)

An offline consumer replica means one of the Raft group members for a replicated JetStream consumer is not responding. The consumer keeps operating as long as quorum is maintained, but fault tolerance is reduced: one more failure could halt message delivery entirely.

Why this matters

Every replicated JetStream consumer (R3 or R5) maintains its own Raft group to track acknowledgment state, deliver sequences, and pending messages. When a replica goes offline, the Raft group loses a voter. For an R3 consumer, losing one replica means you’re operating on the minimum quorum of two — one more failure and message delivery stops completely.
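
To make the arithmetic concrete: quorum is a simple majority of the configured replicas. A minimal Go sketch of the math behind the R3/R5 guidance (illustrative only):

// Sketch: Raft quorum arithmetic for replicated consumers.
package main

import "fmt"

func main() {
    for _, replicas := range []int{3, 5} {
        quorum := replicas/2 + 1       // majority of voters needed to keep delivering
        tolerated := replicas - quorum // replica failures survivable before delivery halts
        fmt.Printf("R%d: quorum=%d, tolerated failures=%d\n", replicas, quorum, tolerated)
    }
}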

The immediate risk is reduced fault tolerance. An R3 consumer normally tolerates one server failure. With a replica already offline, you’ve consumed that tolerance. A rolling restart, a second server failure, or even a network hiccup isolating another replica will push the consumer into quorum loss (CONSUMER_003). At that point, the consumer stops delivering messages, ack pending messages pile up, and downstream services see a complete blackout.

The subtler risk is divergence. While a replica is offline, the remaining group members continue processing acknowledgments and advancing the deliver sequence. When the offline replica eventually comes back, it must catch up — replaying committed Raft entries it missed. If it was offline for a long time and the Raft log has been compacted via snapshots, it needs a full state transfer from the leader. During catch-up, the consumer group has less redundancy and higher load on the remaining replicas. If multiple consumers have replicas on the same offline server, the recovery load when that server returns can create a resource spike.

Common causes

  • Server hosting the replica crashed or was stopped. The most straightforward cause. The server process died, was killed by the OS (OOM), or was intentionally stopped for maintenance. All Raft groups with replicas on that server — streams, consumers, and meta — go offline simultaneously.

  • Network partition isolating the replica. The server is running but can’t communicate with the other Raft group members. Route connections between cluster peers are down or intermittent. The replica appears offline to the group even though its host server is operational.

  • Disk failure on the replica’s server. JetStream consumer state is persisted to disk. A disk failure prevents the replica from loading or writing state, causing it to drop out of the Raft group.

  • Resource exhaustion (OOM kill). The server process was killed by the operating system’s OOM killer. Common on servers that host many streams and consumers with large pending message buffers, especially with memory-backed storage.

  • Server removed from cluster without consumer cleanup. A server was decommissioned but its consumer replicas were never migrated. The Raft group still expects the peer but it will never return.

How to diagnose

Check consumer cluster status

Terminal window
nats consumer info <stream-name> <consumer-name>

Look at the Cluster section. Each replica shows its name, whether it’s current, and its operational status. Offline replicas are explicitly marked.
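
The same details are available programmatically via the Go client. A minimal sketch, assuming a stream named ORDERS and a consumer named processor (both placeholders):

// Sketch: inspect a consumer's replica status with nats.go.
package main

import (
    "fmt"
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // "ORDERS" and "processor" are placeholder names.
    info, err := js.ConsumerInfo("ORDERS", "processor")
    if err != nil {
        log.Fatal(err)
    }
    if info.Cluster == nil {
        log.Fatal("consumer is not clustered")
    }

    fmt.Println("leader:", info.Cluster.Leader)
    for _, r := range info.Cluster.Replicas {
        fmt.Printf("replica=%s current=%v offline=%v lag=%d\n", r.Name, r.Current, r.Offline, r.Lag)
    }
}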

List all consumers with issues

Terminal window
nats consumer report <stream-name>

This shows all consumers on a stream with their cluster status. Look for consumers with fewer online replicas than expected.

Check if the host server is running

Terminal window
nats server list

If the server hosting the offline replica doesn’t appear, it’s down entirely. If it appears but the consumer replica is still offline, the issue is specific to that Raft group (disk, state corruption, or resource exhaustion).

Watch for consumer advisories

Terminal window
nats events --js-advisory

JetStream publishes advisories on $JS.EVENT.ADVISORY.CONSUMER.LEADER_ELECTED.<STREAM>.<CONSUMER> when a new leader is elected, and $JS.EVENT.ADVISORY.CONSUMER.QUORUM_LOST.<STREAM>.<CONSUMER> if quorum is lost. These provide real-time visibility into consumer group health changes.
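
To consume these advisories from code instead of the CLI, a minimal Go sketch that subscribes to the consumer advisory subjects (the wildcard covers both subjects above):

// Sketch: watch consumer leader-elected and quorum-lost advisories in real time.
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    // Matches LEADER_ELECTED and QUORUM_LOST for every stream/consumer pair.
    if _, err := nc.Subscribe("$JS.EVENT.ADVISORY.CONSUMER.>", func(m *nats.Msg) {
        log.Printf("advisory %s: %s", m.Subject, m.Data)
    }); err != nil {
        log.Fatal(err)
    }

    select {} // keep the process alive
}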

Check overall Raft health

Terminal window
nats server report jetstream

This gives a cluster-wide view of Raft group health. If multiple consumers (and streams) show offline replicas on the same server, the root cause is server-level, not consumer-specific.

How to fix it

Immediate: restore the replica

Bring the server back online. If the host server crashed or was stopped, restart it. The consumer replica will rejoin the Raft group and begin catching up:

Terminal window
# Restart the nats-server process
systemctl restart nats-server
# Verify it rejoined the cluster
nats server list

After the server starts, the replica needs time to catch up on missed Raft entries. Monitor progress:

Terminal window
# Watch the consumer's cluster info for the replica's lag to decrease
nats consumer info <stream-name> <consumer-name>
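
To watch catch-up from code rather than re-running the CLI, a minimal Go sketch that polls replica state until lag drains (ORDERS and processor are placeholder names; the 5-second interval is arbitrary):

// Sketch: poll replica lag until the returning replica has caught up.
package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    for {
        info, err := js.ConsumerInfo("ORDERS", "processor")
        if err != nil {
            log.Fatal(err)
        }
        caughtUp := true
        for _, r := range info.Cluster.Replicas {
            if r.Offline || !r.Current || r.Lag > 0 {
                caughtUp = false
                log.Printf("replica %s: offline=%v current=%v lag=%d", r.Name, r.Offline, r.Current, r.Lag)
            }
        }
        if caughtUp {
            log.Println("all replicas online and current")
            return
        }
        time.Sleep(5 * time.Second)
    }
}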

If the server can’t be recovered, the replica will remain offline. For R3 consumers, this is survivable but fragile — proceed to short-term fixes.

Short-term: restore full redundancy

Force a leader election if the offline replica was the leader. If the consumer’s leader went offline and a new leader wasn’t automatically elected (unlikely but possible during complex failure scenarios):

Terminal window
nats consumer cluster step-down <stream-name> <consumer-name>

Remove the dead peer if the server is permanently lost. If a server has been decommissioned and won’t return, remove its peer from the consumer’s Raft group. The consumer will re-replicate to an available server:

Terminal window
# Remove the dead peer (consumer placement is inherited from the stream's Raft group)
nats stream cluster peer-remove <stream-name> <peer-name>

The consumer will automatically select a new server for the replacement replica based on the stream’s placement constraints.

Long-term: prevent future occurrences

Use lame duck mode for planned maintenance. Before stopping a server, signal it to gracefully migrate leadership and allow clients to reconnect elsewhere:

Terminal window
nats-server --signal ldm=<pid>

Lame duck mode gives Raft groups time to elect new leaders on other servers before the old server shuts down, avoiding the offline replica state entirely.
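
Clients can also be told when their server enters lame duck mode so they migrate before the shutdown completes. A minimal sketch using the Go client's lame duck handler:

// Sketch: get notified when the connected server enters lame duck mode.
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL,
        nats.LameDuckModeHandler(func(c *nats.Conn) {
            log.Printf("server %s entering lame duck mode; connection will migrate", c.ConnectedServerName())
        }),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    select {} // a real application continues its normal work here
}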

Implement server health monitoring. Alert on server availability before consumer replicas go offline:

// Go: monitor server health (system-account connection required)
nc, _ := nats.Connect(url)
_, err := nc.Request("$SYS.REQ.SERVER.PING", nil, time.Second)
if err != nil {
    // No response within the timeout = the server is down; alert here
}

# Python: check consumer replica status
import nats

async def check_consumers():
    nc = await nats.connect()
    js = nc.jetstream()
    info = await js.consumer_info("orders", "processor")
    for replica in info.cluster.replicas or []:
        if replica.offline:
            print(f"ALERT: replica {replica.name} offline for {info.stream_name}/{info.name}")

Use R5 for critical consumers. R5 consumers tolerate two simultaneous replica failures while maintaining quorum. For consumers where delivery downtime is unacceptable, the additional replicas provide a wider safety margin:

Terminal window
# Streams control consumer replica count — set R5 on the stream
nats stream edit <stream-name> --replicas=5
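
The same change can be made from code by updating the stream configuration. A minimal Go sketch, assuming a stream named ORDERS (a placeholder):

// Sketch: raise the stream's replica count so its consumers inherit R5.
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // "ORDERS" is a placeholder stream name.
    info, err := js.StreamInfo("ORDERS")
    if err != nil {
        log.Fatal(err)
    }

    cfg := info.Config
    cfg.Replicas = 5
    if _, err := js.UpdateStream(&cfg); err != nil {
        log.Fatal(err)
    }
}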

Monitor with Prometheus. Export JetStream metrics (for example via the prometheus-nats-exporter) and track consumer replica health across the fleet.

Synadia Insights evaluates consumer replica health automatically every collection epoch, alerting on offline replicas before they cascade into quorum loss.

Frequently asked questions

Does a consumer replica going offline cause message loss?

No. As long as the consumer retains quorum (at least 2 of 3 replicas for R3), message delivery continues normally. The offline replica misses updates but the remaining replicas maintain the full acknowledgment state. When the offline replica returns, it catches up from the leader. Message loss would require losing quorum AND having the stream’s replicas also fail.

How long does it take for a consumer replica to catch up after coming back online?

It depends on how much state was missed. If the Raft log still contains all entries since the replica went offline, catch-up is fast — typically seconds. If the log was compacted and a full snapshot transfer is needed, it takes longer — proportional to the consumer’s state size. For consumers with large ack pending sets, this can take minutes.

What’s the difference between CONSUMER_001 and META_001?

CONSUMER_001 detects offline replicas in consumer Raft groups. META_001 detects offline replicas in the meta cluster Raft group. They’re structurally identical checks applied to different Raft group types. A single server failure triggers both if that server hosted consumer replicas and was a meta cluster member. The impact differs: an offline meta replica affects the JetStream control plane, while an offline consumer replica affects one specific consumer’s fault tolerance.

Should I use R3 or R5 for consumer replicas?

R3 is appropriate for most workloads — it tolerates one failure while maintaining quorum. R5 is justified when the cost of consumer downtime is high enough to warrant two extra replicas per consumer. Note that consumer replica count is inherited from the stream — you can’t set a different replica count for the consumer independently.

Can I manually move a consumer replica to a different server?

Not directly. You can remove a peer from the stream's Raft group with nats stream cluster peer-remove (consumer placement follows the stream), and the system will automatically place a new replica on an available server. Use placement tags on streams to influence where replicas land.

Proactive monitoring for NATS consumer replica offline with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial