
NATS Offline Replica: What It Means and How to Fix It

Severity: Critical
Category: Health
Applies to: Meta Cluster
Check ID: META_001
Detection threshold: Any meta cluster peer reported as offline

A meta cluster replica is offline — one of the JetStream-enabled servers in the cluster is not participating in the meta group’s Raft consensus. This reduces the cluster’s fault tolerance and, depending on cluster size, may put the system one failure away from losing meta quorum entirely.

Why this matters

The meta cluster is the control plane for all of JetStream. Every JetStream-enabled server joins the meta group, which uses Raft consensus to manage stream and consumer placement, process JetStream API requests, and maintain cluster-wide state. The meta group requires a quorum — a majority of peers — to function. In a three-node cluster, quorum requires two of three peers. Losing one replica doesn’t break quorum, but it eliminates all fault tolerance: one more failure and the JetStream API goes down.
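To make that arithmetic concrete, here is a small illustrative Go sketch (plain majority math, not a NATS API) that prints the quorum size and remaining failure tolerance for common meta group sizes:

// Quorum arithmetic for a Raft group of n peers (illustrative only).
package main

import "fmt"

func quorum(n int) int { return n/2 + 1 }

func main() {
	for _, n := range []int{3, 5} {
		q := quorum(n)
		fmt.Printf("%d peers: quorum=%d, failures tolerated=%d\n", n, q, n-q)
		// With one replica already offline, tolerance for further failures shrinks by one.
		fmt.Printf("  with one peer offline: further failures tolerated=%d\n", n-q-1)
	}
}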

When a meta replica is offline, the cluster continues operating normally for existing workloads. Messages flow, consumers process, and streams accept writes (assuming their own Raft groups have quorum). But the meta group is degraded. If the leader needs to make a placement decision — assigning a new stream replica, rebalancing after a topology change — it has fewer peers to replicate to. Operations that require meta consensus may be slower because the leader waits for responses from a smaller peer set.

The real danger is the reduced margin for error. A three-node cluster with one offline replica has zero tolerance for additional failures. If a second server goes down for any reason — planned maintenance, a crash, a network partition — meta quorum is lost and the JetStream API stalls cluster-wide. No new streams, no new consumers, no configuration changes. Existing data continues to flow through streams that still have their own quorum, but the control plane is frozen.

Common causes

  • Server process crashed or was killed. The most straightforward cause. The NATS server process exited due to a panic, OOM kill, or manual termination. The server’s Raft voter disappears from the group.

  • Network partition isolating the server. The server is still running but can’t communicate with its peers. From the meta group’s perspective, the peer is offline because it’s not responding to Raft heartbeats. The server itself may believe it’s fine.

  • Disk failure or I/O errors. Raft requires writing log entries and snapshots to disk. If the disk is full, read-only, or experiencing I/O errors, the Raft subsystem cannot function and the server drops out of the meta group.

  • Resource exhaustion (OOM, CPU starvation). The server is technically running but so resource-constrained that it can’t process Raft heartbeats in time. The meta leader treats it as offline after the heartbeat timeout expires. Common on under-provisioned servers handling too many HA assets.

  • Server removed from cluster without peer cleanup. A server was decommissioned or replaced, but its peer entry was never removed from the meta group. The group keeps expecting it to respond, counting it as an offline peer indefinitely.

  • Aggressive rolling restart. During upgrades, if a server is restarted before the previously restarted server has fully rejoined the meta group, there’s a window where multiple peers appear offline simultaneously.

How to diagnose

Check meta group status

The most direct way to see meta group health:

Terminal window
nats server report jetstream

This displays the Raft Meta Group table with columns for each peer including Offline (true/false), Active (time since last activity), and Lag (operations behind the leader). Any peer showing Offline: true is the problem.

Identify the offline server

The meta group report shows server names. Cross-reference with:

Terminal window
nats server list

Servers that appear in the meta group but not in server list are unreachable. Servers that appear in both but show as offline in the meta group may have JetStream-specific issues (disk, Raft state) even though the core NATS process is running.

Check server health directly

If the server is reachable on its monitoring port:

Terminal window
curl -s http://<server-ip>:8222/healthz

A healthy response returns {"status":"ok"}. Any other response indicates a health issue. Check the specific error — it may point to JetStream storage, Raft state, or other subsystem failures.
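To script this check across the cluster, a minimal Go sketch along these lines works. The server addresses are placeholders, 8222 is the default monitoring port used above, and the error field is an assumption about the health payload:

// Poll each server's monitoring endpoint and report anything that is not {"status":"ok"}.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Replace with your servers' monitoring addresses.
	servers := []string{"http://10.0.0.1:8222", "http://10.0.0.2:8222", "http://10.0.0.3:8222"}
	client := &http.Client{Timeout: 2 * time.Second}

	for _, s := range servers {
		resp, err := client.Get(s + "/healthz")
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", s, err)
			continue
		}
		var body struct {
			Status string `json:"status"`
			Error  string `json:"error"`
		}
		json.NewDecoder(resp.Body).Decode(&body)
		resp.Body.Close()
		if body.Status != "ok" {
			fmt.Printf("%s: unhealthy: status=%s error=%s\n", s, body.Status, body.Error)
		}
	}
}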

For JetStream-specific state:

Terminal window
curl -s http://<server-ip>:8222/jsz | jq '{meta_leader: .meta_leader, meta_cluster: .meta_cluster}'

Check server logs

On the offline server (if accessible), look for:

[ERR] JetStream cluster unable to write to WAL
[WRN] Raft heartbeat timeout, stepping down
[ERR] Disk full, JetStream disabled

These indicate the specific failure that caused the server to drop out of the meta group.

Check system-level health

Terminal window
# Is the server process running?
systemctl status nats-server
# Check available disk space
df -h /path/to/jetstream/store
# Check system memory
free -h
# Check for OOM kills
dmesg | grep -i "out of memory"

How to fix it

Immediate: restore the replica

If the server crashed, restart it. The simplest fix — bring the server back:

Terminal window
systemctl start nats-server

The server will rejoin the meta group automatically, catch up on missed Raft entries, and resume voting. Monitor the meta group report to confirm it transitions from Offline: true to Offline: false and the Lag drops to zero.
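If you would rather script that confirmation than re-run the report by hand, a rough Go sketch like the one below polls a server's /jsz monitoring endpoint until the restored peer reports offline: false and zero lag. The monitoring URL and peer name are placeholders:

// Poll /jsz until the named meta peer is back online with no lag.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type jszMeta struct {
	MetaCluster struct {
		Replicas []struct {
			Name    string `json:"name"`
			Offline bool   `json:"offline"`
			Lag     uint64 `json:"lag"`
		} `json:"replicas"`
	} `json:"meta_cluster"`
}

func main() {
	const monitorURL = "http://10.0.0.1:8222/jsz" // a reachable server's monitoring port
	const peer = "nats-2"                         // name of the restored server

	for {
		if resp, err := http.Get(monitorURL); err == nil {
			var jsz jszMeta
			json.NewDecoder(resp.Body).Decode(&jsz)
			resp.Body.Close()
			for _, r := range jsz.MetaCluster.Replicas {
				if r.Name == peer {
					fmt.Printf("%s: offline=%v lag=%d\n", peer, r.Offline, r.Lag)
					if !r.Offline && r.Lag == 0 {
						return // peer has rejoined and caught up
					}
				}
			}
		}
		time.Sleep(5 * time.Second)
	}
}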

If disk is full, free space first. The server can’t rejoin the meta group if it can’t write Raft log entries:

Terminal window
# Check what's consuming space
du -sh /path/to/jetstream/store/*
# If JetStream data is the culprit, consider purging low-priority streams
# after the server is running again

If the server is permanently gone, remove its peer. Don’t leave ghost peers in the meta group — remove it via nats server cluster peer-remove. This requires the remaining members to have quorum:

Terminal window
nats server cluster peer-remove <server-name>

This tells the meta group to stop expecting that peer. Quorum requirements adjust to the new, smaller group size. Note: this command can only succeed when the remaining peers still form a quorum — if quorum is already lost, you must restore enough peers first.

Short-term: prevent future offline events

Use lame duck mode for planned maintenance. Before stopping a server for upgrades or maintenance, gracefully drain it:

Terminal window
nats-server --signal ldm=<pid> # sends SIGUSR2, which puts the server into lame duck mode

Or send SIGUSR2 directly to the process. In lame duck mode the server stops accepting new client connections, migrates leaders away, gradually evicts existing clients, and then shuts down cleanly. This ensures stream and meta leaders are moved before the server disappears. (A plain SIGINT or SIGTERM stops the server without this drain.)

Monitor server health proactively. Don’t wait for operators to notice an offline replica:

// Go: monitor meta group health via the system account
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, _ := nats.Connect("nats://localhost:4222", nats.UserInfo("sys_user", "sys_pass"))
	defer nc.Close()

	// Request JetStream server info (Request returns the first server's reply)
	resp, _ := nc.Request("$SYS.REQ.SERVER.PING.JSZ", nil, 2*time.Second)
	// Parse resp.Data (JSON) and check the meta group replicas for offline peers
	fmt.Println(string(resp.Data))
}

# Python: check meta group via /jsz
import aiohttp

async def check_meta_replicas(monitor_urls: list[str]):
    async with aiohttp.ClientSession() as session:
        for url in monitor_urls:
            async with session.get(f"{url}/jsz") as resp:
                data = await resp.json()
                meta = data.get("meta_cluster", {})
                for peer in meta.get("replicas", []):
                    if peer.get("offline"):
                        print(f"ALERT: Meta peer {peer['name']} is offline")

Ensure adequate disk headroom. Set storage reservations to leave at least 10-20% of total disk for Raft WAL, snapshots, and OS overhead. Monitor disk usage at the OS level, not just JetStream utilization.
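As a sketch of that OS-level monitoring (Linux-specific syscall; the store path is a placeholder), something like the following reports the free-space percentage for the filesystem backing the JetStream store directory:

// Report free-space percentage for the filesystem backing the JetStream store.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	const storeDir = "/var/lib/nats/jetstream" // replace with your store_dir

	var st syscall.Statfs_t
	if err := syscall.Statfs(storeDir, &st); err != nil {
		panic(err)
	}
	free := st.Bavail * uint64(st.Bsize)
	total := st.Blocks * uint64(st.Bsize)
	pctFree := 100 * float64(free) / float64(total)

	fmt.Printf("%s: %.1f%% free\n", storeDir, pctFree)
	if pctFree < 15 { // alert below the 10-20% headroom guideline
		fmt.Println("WARNING: JetStream disk headroom below recommended threshold")
	}
}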

Long-term: build fault tolerance

Use odd-numbered cluster sizes of at least three. A three-node meta group tolerates one failure. A five-node group tolerates two. For production workloads where JetStream availability is critical, a five-node group provides significantly better resilience during rolling upgrades and unexpected failures.

Automate server replacement. In orchestrated environments (Kubernetes, Nomad), configure your orchestrator to automatically replace failed NATS pods. Combined with peer removal for permanently lost servers, this keeps the meta group healthy without manual intervention.

Implement health-based alerting. Alert on the /healthz endpoint and on the meta group’s offline replica count.

Synadia Insights evaluates meta group health every collection epoch and fires a critical alert the moment any replica goes offline, giving you time to act before a second failure risks quorum loss.

Frequently asked questions

Can the cluster still operate with one offline meta replica?

Yes, as long as a quorum of peers remains online. In a three-node cluster, two of three peers maintain quorum. Existing streams and consumers continue operating normally. JetStream API requests are processed by the meta leader. The risk is that you’ve lost all fault tolerance — one more failure breaks quorum.

How long does it take for a restarted server to rejoin the meta group?

Typically seconds to a few minutes, depending on how much Raft log the server missed while offline. The server requests a snapshot from the leader and replays any log entries since that snapshot. For a server that was only offline briefly, this is nearly instant. For extended outages with high JetStream API activity, the catch-up may take longer.

Should I remove an offline peer or wait for it to come back?

If the server will be restored (planned maintenance, recoverable crash), wait. The server will rejoin automatically and catch up. If the server is permanently lost (hardware failure, decommissioned), remove it with nats server cluster peer-remove. Leaving a dead peer in the group permanently reduces your fault tolerance.

What’s the difference between a meta offline replica and a stream offline replica?

The meta group manages the JetStream control plane — stream and consumer placement, API operations, and cluster-wide state. A meta offline replica (META_001) affects the ability to manage JetStream. Stream offline replicas affect individual stream availability and replication. A meta outage is broader in impact because it blocks all JetStream management operations, while a stream replica outage only affects that specific stream.

How do I check if an offline replica caused any data issues?

After restoring the replica, check for replication lag:

Terminal window
nats server report jetstream

Verify the Lag column for the restored peer shows zero. Also check individual streams for replica lag with nats stream report. If the server was offline long enough that Raft logs were compacted, the leader will send a full snapshot — this is automatic and doesn’t require intervention.

Proactive monitoring for NATS offline replica with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial