A meta cluster replica is offline — one of the JetStream-enabled servers in the cluster is not participating in the meta group’s Raft consensus. This reduces the cluster’s fault tolerance and, depending on cluster size, may put the system one failure away from losing meta quorum entirely.
The meta cluster is the control plane for all of JetStream. Every JetStream-enabled server joins the meta group, which uses Raft consensus to manage stream and consumer placement, process JetStream API requests, and maintain cluster-wide state. The meta group requires a quorum — a majority of peers — to function. In a three-node cluster, quorum requires two of three peers. Losing one replica doesn’t break quorum, but it eliminates all fault tolerance: one more failure and the JetStream API goes down.
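The quorum arithmetic can be made concrete with a short sketch; the helper names below are illustrative, not part of any NATS API:

```python
def quorum_size(peers: int) -> int:
    """Raft quorum is a strict majority of the peer set."""
    return peers // 2 + 1

def failures_tolerated(peers: int) -> int:
    """How many peers can be lost before quorum breaks."""
    return peers - quorum_size(peers)

# Three-node meta group: quorum is 2 of 3, tolerating one failure.
# With one replica already offline, the remaining margin is zero:
online = 3 - 1
margin = online - quorum_size(3)  # 0: the next failure loses quorum
```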
When a meta replica is offline, the cluster continues operating normally for existing workloads. Messages flow, consumers process, and streams accept writes (assuming their own Raft groups have quorum). But the meta group is degraded. If the leader needs to make a placement decision — assigning a new stream replica, rebalancing after a topology change — it has fewer peers to replicate to. Operations that require meta consensus may be slower because the leader waits for responses from a smaller peer set.
The real danger is the reduced margin for error. A three-node cluster with one offline replica has zero tolerance for additional failures. If a second server goes down for any reason — planned maintenance, a crash, a network partition — meta quorum is lost and the JetStream API stalls cluster-wide. No new streams, no new consumers, no configuration changes. Existing data continues to flow through streams that still have their own quorum, but the control plane is frozen.
Server process crashed or was killed. The most straightforward cause. The NATS server process exited due to a panic, OOM kill, or manual termination. The server’s Raft voter disappears from the group.
Network partition isolating the server. The server is still running but can’t communicate with its peers. From the meta group’s perspective, the peer is offline because it’s not responding to Raft heartbeats. The server itself may believe it’s fine.
Disk failure or I/O errors. Raft requires writing log entries and snapshots to disk. If the disk is full, read-only, or experiencing I/O errors, the Raft subsystem cannot function and the server drops out of the meta group.
Resource exhaustion (OOM, CPU starvation). The server is technically running but so resource-constrained that it can’t process Raft heartbeats in time. The meta leader treats it as offline after the heartbeat timeout expires. Common on under-provisioned servers handling too many HA assets.
Server removed from cluster without peer cleanup. A server was decommissioned or replaced, but its peer entry was never removed from the meta group. The group keeps expecting it to respond, counting it as an offline peer indefinitely.
Aggressive rolling restart. During upgrades, if a server is restarted before the previously restarted server has fully rejoined the meta group, there’s a window where multiple peers appear offline simultaneously.
The most direct way to see meta group health:
```shell
nats server report jetstream
```

This displays the Raft Meta Group table with columns for each peer, including Offline (true/false), Active (time since last activity), and Lag (operations behind the leader). Any peer showing Offline: true is the problem.
The meta group report shows server names. Cross-reference with:
```shell
nats server list
```

Servers that appear in the meta group but not in server list are unreachable. Servers that appear in both but show as offline in the meta group may have JetStream-specific issues (disk, Raft state) even though the core NATS process is running.
If the server is reachable on its monitoring port:
```shell
curl -s http://<server-ip>:8222/healthz
```

A healthy response returns {"status":"ok"}. Any other response indicates a health issue. Check the specific error; it may point to JetStream storage, Raft state, or other subsystem failures.
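A minimal poller built on this endpoint might look like the following sketch; healthz_ok and check_servers are hypothetical helper names, and the monitor URLs are assumptions:

```python
import json
from urllib.request import urlopen

def healthz_ok(body: str) -> bool:
    """True only when a /healthz response body reports status ok."""
    try:
        return json.loads(body).get("status") == "ok"
    except json.JSONDecodeError:
        return False

def check_servers(monitor_urls: list[str]) -> list[str]:
    """Return the monitor URLs whose /healthz response is not healthy."""
    unhealthy = []
    for url in monitor_urls:
        try:
            with urlopen(f"{url}/healthz", timeout=2) as resp:
                body = resp.read().decode()
        except OSError:
            unhealthy.append(url)  # unreachable counts as unhealthy
            continue
        if not healthz_ok(body):
            unhealthy.append(url)
    return unhealthy
```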
For JetStream-specific state:
```shell
curl -s http://<server-ip>:8222/jsz | jq '{meta_leader: .meta_leader, meta_cluster: .meta_cluster}'
```

On the offline server (if accessible), look for:
```
[ERR] JetStream cluster unable to write to WAL
[WRN] Raft heartbeat timeout, stepping down
[ERR] Disk full, JetStream disabled
```

These indicate the specific failure that caused the server to drop out of the meta group.
```shell
# Is the server process running?
systemctl status nats-server

# Check available disk space
df -h /path/to/jetstream/store

# Check system memory
free -h

# Check for OOM kills
dmesg | grep -i "out of memory"
```

If the server crashed, restart it. The simplest fix is to bring the server back:
```shell
systemctl start nats-server
```

The server will rejoin the meta group automatically, catch up on missed Raft entries, and resume voting. Monitor the meta group report to confirm it transitions from Offline: true to Offline: false and the Lag drops to zero.
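Rejoin progress can also be checked programmatically against the /jsz payload; this sketch assumes the meta_cluster.replicas entries expose name, offline, and lag fields, and peer_caught_up is a hypothetical helper:

```python
def peer_caught_up(jsz: dict, peer_name: str) -> bool:
    """True once the named meta peer is online and shows zero Raft lag."""
    for peer in jsz.get("meta_cluster", {}).get("replicas", []):
        if peer.get("name") == peer_name:
            return not peer.get("offline", False) and peer.get("lag", 0) == 0
    return False  # peer not listed: treat as not caught up
```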
If disk is full, free space first. The server can’t rejoin the meta group if it can’t write Raft log entries:
```shell
# Check what's consuming space
du -sh /path/to/jetstream/store/*

# If JetStream data is the culprit, consider purging low-priority streams
# after the server is running again
```

If the server is permanently gone, remove its peer. Don't leave ghost peers in the meta group; remove it via nats server cluster peer-remove. This requires the remaining members to have quorum:
```shell
nats server cluster peer-remove <server-name>
```

This tells the meta group to stop expecting that peer. Quorum requirements adjust to the new, smaller group size. Note: this command can only succeed when the remaining peers still form a quorum; if quorum is already lost, you must restore enough peers first.
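Before running the removal, you can sanity-check the majority requirement with a one-liner; removal_can_proceed is an illustrative helper, not a CLI feature:

```python
def removal_can_proceed(group_size: int, online_peers: int) -> bool:
    """peer-remove only succeeds while surviving peers hold a Raft majority."""
    return online_peers >= group_size // 2 + 1

# Three-node group, two peers still online: removal can proceed.
# Three-node group, one peer online: quorum is already lost,
# so restore peers before attempting removal.
```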
Use lame duck mode for planned maintenance. Before stopping a server for upgrades or maintenance, gracefully drain it:
```shell
# SIGUSR2 puts the server in lame-duck mode
nats-server --signal ldm=<pid>
```

Sending SIGUSR2 to the process directly has the same effect: the server enters lame duck mode, migrates leaders away, and shuts down cleanly. This ensures stream and meta leaders are moved before the server disappears.
Monitor server health proactively. Don’t wait for operators to notice an offline replica:
```go
// Go: monitor meta group health via the system account
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222",
		nats.UserInfo("sys_user", "sys_pass"),
	)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// Request JetStream server info
	resp, err := nc.Request("$SYS.REQ.SERVER.PING.JSZ", nil, 2*time.Second)
	if err != nil {
		panic(err)
	}
	// Parse and check for offline peers
	fmt.Println(string(resp.Data))
}
```

```python
# Python: check meta group via /jsz
import aiohttp

async def check_meta_replicas(monitor_urls: list[str]):
    async with aiohttp.ClientSession() as session:
        for url in monitor_urls:
            async with session.get(f"{url}/jsz") as resp:
                data = await resp.json()
                meta = data.get("meta_cluster", {})
                for peer in meta.get("replicas", []):
                    if peer.get("offline"):
                        print(f"ALERT: Meta peer {peer['name']} is offline")
```

Ensure adequate disk headroom. Set storage reservations to leave at least 10-20% of total disk for Raft WAL, snapshots, and OS overhead. Monitor disk usage at the OS level, not just JetStream utilization.
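The headroom rule can be enforced with a simple OS-level check; the function names and the 15% threshold below are illustrative:

```python
import shutil

def headroom_ok(total: int, free: int, min_free_fraction: float = 0.15) -> bool:
    """True while at least min_free_fraction of the volume remains free."""
    return free / total >= min_free_fraction

def store_headroom_ok(store_path: str) -> bool:
    """Check the filesystem backing the JetStream store directory."""
    usage = shutil.disk_usage(store_path)
    return headroom_ok(usage.total, usage.free)
```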
Use odd-numbered cluster sizes of at least three. A three-node meta group tolerates one failure. A five-node group tolerates two. For production workloads where JetStream availability is critical, five nodes provides significantly better resilience during rolling upgrades and unexpected failures.
Automate server replacement. In orchestrated environments (Kubernetes, Nomad), configure your orchestrator to automatically replace failed NATS pods. Combined with peer removal for permanently lost servers, this keeps the meta group healthy without manual intervention.
Implement health-based alerting. Alert on the /healthz endpoint and on the meta group’s offline replica count.
Synadia Insights evaluates meta group health every collection epoch and fires a critical alert the moment any replica goes offline, giving you time to act before a second failure risks quorum loss.
Yes, as long as a quorum of peers remains online. In a three-node cluster, two of three peers maintain quorum. Existing streams and consumers continue operating normally. JetStream API requests are processed by the meta leader. The risk is that you’ve lost all fault tolerance — one more failure breaks quorum.
Typically seconds to a few minutes, depending on how much Raft log the server missed while offline. The server requests a snapshot from the leader and replays any log entries since that snapshot. For a server that was only offline briefly, this is nearly instant. For extended outages with high JetStream API activity, the catch-up may take longer.
If the server will be restored (planned maintenance, recoverable crash), wait. The server will rejoin automatically and catch up. If the server is permanently lost (hardware failure, decommissioned), remove it with nats server cluster peer-remove. Leaving a dead peer in the group permanently reduces your fault tolerance.
The meta group manages the JetStream control plane — stream and consumer placement, API operations, and cluster-wide state. A meta offline replica (META_001) affects the ability to manage JetStream. Stream offline replicas affect individual stream availability and replication. A meta outage is broader in impact because it blocks all JetStream management operations, while a stream replica outage only affects that specific stream.
After restoring the replica, check for replication lag:
```shell
nats server report jetstream
```

Verify the Lag column for the restored peer shows zero. Also check individual streams for replica lag with nats stream report. If the server was offline long enough that Raft logs were compacted, the leader will send a full snapshot; this is automatic and doesn't require intervention.