
NATS Meta Quorum Lost: What It Means and How to Fix It

Severity: Critical
Category: Health
Applies to: Meta Cluster
Check ID: META_006
Detection threshold: Offline meta cluster peers >= quorum needed

The meta cluster has lost quorum — enough peers are offline that the remaining servers cannot form a Raft majority. Without meta quorum, the JetStream API is completely stalled: no stream creation, no consumer creation, no configuration changes, and no placement decisions. This is a cluster-wide JetStream control plane outage.

Why this matters

The meta cluster is the brain of JetStream. It manages the registry of all streams and consumers, handles placement decisions (which server hosts which replica), processes every JetStream API request, and coordinates configuration changes across the cluster. All of this runs on Raft consensus, which requires a quorum — a strict majority of peers — to commit any operation.

When quorum is lost, no new Raft entries can be committed. The meta leader (if one was elected) cannot make progress because it can’t get acknowledgments from enough peers. In practice, this means the JetStream API hangs. Any client calling js.AddStream(), js.AddConsumer(), js.StreamInfo(), or similar operations will time out. Kubernetes operators, CI/CD pipelines, and automation that creates JetStream resources all stall. The cluster cannot self-heal — it can’t rebalance replicas, reassign leaders, or respond to topology changes.
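
If clients must not hang on a stalled control plane, bound the client-side wait so quorum loss surfaces as a fast, explicit timeout. A minimal Go sketch using the nats.go MaxWait option (the URL and stream name are placeholders):

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatalf("connect failed: %v", err)
	}
	defer nc.Close()

	// Bound every JetStream API call to 5s so a stalled meta group
	// fails fast instead of hanging callers indefinitely.
	js, err := nc.JetStream(nats.MaxWait(5 * time.Second))
	if err != nil {
		log.Fatalf("jetstream context: %v", err)
	}

	// With meta quorum lost, this returns a timeout error within 5s.
	if _, err := js.AddStream(&nats.StreamConfig{Name: "ORDERS"}); err != nil {
		log.Printf("stream create failed (control plane stalled?): %v", err)
	}
}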

The critical nuance: existing data streams may continue operating. If a stream’s own Raft group (separate from the meta group) still has quorum — say, two of three stream replicas are on servers that are still running — that stream can still accept publishes and deliver messages to consumers. But you can’t modify it, can’t create new consumers for it, and can’t query its full state through the JetStream API. In effect, existing healthy streams keep serving reads and writes while the control plane is frozen. This is a dangerous state because any additional failure (a stream losing its own quorum, a consumer needing reassignment) cannot be handled.

Common causes

  • Multiple servers down simultaneously. The most direct cause. In a three-node cluster, losing two servers means only one remains — well below the quorum of two. This can happen during aggressive rolling restarts, concurrent hardware failures, or a shared infrastructure failure (same rack, same power circuit, same hypervisor host).

  • Network partition splitting the cluster below quorum. A network failure isolates enough servers that no group of peers can form a majority. In a three-node cluster, a partition that isolates each server individually means no group has two peers — quorum is impossible on any side.

  • Aggressive rolling restart. The most common operational cause. If you restart the next server before the previously restarted one has fully rejoined the meta group, two servers are out of the meta group at once. In a three-node cluster, that’s quorum lost. This happens when restart scripts don’t wait for meta group health between restarts.

  • Shared infrastructure failure. Two of three servers running on the same cloud availability zone, hypervisor host, or network switch. When that shared resource fails, multiple servers go down simultaneously. This defeats the purpose of replication.

  • Cascading resource exhaustion. One server goes down due to disk full. The increased load on the remaining servers causes a second server to hit resource limits (memory, CPU, disk I/O), and it either crashes or falls out of the meta group due to missed Raft heartbeats.

How to diagnose

Confirm meta quorum is lost

Terminal window
nats server report jetstream

The Raft Meta Group table shows each peer’s state. Count the servers showing Offline: true. If offline peers >= ⌈(cluster_size)/2⌉, quorum is lost. For a three-node cluster, two offline peers mean no quorum. For five nodes, three offline means no quorum.
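
The threshold arithmetic is easy to script; a small Go sketch that computes the quorum size and the offline-peer count at which quorum is lost:

package main

import "fmt"

// quorum is the strict majority a Raft group of size n needs to commit.
func quorum(n int) int { return n/2 + 1 }

// lostAt is the number of offline peers at which quorum is lost:
// ceil(n/2), leaving fewer than a majority online.
func lostAt(n int) int { return (n + 1) / 2 }

func main() {
	for _, n := range []int{3, 5} {
		fmt.Printf("size=%d quorum=%d lost at >=%d offline\n", n, quorum(n), lostAt(n))
	}
	// Output:
	// size=3 quorum=2 lost at >=2 offline
	// size=5 quorum=3 lost at >=3 offline
}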

If the nats CLI can’t connect to any server, or JetStream requests time out, that itself is strong evidence of quorum loss.

Check the health endpoint

Terminal window
curl -s http://localhost:8222/healthz

When meta quorum is lost, the health endpoint returns an error indicating JetStream is unavailable. The response includes specific details about the failure.

For a targeted JetStream health check:

Terminal window
nats server check jetstream

This will report the meta group state and whether quorum is maintained.

Identify which servers are down

Terminal window
nats server list

Compare the listed servers against your expected cluster membership. Missing servers are the ones that need to be restored.

If you can’t reach the cluster via the nats CLI, check each server’s monitoring endpoint directly:

Terminal window
# Check each server individually
curl -s http://<server-1>:8222/healthz
curl -s http://<server-2>:8222/healthz
curl -s http://<server-3>:8222/healthz
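
To script these probes, a minimal Go sketch (the server addresses are placeholders) that checks each monitoring endpoint and reports which servers respond:

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Replace with your servers' monitoring addresses.
	servers := []string{
		"http://server-1:8222",
		"http://server-2:8222",
		"http://server-3:8222",
	}
	client := &http.Client{Timeout: 3 * time.Second}
	for _, s := range servers {
		resp, err := client.Get(s + "/healthz")
		if err != nil {
			fmt.Printf("%s: unreachable (%v)\n", s, err)
			continue
		}
		resp.Body.Close()
		// 200 means healthy; any other status means up but unhealthy.
		fmt.Printf("%s: HTTP %d\n", s, resp.StatusCode)
	}
}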

Check why servers are down

For each offline server:

Terminal window
# Is the process running?
systemctl status nats-server
# Check for OOM kills
dmesg | grep -i "out of memory"
# Check disk space
df -h /path/to/jetstream/store
# Check server logs for the shutdown reason
journalctl -u nats-server --since "1 hour ago" | tail -50

How to fix it

Immediate: restore quorum

Priority one: bring servers back online. This is the only way to restore meta quorum. Every second without quorum is a second where the JetStream control plane is frozen:

Terminal window
# Start the NATS server
systemctl start nats-server
# Verify it's running
systemctl status nats-server
# Check if it rejoined the meta group
curl -s http://localhost:8222/jsz | jq '.meta_cluster'
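
To automate that verification, a hedged Go sketch that decodes the meta_cluster section of /jsz and counts offline replicas (the struct mirrors the monitoring JSON; verify the field names against your server version):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal subset of the /jsz response.
type jsz struct {
	MetaCluster struct {
		Leader   string `json:"leader"`
		Size     int    `json:"cluster_size"`
		Replicas []struct {
			Name    string `json:"name"`
			Offline bool   `json:"offline"`
			Lag     uint64 `json:"lag"`
		} `json:"replicas"`
	} `json:"meta_cluster"`
}

func main() {
	resp, err := http.Get("http://localhost:8222/jsz")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var z jsz
	if err := json.NewDecoder(resp.Body).Decode(&z); err != nil {
		panic(err)
	}
	offline := 0
	for _, r := range z.MetaCluster.Replicas {
		if r.Offline {
			offline++
		}
	}
	fmt.Printf("meta leader=%q size=%d offline peers=%d\n",
		z.MetaCluster.Leader, z.MetaCluster.Size, offline)
}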

You need to restore enough servers to reach quorum. For a three-node cluster, bring at least one of the two offline servers back. For five nodes, restore at least one of the three offline servers (to get back to three of five).

If a server can’t start due to disk full:

Terminal window
# Free space — remove old logs, temp files, etc.
# DO NOT delete JetStream store data unless you understand the consequences
journalctl --vacuum-time=1h
find /tmp -type f -mtime +1 -delete
# Then start the server
systemctl start nats-server

If a server is permanently lost, add a replacement. Deploy a new NATS server with the same cluster configuration and JetStream enabled. It will join the meta group as a new peer. The existing meta leader will send it a snapshot, and it will catch up on Raft state:

jetstream {
  store_dir: /data/jetstream
  max_file_store: 100G
}

cluster {
  name: "production"
  routes: [
    "nats://server-1:6222"
    "nats://server-2:6222"
    "nats://server-3:6222"
  ]
}

If servers are permanently lost and quorum cannot be restored normally, you must manually recover the meta group. Stop all remaining meta servers, remove the failed peer’s Raft WAL state from the surviving servers’ JetStream store directories, and restart to re-bootstrap the meta group:

Terminal window
# On each surviving server, remove the failed peer's WAL state
# The exact path depends on your store_dir configuration
ls /data/jetstream/meta/ # Identify the peer directories
# Then restart all surviving servers
systemctl restart nats-server

Alternatively, if quorum can still be formed (e.g., you lost one of five servers), remove the peer entry:

Terminal window
nats server cluster peer-remove <lost-server-name>

This adjusts the meta group size downward, reducing the quorum requirement. In a three-node cluster where one server is permanently lost, removing the peer makes it a two-node group (quorum = 2) — which means the remaining two servers can form quorum. But a two-node meta group has zero fault tolerance, so add a replacement server promptly.

Short-term: stabilize the cluster

Wait for full meta group recovery before making changes. After restoring quorum, don’t immediately start creating streams or making configuration changes. Let the meta group fully synchronize:

Terminal window
# Monitor until all peers show Offline: false and Lag: 0
nats server report jetstream

If the outage was caused by rolling restarts, fix the procedure. Always verify meta group health between server restarts:

Terminal window
# Restart procedure: one server at a time
systemctl restart nats-server # on server-1
# Wait for it to rejoin and catch up
watch -n 2 'nats server report jetstream 2>/dev/null | head -20'
# Only proceed to next server when Offline: false and Lag: 0
systemctl restart nats-server # on server-2
# ... repeat
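
If your restart tooling needs a programmatic gate, here is a sketch of the wait step in Go, polling /healthz until the restarted server reports healthy again (the address and timeouts are assumptions to adapt):

package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitHealthy polls a server's /healthz until it returns 200 or the
// deadline passes. Use it as a gate between rolling-restart steps.
func waitHealthy(base string, deadline time.Duration) error {
	client := &http.Client{Timeout: 2 * time.Second}
	stop := time.Now().Add(deadline)
	for time.Now().Before(stop) {
		resp, err := client.Get(base + "/healthz")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("%s not healthy after %s", base, deadline)
}

func main() {
	if err := waitHealthy("http://server-1:8222", 2*time.Minute); err != nil {
		panic(err) // do NOT restart the next server
	}
	fmt.Println("server-1 healthy; safe to restart the next server")
}

This checks basic health only; keep nats server report jetstream (Offline: false, Lag: 0) as the authoritative gate for meta group catch-up.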

Long-term: prevent quorum loss

Use five-node clusters for critical workloads. A five-node meta group tolerates two simultaneous failures. This gives you room to lose one server unexpectedly and still have fault tolerance during a rolling restart of another.

Build JetStream availability checks into applications and automation, so a frozen control plane is detected before critical operations are attempted:

// Go: verify cluster health before proceeding with operations
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222", nats.UserInfo("sys_user", "sys_pass"))
	if err != nil {
		log.Fatalf("connect failed: %v", err)
	}
	defer nc.Close()

	js, _ := nc.JetStream()

	// Check JetStream availability before critical operations
	info, err := js.AccountInfo()
	if err != nil {
		log.Fatalf("JetStream unavailable (possible quorum loss): %v", err)
	}
	log.Printf("JetStream healthy: %d streams, %d consumers",
		info.Streams, info.Consumers)
}

# Python: health check before JetStream operations
import asyncio

import nats

async def ensure_js_available():
    nc = await nats.connect()
    js = nc.jetstream()

    try:
        info = await js.account_info()
        print(f"JetStream healthy: {info.streams} streams")
    except Exception as e:
        print(f"JetStream unavailable (possible quorum loss): {e}")
        raise
    finally:
        await nc.close()

asyncio.run(ensure_js_available())

Distribute servers across failure domains. Never place two meta group peers on the same rack, hypervisor host, or availability zone. Use placement constraints in your orchestrator to ensure physical separation.

Implement automated health monitoring. Alert before quorum is lost — when the first replica goes offline (META_001), you have time to act.

Synadia Insights evaluates meta quorum health every collection epoch. It alerts on offline replicas (META_001) as an early warning and on quorum loss (META_006) as a critical event, giving operators the maximum time window to intervene.

Frequently asked questions

Do existing streams and consumers keep working without meta quorum?

Yes, with caveats. Streams whose own Raft groups have quorum continue accepting publishes and delivering messages. Consumers continue processing. But nothing can be changed — no new streams, consumers, or configuration updates. If any stream loses its own quorum during the meta outage, it can’t be recovered until meta quorum is restored. The cluster is operating on borrowed time.

How long can a cluster survive without meta quorum?

Indefinitely for existing, healthy workloads. But the risk compounds over time. Any stream that loses quorum can’t be repaired. Any consumer that needs reassignment is stuck. Any new workload can’t be deployed. The longer the outage, the more likely a secondary failure creates an unrecoverable situation.

Can I force a single remaining server to become the leader?

Not safely through normal operations. Raft requires a quorum by design — this is a safety guarantee, not a limitation. If you could force a single server to become leader, you’d risk split-brain if the other servers are actually still running on the other side of a partition. If servers are truly permanently lost, remove them with nats server cluster peer-remove to reduce the group size and allow the remaining servers to form quorum.

What’s the difference between meta quorum lost and stream quorum lost?

Meta quorum lost (META_006) affects the JetStream control plane — API operations, stream management, consumer assignment. Stream quorum lost (JETSTREAM_008) affects a specific stream’s data plane — that stream can’t accept writes or elect a leader. Meta quorum loss is broader in scope but doesn’t necessarily affect data flow for healthy streams. Stream quorum loss is narrower but directly causes data unavailability for the affected stream.

How do I do rolling restarts without losing quorum?

Restart one server at a time. After each restart, run nats server report jetstream and wait until the restarted server shows Offline: false and Lag: 0 in the meta group table. Only then proceed to the next server. For a three-node cluster, this means you never have more than one server down simultaneously. For added safety, enter lame duck mode (nats-server --signal ldm=<pid>, which sends SIGUSR2 on Unix) to gracefully drain client connections before stopping the server.

Proactive monitoring for NATS meta quorum lost with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.
