
NATS Leaderless Raft Group: What It Means and How to Fix It

Severity: Critical
Category: Health
Applies to: System Improvement
Check ID: OPT_SYS_009
Detection threshold: Raft group has no elected leader — detected within 10 seconds of quorum loss (leader = '' AND ever_had_leader = true)

A leaderless Raft group is a stream, consumer, or meta cluster group that previously had a leader but currently has none. Without a leader, the group cannot process any writes — stream publishes stall, consumer deliveries stop, or JetStream API calls fail, depending on which group is affected.

Why this matters

Every replicated JetStream asset operates through Raft consensus. The leader coordinates all writes: accepting published messages for streams, tracking acknowledgments for consumers, or managing asset placement for the meta group. When a group has no leader, it has no coordinator — all operations that require consensus are blocked.

A brief leaderless state during a leader election is normal. Elections happen in milliseconds after a leader steps down or fails. What’s abnormal is a persistent leaderless state — the group can’t complete an election and remains stuck. This check specifically targets groups that had a leader before (ruling out newly created groups that haven’t elected yet), meaning something went wrong with the election process.

The impact depends on which Raft group is leaderless. A leaderless stream group means publishes to that stream return errors or time out. A leaderless consumer group means message delivery stops entirely for that consumer. A leaderless meta group is the worst case — all JetStream API operations across the entire cluster fail. In each case, the group is effectively frozen until a leader is elected.

Common causes

  • Election loop from network instability. Raft candidates start elections by requesting votes from peers. If network latency is high or packets are being dropped, vote requests and responses time out before reaching quorum. The election restarts, times out again, and the group cycles without ever electing a leader.

  • All candidates have stale Raft logs. Raft requires that a leader’s log be at least as up-to-date as a quorum of peers. If replicas were partitioned and diverged, candidates may reject each other’s vote requests because each candidate’s log is missing entries the others have. This is rare but can happen after complex failure/recovery sequences.

  • Server overload causing election timeouts. Raft elections have timeout windows. If the servers hosting the Raft group are under heavy CPU or I/O load, the election process can’t complete within the timeout. The election resets, but the load that caused the timeout hasn’t changed — creating a loop.

  • Disk I/O blocking Raft operations. Raft writes proposals and votes to disk before responding. If disk I/O is saturated (from stream writes, snapshots, or other processes), the election protocol can’t persist state fast enough to complete within timeouts.

  • Corrupt Raft state. Rare, but possible after unclean shutdowns, disk errors, or filesystem corruption. A replica with corrupt state may participate in elections but fail to process the result, preventing the group from settling on a leader.

  • Bug in Raft implementation. Very rare in stable releases but possible, especially in edge cases with specific timing or failure patterns. If other causes are ruled out, this may warrant a bug report to the nats-server repository.

How to diagnose

Check if the leaderless state is transient

A leaderless group is detected within 10 seconds of quorum loss. Wait 30 seconds after detection. Transient leaderless states during routine elections resolve quickly. If the group is still leaderless after 30 seconds, the election is stuck.
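The wait-and-recheck step can be scripted as a small poll loop. This is an illustrative sketch only: `fetch_info` is a hypothetical callable you would implement, returning the parsed JSON of `nats stream info <stream-name> --json`.

```python
# Sketch: poll a stream's leader status for up to 30 seconds before alerting.
# fetch_info is a hypothetical helper returning the parsed JSON of
# `nats stream info <stream-name> --json`.
import time

def has_leader(info: dict) -> bool:
    """True if the stream's Raft group currently reports a leader."""
    cluster = info.get("cluster") or {}
    return bool(cluster.get("leader"))

def wait_for_leader(fetch_info, timeout_s: float = 30.0, interval_s: float = 2.0) -> bool:
    """Re-check until a leader appears or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if has_leader(fetch_info()):
            return True  # transient election; no action needed
        time.sleep(interval_s)
    return False  # still leaderless after the window: the election is stuck
```

If this returns False, continue with the diagnosis steps below.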

Identify the leaderless group

# For streams — Leader field will be empty
nats stream info <stream-name>
# For consumers
nats consumer info <stream-name> <consumer-name>
# For the meta group
nats server report jetstream

For a leaderless group, the Leader field is empty, and the replica list shows peers with none designated as leader.

Check cluster connectivity

nats server list

Verify all servers in the Raft group are online and visible. Ensure a quorum of peers is online and reachable. If a server is missing, the group may lack quorum to elect a leader — this is more accurately a quorum loss issue (JETSTREAM_008 or CONSUMER_003) than a leaderless issue. Check server logs for election failures or network partition indicators.
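The quorum test itself is simple arithmetic over the replica list. A minimal sketch, assuming the list covers every peer in the group and that each peer entry carries an `offline` flag (the JetStream peer-info shape, treated here as an assumption):

```python
# Sketch: decide whether a Raft group still has quorum, given a replica
# list like the one in `nats stream info --json`. The "offline" field name
# is an assumption about the peer-info shape.
def has_quorum(replicas: list) -> bool:
    total = len(replicas)
    online = sum(1 for r in replicas if not r.get("offline", False))
    return online > total // 2  # a strict majority is required to elect a leader
```

If `has_quorum` is False, treat this as quorum loss (JETSTREAM_008 or CONSUMER_003) rather than a stuck election.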

Watch for election activity

nats event --js-advisory

Leader election advisories on $JS.EVENT.ADVISORY.STREAM.LEADER_ELECTED and $JS.EVENT.ADVISORY.CONSUMER.LEADER_ELECTED fire when elections succeed. If you see no election events for the affected group, elections are either not starting or not completing.
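When consuming these advisories programmatically, you can filter events for the affected group by subject alone. The subject layout below (the advisory name with the stream name appended as the final token) is an assumption based on the advisory subjects named above:

```python
# Sketch: filter leader-elected advisories for one stream by subject.
# The trailing-stream-name subject layout is an assumption.
STREAM_ELECTED_PREFIX = "$JS.EVENT.ADVISORY.STREAM.LEADER_ELECTED."

def is_election_for_stream(subject: str, stream: str) -> bool:
    """True if this advisory announces a new leader for the given stream."""
    return subject == STREAM_ELECTED_PREFIX + stream
```

No matching events for the affected group during the observation window means elections are not completing.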

Check server resource pressure

# Check CPU and connections
nats server list
# Check disk I/O on the servers hosting the group
iostat -xz 1 5

High CPU (see SERVER_003) or high disk I/O latency can prevent elections from completing within Raft’s timeout windows.

Look at server logs

Server logs will show Raft election activity — vote requests, vote grants, election timeouts. Repeated election timeout messages indicate the election loop:

[WRN] JetStream cluster - Loss of stream quorum for 'ORDERS'
[INF] JetStream cluster - Stream 'ORDERS' leader election started

How to fix it

Immediate: force a new election

Step down the group to trigger a fresh election. Even though the group is leaderless, issuing a step-down request can reset election state and break out of a stuck cycle:

# For streams
nats stream cluster step-down <stream-name>
# For consumers
nats consumer cluster step-down <stream-name> <consumer-name>

If the step-down command fails (there is no leader to receive it), try restarting one of the servers in the Raft group. A restart clears that server's in-memory election state and triggers a new election round.

Short-term: fix the underlying cause

Resolve network issues between Raft peers. If elections are failing due to network timeouts, fix the network path:

# Check RTT between cluster servers
nats server list

RTT between cluster peers should be consistently under 10ms in the same datacenter. High or variable RTT causes election timeouts.
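To turn the RTT column into an alertable signal, a small parser helps. This is a sketch; the suffix formats (`ms`, `µs`, `s`) are an assumption about how `nats server list` renders RTT values:

```python
# Sketch: flag cluster peers whose RTT exceeds the 10ms guideline.
# RTT string formats ("2ms", "900µs", "1.2s") are assumptions.
def rtt_ms(rtt: str) -> float:
    """Convert an RTT string to milliseconds."""
    s = rtt.strip()
    if s.endswith("ms"):
        return float(s[:-2])
    if s.endswith("µs") or s.endswith("us"):
        return float(s[:-2]) / 1000.0
    if s.endswith("s"):
        return float(s[:-1]) * 1000.0
    raise ValueError(f"unrecognized RTT: {rtt}")

def slow_peers(rtts: dict, threshold_ms: float = 10.0) -> list:
    """Return peers over the threshold, given a name -> RTT-string mapping."""
    return [name for name, r in rtts.items() if rtt_ms(r) > threshold_ms]
```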

Reduce server load. If CPU or disk I/O pressure is preventing elections from completing:

# Check for hot subjects or high connection counts
nats server report connections --sort out-msgs

Consider temporarily reducing workload (pausing publishers, draining non-critical connections) to give the election process room to complete.

Check and repair disk I/O. If disk latency is the bottleneck, validate the storage health:

# Check disk performance
fio --name=write-test --rw=write --bs=4k --size=100M --runtime=10 --fdatasync=1 --directory=/path/to/jetstream/store

If the JetStream store_dir is on slow storage, migrating to SSD is essential for stable Raft operations.

Long-term: prevent leaderless states

Use fast, dedicated storage for JetStream. Raft’s election protocol depends on timely disk writes. SSDs with consistent latency prevent election timeouts caused by I/O spikes:

nats-server.conf
jetstream {
    store_dir: "/fast-ssd/nats/jetstream"
}

Monitor Raft group health continuously. Set up alerts for leaderless groups so you catch stuck elections before users notice:

// Go: check stream leader status
nc, err := nats.Connect(url)
if err != nil {
    log.Fatal(err)
}
js, _ := nc.JetStream()
info, err := js.StreamInfo("ORDERS")
if err != nil {
    log.Fatal(err)
}
// Cluster is nil for unreplicated streams; an empty Leader means leaderless
if info.Cluster == nil || info.Cluster.Leader == "" {
    log.Printf("ALERT: stream ORDERS has no leader")
}
# Python: monitor for leaderless groups
import nats

async def check_leaders():
    nc = await nats.connect()
    js = nc.jetstream()
    info = await js.stream_info("ORDERS")
    if info.cluster is None or not info.cluster.leader:
        print("ALERT: stream ORDERS is leaderless")
    await nc.close()

Ensure cluster sizes are odd. Even-numbered clusters risk split-vote scenarios where two candidates each get exactly half the votes and neither achieves majority. Odd-numbered clusters eliminate this:

# Use 3 or 5 servers, never 2 or 4

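The split-vote argument follows from quorum arithmetic: a strict majority of n peers is n // 2 + 1, so adding a fourth peer raises the votes needed without raising the number of failures the group survives. A quick sketch:

```python
# Sketch: why even-sized Raft groups buy no extra fault tolerance.
def majority(n: int) -> int:
    """Votes needed to elect a leader in a group of n peers."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Peers that can fail while the group can still elect a leader."""
    return n - majority(n)

# 3 and 4 peers both tolerate only 1 failure; 4 merely adds split-vote risk.
```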
Maintain headroom on server resources. Raft elections need CPU time and disk bandwidth. If servers routinely run at 90%+ CPU or disk utilization, elections during disruptions are more likely to time out.

Synadia Insights detects leaderless Raft groups automatically and distinguishes transient election states from persistent leadership failures, alerting only when a group that previously had a leader remains leaderless.

Frequently asked questions

What’s the difference between a leaderless Raft group and quorum loss?

Quorum loss (JETSTREAM_008, CONSUMER_003) means not enough replicas are online to form a majority. A leaderless group may have enough replicas online but can’t complete an election — all peers are present, but they can’t agree on a leader. Quorum loss is a membership problem; leaderless is an election problem. In practice, quorum loss always causes leaderlessness, but leaderlessness can occur with full membership.

Is a brief leaderless state during failover normal?

Yes. When a leader fails or steps down, the group is briefly leaderless while a new election runs. This typically lasts milliseconds to low-single-digit seconds. Clients may see a brief timeout on publishes or fetches during this window. This check only fires for groups that remain leaderless beyond the normal election window, and only for groups that previously had a leader.

Can a leaderless stream accept publishes?

No. Without a leader, the stream has no coordinator to accept and replicate messages. Publish requests to a leaderless stream will return a “no responders” error or time out. The NATS client will surface this as a publish error. If using the Go client’s js.PublishAsync, the pending ack future will fail.

How do I tell which Raft group is leaderless if I have hundreds of streams?

Use nats stream report to get a summary of all streams with their leader status. Leaderless streams will show no leader in the cluster column. For consumers, nats consumer report --all provides similar visibility. Synadia Insights provides a single-pane view of all Raft groups and their leadership status.
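To script the scan, the check reduces to testing the cluster.leader field of each stream’s info document. A minimal sketch, assuming you have already collected the parsed JSON of `nats stream info --json` for each stream:

```python
# Sketch: find every leaderless stream from a mapping of
# stream name -> parsed `nats stream info --json` document (assumed input).
def leaderless_streams(infos: dict) -> list:
    bad = []
    for name, info in infos.items():
        cluster = info.get("cluster") or {}
        if not cluster.get("leader"):
            bad.append(name)
    return sorted(bad)
```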

Should I recreate the stream or consumer if it stays leaderless?

As a last resort, yes. If a Raft group is persistently leaderless despite all peers being online and healthy, there may be corrupted Raft state. Delete and recreate the stream or consumer from configuration. For streams, ensure you have a backup or mirror to recover data. For consumers, recreating resets the delivery position — use the opt_start_seq or opt_start_time option to resume from the appropriate position.

Proactive monitoring for NATS leaderless Raft groups with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial