
NATS JetStream Subsystem Unhealthy: What It Means and How to Fix It

Severity: Critical
Category: Health
Applies to: Server
Check ID: SERVER_014
Detection threshold: JETSTREAM-type healthz error reported by server

The JetStream subsystem on a NATS server is reporting unhealthy — it has lost contact with the meta leader, is not current with the Raft log, or is still recovering streams and consumers from disk after a restart.

Why this matters

Every JetStream operation — creating streams, publishing to streams, consuming messages, updating configurations — requires the server to participate in a functioning JetStream cluster. When the JetStream subsystem is unhealthy, the server cannot serve any JetStream API requests. Publishes to streams hosted on this server will fail or time out. Consumers assigned to this server will stop delivering messages. If the server was the stream or consumer leader, leadership must transfer to another replica — and if quorum is already marginal, that transfer may not succeed.

The healthz endpoint is the server’s self-assessment of its operational readiness. A JETSTREAM-type healthz error means the server itself has determined that its JetStream layer is not functioning. This is different from an external observation of degraded performance — the server is explicitly declaring it cannot participate. Load balancers, orchestrators, and monitoring systems that check the healthz endpoint will correctly mark this server as unavailable, but clients already connected may experience timeouts and errors until they fail over.

In clustered deployments, a single server with an unhealthy JetStream subsystem reduces the cluster’s capacity and fault tolerance. If the cluster is running R3 streams and one server’s JetStream is down, those streams are operating at R2 — one more failure away from quorum loss. Extended JetStream unhealthiness on multiple servers simultaneously can cascade into full JetStream unavailability, halting all persistent messaging workloads.

Common causes

  • No contact with the meta leader. The server cannot reach the meta leader node that coordinates JetStream cluster operations. This typically indicates a network partition, the meta leader being down, or the meta group having lost quorum entirely.

  • Server is not current with the Raft log. After a restart or network partition, the server needs to catch up on Raft log entries from the meta leader. Until it is current, it reports as unhealthy. Large clusters with many streams and consumers may take significant time to replay the log.

  • Stream and consumer recovery still in progress. On startup, the server must recover all locally stored streams and consumers from disk. Servers with hundreds of streams or large Raft WAL files may take minutes to complete recovery. The server reports unhealthy until this process finishes.

  • Disk I/O saturation. Slow disk performance — from hardware degradation, noisy neighbors on shared storage, or filesystem-level issues — can stall stream recovery and Raft log replay, keeping the subsystem in an unhealthy state far longer than expected.

  • Clock skew between cluster members. Significant clock drift between servers can interfere with Raft leader election timeouts and heartbeat intervals, causing the server to lose contact with the meta leader or fail to participate in consensus. A quick way to compare clocks across hosts is sketched after this list.

  • Server was removed from the cluster but not reconfigured. If a server’s peer was removed from the meta group (via nats server cluster peer-remove) but the server is still running with JetStream enabled, it will continuously fail to join the meta group and report unhealthy.
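To rule out clock skew, compare the system clocks on each cluster host. A minimal sketch using common Linux tooling (chrony is an assumption; substitute your NTP client):

Terminal window
# Run on every cluster host and compare the results
timedatectl status     # shows whether the system clock is NTP-synchronized
chronyc tracking       # shows the current offset from the NTP reference, if chrony is in use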

How to diagnose

Check the server healthz endpoint

The healthz endpoint gives a direct readout of what is unhealthy and why:

Terminal window
nats server request healthz --name <server_name>

Look for the error field in the response. Common messages include:

  • "JetStream not current" — Raft log replay is in progress
  • "JetStream no meta leader" — Cannot contact or elect a meta leader
  • "JetStream stream recovery incomplete" — Still recovering streams from disk

Check JetStream cluster status

Terminal window
nats server report jetstream

This shows each server’s JetStream state including whether it is online, its storage usage, and the number of streams and consumers it hosts. Servers with an unhealthy JetStream subsystem may show as offline or with missing stream counts.

Check meta group status

Terminal window
nats server report jetstream --all

Verify that the meta leader is elected and that the unhealthy server appears in the peer list. If the server is missing from the peer list, it may have been removed or may not be connecting to the cluster.

Check server logs

Server logs will contain entries explaining the JetStream health failure:

[WRN] JetStream health check failed: no meta leader
[INF] JetStream stream recovery: 142/350 streams recovered
[ERR] JetStream not able to contact meta leader
These entries clarify whether the issue is leadership, recovery progress, or connectivity.

Check system resource pressure

Terminal window
# Check resource usage reported by the server itself
nats server info <server_name>
# OS-level disk metrics (latency, utilization) come from top/iostat on the host itself

If recovery is stalled, verify disk latency and throughput on the affected server. Slow storage is the most common reason for extended recovery times.
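If iostat is available on the host, a minimal sketch for spotting disk saturation while recovery runs:

Terminal window
# Sample extended device statistics every 5 seconds; high await or %util on the JetStream store device points to saturated storage
iostat -x 5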

How to fix it

Immediate: determine the root cause

Check if it is just recovery time. If the server recently restarted, the unhealthy status during recovery is expected. Monitor the server logs for recovery progress. A server with many streams may legitimately need several minutes:

Terminal window
# Watch server logs for recovery progress
nats server request healthz --name <server_name>

Re-check periodically. If the stream count in the recovery log messages is progressing, the server is recovering normally and will transition to healthy once complete.
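A minimal sketch for polling the health check until it clears, assuming the watch utility is available:

Terminal window
# Re-run the health check every 10 seconds and watch for the error field to disappear
watch -n 10 'nats server request healthz --name <server_name>'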

Verify meta leader availability. If the error is “no meta leader,” the problem is cluster-wide, not specific to this server:

Terminal window
nats server report jetstream

If no meta leader is elected, address that first — see the Meta Quorum Lost check. This server cannot become healthy until the meta group has a functioning leader.

Short-term: restore JetStream health

Restart the server if recovery is stalled. If the server has been in recovery for an unusually long time with no progress in the logs, a restart can reset the recovery process:

Terminal window
nats-server --signal quit # graceful shutdown of the running server
# Or restart via the service manager
systemctl restart nats-server

Fix network connectivity. If the server cannot reach the meta leader, verify network connectivity between cluster members:

Terminal window
# Check route connections from the unhealthy server
nats server report connections --host <server_name> --sort rtt

Ensure firewall rules, security groups, and DNS resolution allow the server to reach all cluster peers on the configured cluster port.
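Basic reachability checks from the unhealthy server's host can confirm whether the cluster port is open. A minimal sketch, assuming the default cluster port 6222 and a peer named nats-2 (both placeholders), with nc and getent availability assumed:

Terminal window
# Can this host reach a peer's cluster port?
nc -vz nats-2 6222
# Does the peer hostname resolve consistently on this host?
getent hosts nats-2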

Remove and re-add a stuck peer. If a server was improperly removed from the meta group or has corrupted Raft state, clean its JetStream state and let it rejoin from scratch:

Terminal window
# 1. Confirm the peer is stuck — non-current with no recovery progress between intervals
nats server report jetstream
# 2. Remove the stuck peer from the meta group (run from a healthy server with system account access)
nats server raft peer-remove <server_name>
# 3. Stop the affected server and clean its JetStream store before restarting so it rejoins clean
systemctl stop nats-server
# Paths assume store_dir is /data; quote the glob so the shell does not expand the literal $G account directory
rm -rf /data/jetstream/_meta_/* '/data/jetstream/$G/_raft_/'*
systemctl start nats-server

After the server starts, monitor nats server report jetstream until it shows current and JetStream-enabled. The meta leader will assign it stream and consumer replicas as quorum allows.

Watch JetStream health programmatically. During remediation, verify from a client when the subsystem comes back online — account_info() succeeds only when the JetStream subsystem is healthy enough to serve API calls:

// Go — poll JetStream health from a client
// Assumes "log" and "github.com/nats-io/nats.go" are imported
nc, err := nats.Connect(url)
if err != nil {
    log.Fatalf("connect failed: %v", err)
}
defer nc.Close()
js, _ := nc.JetStream()

info, err := js.AccountInfo()
if err != nil {
    log.Printf("JetStream unavailable: %v", err)
    // Trigger alerting or failover logic
} else {
    log.Printf("JetStream OK: %d streams", info.Streams)
}

# Python — check JetStream availability
import nats

async def check_js_health():
    nc = await nats.connect(server_url)
    js = nc.jetstream()
    try:
        info = await js.account_info()
        print(f"JetStream OK: {info.streams} streams")
    except Exception as e:
        print(f"JetStream unhealthy: {e}")
    await nc.close()

Long-term: prevent recurrence

Right-size the number of streams per server. Servers hosting hundreds of streams will have long recovery times after every restart. Distribute streams across more servers or consolidate small streams to reduce per-server stream count. Target recovery times under 60 seconds by keeping stream counts manageable.
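To see how streams are currently distributed across servers, the stream report in the nats CLI gives a per-stream view of placement and size:

Terminal window
# List streams with their replica placement, message counts, and sizes
nats stream report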

Use fast local storage. JetStream recovery time is dominated by disk I/O. NVMe SSDs dramatically reduce recovery time compared to spinning disks or network-attached storage. Avoid shared storage where noisy neighbors can spike latency during recovery.
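A rough way to gauge synchronous write latency on the JetStream volume is a small dd test. This is a minimal sketch, assuming the store lives under /data/jetstream and a temporary test file there is acceptable:

Terminal window
# Write 1000 x 4 KiB blocks, syncing each write; low throughput here points to slow storage for Raft and stream writes
dd if=/dev/zero of=/data/jetstream/ddtest bs=4k count=1000 oflag=dsync
rm /data/jetstream/ddtest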

Implement healthz-based readiness probes. In Kubernetes or load-balancer configurations, use the /healthz endpoint as a readiness check so that traffic is not routed to servers that are still recovering:

# Kubernetes readiness probe
readinessProbe:
  httpGet:
    path: /healthz
    port: 8222
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 12

Monitor JetStream health continuously. Synadia Insights evaluates JETSTREAM-type healthz errors every collection epoch, catching subsystem failures within seconds. Configure alerting on this check to respond before unhealthy servers impact stream quorum.

Maintain adequate cluster sizing. Run at least 3 JetStream-enabled servers (5 for large deployments) so that one server in recovery does not put stream quorum at risk. With R3 replication and 5 servers, one server recovering still leaves every affected R3 stream with two of its three replicas available — enough to maintain quorum.
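For reference, a minimal sketch of one server's configuration in a three-node JetStream cluster (server names, hostnames, and the store path are placeholders):

# nats-1.conf: one of three JetStream-enabled servers
server_name: nats-1
jetstream {
  store_dir: /data/jetstream
}
cluster {
  name: my-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats://nats-1:6222,
    nats://nats-2:6222,
    nats://nats-3:6222
  ]
}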

Frequently asked questions

How long should JetStream recovery take after a restart?

Recovery time depends on the number of streams, consumers, and the size of Raft WAL files on disk. A server with 10-20 streams on NVMe storage typically recovers in under 10 seconds. A server with 200+ streams on slower storage may take 2-5 minutes. If recovery exceeds 10 minutes, investigate disk I/O or corrupted store files.

Does an unhealthy JetStream subsystem affect core NATS?

No. Core NATS publish/subscribe, request/reply, and queue groups operate independently of JetStream. Clients using only core NATS are unaffected by JetStream subsystem health. Only JetStream operations — stream publishes, consumer fetches, stream management APIs — are impacted.

Can I force a server to skip recovery and start healthy?

No. The server must complete stream recovery to ensure data consistency. Skipping recovery would mean the server has streams and consumers in an unknown state, which would cause data loss or inconsistency when it participates in Raft consensus. If recovery is stuck, address the underlying disk or corruption issue rather than trying to bypass it.

What happens to messages published during the unhealthy period?

For replicated streams (R > 1), publishes are handled by the remaining healthy replicas. The stream leader will be on another server, and publishes continue normally as long as quorum is maintained. The recovering server catches up on missed Raft entries after it becomes healthy. For R1 streams hosted on the unhealthy server, publishes will fail with a timeout or unavailable error until the server recovers.
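On the client side, a simple way to ride out a short unhealthy window is to retry failed publishes with a backoff. A minimal sketch in Go, assuming the nats.go client; the function name, retry budget, and backoff are arbitrary choices, not part of the library:

// publishWithRetry retries a JetStream publish while the subsystem recovers.
// Assumes: import ("log"; "time"; "github.com/nats-io/nats.go")
func publishWithRetry(js nats.JetStreamContext, subject string, payload []byte) (*nats.PubAck, error) {
    var ack *nats.PubAck
    var err error
    for attempt := 1; attempt <= 5; attempt++ {
        ack, err = js.Publish(subject, payload)
        if err == nil {
            return ack, nil
        }
        log.Printf("publish attempt %d failed: %v", attempt, err)
        time.Sleep(2 * time.Second)
    }
    return nil, err
}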

How do I tell if the problem is this server or the whole cluster?

Run nats server report jetstream and check the meta leader status. If a meta leader is elected and other servers show healthy JetStream, the problem is isolated to this server (likely recovery or connectivity). If no meta leader is elected or multiple servers show issues, the problem is cluster-wide — start with the Meta Quorum Lost check instead.

Proactive monitoring for an unhealthy NATS JetStream subsystem with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial
Cancel anytime.