NATS Stream Recovery Failure: What It Means and How to Fix It

Severity: Critical
Category: Consistency
Applies to: Server
Check ID: SERVER_015
Detection threshold: STREAM or CONSUMER-type healthz error reported by server

A NATS server has failed to recover one or more streams or consumers from disk during startup — the healthz endpoint is reporting a STREAM or CONSUMER-type error, indicating that stored data could not be loaded and the affected JetStream assets are unavailable on this server.

Why this matters

JetStream persists stream messages and consumer state to disk. When a server starts, it replays the on-disk store to reconstruct each stream’s message log and each consumer’s delivery tracking. A recovery failure means something in the stored data is unreadable — a corrupt write-ahead log (WAL), a truncated message block, a damaged consumer state file, or a filesystem-level error. The server cannot reconstruct the stream or consumer, so it skips it and reports the health error.

For replicated streams (R > 1), a recovery failure on one replica does not immediately cause data loss. The other replicas still hold the complete data, and the stream continues operating. But the cluster is now running with one fewer replica than configured. If the stream is R3, it is now effectively R2 — one more failure away from quorum loss and stream unavailability. The recovery failure must be addressed promptly to restore the intended fault tolerance.

For R1 streams, recovery failure is an outage. The single copy of the stream’s data is on this server, and if it cannot be recovered, the stream is unavailable. Messages stored in the corrupted portion may be permanently lost. This is one of the strongest arguments for running replicated streams in production — R1 streams have no redundancy against storage failures.

Common causes

  • Corrupt write-ahead log (WAL) entries. An unclean server shutdown — power loss, OOM kill, kernel panic — can leave partially written WAL entries. The Raft log reader encounters an incomplete or invalid entry and cannot proceed past it.

  • Filesystem corruption. Underlying disk errors, failed RAID rebuilds, or filesystem bugs (especially on copy-on-write filesystems under high write load) can corrupt message block files or metadata.

  • Disk full during write. If the disk filled completely while the server was writing stream data, the final blocks may be truncated. The server wrote partial data before the OS returned an ENOSPC error.

  • Storage hardware failure. Failing drives with bad sectors, SSD firmware bugs, or storage controller errors can silently corrupt data at rest. This is especially dangerous because the corruption may not manifest until the next server restart when recovery reads the affected blocks.

  • Manual modification of the JetStream store directory. Someone or some automated process deleted, moved, or modified files in the JetStream data directory (jetstream/ under the configured store dir). The server expects a specific directory structure and file format — any external modification can break recovery.

  • Version mismatch in store format. Downgrading a NATS server to an older version after it has written data in a newer store format can cause recovery failures. The older binary does not understand the newer on-disk format.

How to diagnose

Check the healthz endpoint

Terminal window
nats server request healthz --name <server_name>

The response includes the specific asset type (STREAM or CONSUMER) and the name of the affected stream or consumer. Example:

status: error
error: "stream recovery failed: ORDERS (corrupt WAL)"
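
The same information is available over HTTP on the server's monitoring port, which is convenient for automated polling. Below is a minimal Go sketch, assuming monitoring is enabled on the default port 8222 and that the response carries the status and error fields shown above; adjust the URL for your deployment.

// Query the HTTP monitoring endpoint's /healthz route and report its status.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

type healthz struct {
    Status string `json:"status"`
    Error  string `json:"error,omitempty"`
}

func main() {
    // Assumption: monitoring is enabled on the default port 8222.
    resp, err := http.Get("http://localhost:8222/healthz")
    if err != nil {
        log.Fatalf("healthz request failed: %v", err)
    }
    defer resp.Body.Close()

    var h healthz
    if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
        log.Fatalf("decode healthz response: %v", err)
    }
    if h.Status != "ok" {
        fmt.Printf("healthz reports %s: %s\n", h.Status, h.Error)
        return
    }
    fmt.Println("healthz reports ok")
}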

Identify affected streams

Terminal window
nats stream report

Compare the output against expected streams. Streams that should exist but show missing replicas or are absent entirely are likely the ones that failed recovery. Cross-reference with the healthz output.
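
This comparison can also be scripted. The following is a minimal Go sketch using the nats.go JetStream API; the expected stream names ORDERS and EVENTS are placeholders for your own inventory, and the connection URL is assumed to be the local server.

// List the streams currently reported and flag any expected stream that is absent.
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatalf("connect: %v", err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatalf("jetstream context: %v", err)
    }

    present := map[string]bool{}
    for name := range js.StreamNames() {
        present[name] = true
    }

    // Placeholder inventory: replace with the streams you expect to exist.
    for _, want := range []string{"ORDERS", "EVENTS"} {
        if !present[want] {
            log.Printf("expected stream %q is not reported; check healthz and server logs", want)
        }
    }
}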

Check server logs for recovery errors

Server logs contain detailed error messages about what went wrong during recovery:

[ERR] JetStream stream 'ORDERS' recovery failed: corrupt raft log at index 48291
[ERR] JetStream consumer 'ORDERS > order-processor' recovery failed: bad consumer state
[WRN] Skipping stream 'EVENTS': store directory read error

These messages pinpoint the exact stream, the type of corruption, and often the specific file or log index involved.

Inspect the store directory

On the affected server, examine the JetStream store directory:

Terminal window
# Default location
ls -la /path/to/jetstream/<account>/streams/<stream>/
# Check for zero-byte files or unusual timestamps
find /path/to/jetstream/ -size 0 -name "*.blk"
find /path/to/jetstream/ -name "*.wal" -newer /tmp/reference_file

Zero-byte block files, missing index files, or WAL files with timestamps after an unclean shutdown are indicators of the corruption source.

Check disk health

Terminal window
# Look for kernel-reported disk and filesystem errors (Linux); a full fsck
# requires the server to be stopped or the partition unmounted
dmesg | grep -i "error\|fault\|bad"
# Check SMART data on the drive
smartctl -a /dev/sda

If disk hardware is failing, recovery failures will recur even after cleanup.

How to fix it

Immediate: assess the blast radius

Determine if replicas exist. For each affected stream, check whether other replicas are healthy:

Terminal window
nats stream info <stream_name>

If the stream has healthy replicas with full data, recovery on this server is not urgent for data availability — but it is urgent for fault tolerance.

For R1 streams with no other replicas, the data on this server is the only copy. Do not delete the store directory. Attempt manual recovery first.
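
As a sketch of that assessment in Go (the stream name and connection URL are placeholders), the following compares the configured replica count against the replicas the cluster currently reports as healthy:

// Classify the blast radius for one stream: an R1 stream with no healthy copy
// is an outage; a replicated stream with a failed peer is reduced fault tolerance.
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatalf("connect: %v", err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatalf("jetstream context: %v", err)
    }

    info, err := js.StreamInfo("ORDERS")
    if err != nil {
        // The stream may be unavailable cluster-wide (typical for a failed R1 stream).
        log.Fatalf("stream info: %v", err)
    }

    current := 0
    if info.Cluster != nil {
        for _, r := range info.Cluster.Replicas {
            if r.Current && !r.Offline {
                current++
            }
        }
        if info.Cluster.Leader != "" {
            current++ // the leader holds a full, current copy as well
        }
    }

    switch {
    case info.Config.Replicas == 1:
        log.Println("R1 stream: the data on the failed server is the only copy; do not delete the store")
    case current < info.Config.Replicas:
        log.Printf("replicated stream running with %d of %d healthy replicas; restore the failed peer promptly",
            current, info.Config.Replicas)
    default:
        log.Println("all configured replicas are healthy")
    }
}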

Short-term: restore the affected streams

For replicated streams — remove and let the replica rebuild. The simplest fix for a replicated stream with a corrupt local replica is to remove the bad replica’s store data and let it resync from a healthy peer:

Terminal window
# Stop the server
nats-server --signal stop
# Remove the corrupt stream's local store
rm -rf /path/to/jetstream/<account>/streams/<stream_name>/
# Restart the server
nats-server -c /path/to/nats-server.conf

On restart, the server will detect the missing local data and resync the stream from a healthy replica. This is safe because the data exists on other replicas.
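
To confirm that the resync completed, you can poll the stream until every replica reports current. The following is a minimal Go sketch; the stream name, URL, timeout, and polling interval are placeholders to adapt to your environment.

// Poll until all replicas of the stream report current, or give up after a deadline.
package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatalf("connect: %v", err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatalf("jetstream context: %v", err)
    }

    deadline := time.Now().Add(10 * time.Minute)
    for time.Now().Before(deadline) {
        info, err := js.StreamInfo("ORDERS")
        if err == nil && info.Cluster != nil {
            lagging := 0
            for _, r := range info.Cluster.Replicas {
                if !r.Current || r.Offline {
                    lagging++
                }
            }
            if lagging == 0 {
                log.Println("all replicas are current; resync complete")
                return
            }
            log.Printf("%d replica(s) still catching up", lagging)
        }
        time.Sleep(10 * time.Second)
    }
    log.Println("timed out waiting for replicas to catch up")
}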

For R1 streams — attempt WAL truncation. If the corruption is at the tail of the WAL (common after unclean shutdown), you may be able to recover by truncating the corrupt entries. This requires stopping the server, backing up the store directory, and restarting so the server can attempt the truncation itself:

Terminal window
nats-server --signal stop
# Back up the store directory first
cp -r /path/to/jetstream/<account>/streams/<stream_name>/ /path/to/backup/
# Restart — the server will attempt to truncate corrupt trailing entries
nats-server -c /path/to/nats-server.conf

Recent NATS server versions automatically truncate corrupt trailing WAL entries on startup. If this does not resolve the issue, the corruption is mid-log and manual intervention is required.

// Go - check stream health and replica status programmatically
// (assumes imports of "log" and "github.com/nats-io/nats.go"; url is your server URL)
nc, err := nats.Connect(url)
if err != nil {
    log.Fatalf("connect: %v", err)
}
defer nc.Close()

js, err := nc.JetStream()
if err != nil {
    log.Fatalf("jetstream context: %v", err)
}

info, err := js.StreamInfo("ORDERS")
if err != nil {
    log.Printf("Stream unavailable: %v", err)
} else {
    for _, r := range info.Cluster.Replicas {
        if !r.Current {
            log.Printf("Replica %s not current, lag: %d", r.Name, r.Lag)
        }
    }
}

Long-term: prevent recurrence

Use replicated streams for all production data. R1 streams have no protection against storage failures. Migrate critical R1 streams to R3:

Terminal window
nats stream edit <stream_name> --replicas 3
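
The same change can be made from code by fetching the stream's current configuration and raising its replica count. Below is a minimal Go sketch using the nats.go API; the stream name and URL are placeholders, and the cluster needs at least three JetStream-enabled servers for the update to succeed.

// Raise a stream's replica count from R1 to R3 via a config update.
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatalf("connect: %v", err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatalf("jetstream context: %v", err)
    }

    info, err := js.StreamInfo("ORDERS")
    if err != nil {
        log.Fatalf("stream info: %v", err)
    }

    cfg := info.Config
    cfg.Replicas = 3 // requires at least 3 JetStream-enabled servers in the cluster

    if _, err := js.UpdateStream(&cfg); err != nil {
        log.Fatalf("update stream: %v", err)
    }
    log.Println("ORDERS is now configured for 3 replicas")
}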

Enable filesystem journaling and integrity checking. Use ext4 with journaling or XFS for JetStream store directories. Avoid experimental or copy-on-write filesystems under heavy write workloads unless thoroughly tested.

Implement graceful shutdown procedures. Ensure servers receive SIGTERM and have time to flush pending writes before process termination. In Kubernetes, configure adequate terminationGracePeriodSeconds (at least 30 seconds for servers with many streams):

terminationGracePeriodSeconds: 60
lifecycle:
  preStop:
    exec:
      command: ["nats-server", "--signal", "ldm"]

Monitor disk health proactively. Use SMART monitoring and filesystem error counters to detect failing drives before they corrupt JetStream data. Replace drives at the first sign of read/write errors.

Back up JetStream store directories. For critical R1 streams that cannot be replicated (single-server deployments), implement regular backups of the JetStream store directory. Snapshots or rsync during low-traffic periods provide a recovery point if corruption occurs.

Frequently asked questions

Can a stream recovery failure cause data loss?

For replicated streams (R > 1), no — the data exists on other replicas and the failed replica will resync. For R1 streams, yes — if the on-disk data is corrupt and unrecoverable, messages in the corrupted portions are lost. This is why R3 replication is strongly recommended for any stream where data loss is unacceptable.

Will the server automatically retry recovery?

No. A failed recovery is final for that server startup. The server marks the stream as failed and continues operating without it. To retry recovery, you must restart the server. If the underlying corruption is not fixed, the retry will fail again.

How do I know if corruption is in the messages or the Raft log?

The server log messages distinguish between them. “corrupt raft log” or “bad WAL entry” indicates Raft log corruption. “corrupt message block” or “bad blk file” indicates message store corruption. Raft log corruption is more common after unclean shutdowns because the WAL is the most actively written file.

Should I delete the entire JetStream store directory?

Only for replicated streams where other replicas are healthy. Deleting the store for an R1 stream destroys the only copy of the data. For replicated streams, deleting the local store is the fastest path to recovery — the server resyncs from peers on restart. Always back up before deleting.

Does this check trigger during normal startup?

If recovery succeeds, even if it takes several minutes, this check does not trigger. It only fires when recovery explicitly fails — meaning the server attempted to recover a stream or consumer and could not. Slow recovery (which is normal for large datasets) triggers SERVER_014 (JetStream Subsystem Unhealthy) during the recovery window, but not this check.

Proactive monitoring for NATS stream recovery failure with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.
