
NATS Raft WAL Size Excessive: Preventing Disk and Memory Cascades

Severity: Warning
Category: Saturation
Applies to: JetStream
Check ID: OPT_SYS_023
Detection threshold: Raft WAL exceeds an absolute size threshold or a significant percentage of js_max_store

The Raft write-ahead log (WAL) is the durability mechanism for JetStream’s consensus protocol. Every stream mutation — message publish, consumer ack, metadata change — is first written to the WAL before being applied. Normally, the WAL compacts automatically as entries are committed across replicas and applied to state. When compaction stalls, the WAL grows without bound. An excessively large WAL is a ticking time bomb: it consumes disk, causes cascading OOM failures on restart, and can render a stream unrecoverable without intervention.
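The detection logic can be sketched in a few lines. The thresholds below are illustrative assumptions for the sketch, not the exact values the Insights check uses:

```python
# Sketch of the check's detection logic. ABS_THRESHOLD and PCT_THRESHOLD
# are illustrative assumptions, not the values the Insights check uses.
ABS_THRESHOLD = 10 * 1024**3   # flag any WAL over 10 GiB outright
PCT_THRESHOLD = 0.25           # or any WAL over 25% of js_max_store

def wal_size_excessive(wal_bytes: int, js_max_store_bytes: int) -> bool:
    """Return True when a Raft WAL's size warrants a warning."""
    if wal_bytes > ABS_THRESHOLD:
        return True
    return wal_bytes > PCT_THRESHOLD * js_max_store_bytes

# A 2 GiB WAL against a 4 GiB js_max_store is already 50% of the store
print(wal_size_excessive(2 * 1024**3, 4 * 1024**3))  # True
```

The two-pronged threshold matters because a WAL can be dangerous in absolute terms (restart replay cost) even when js_max_store is huge, and dangerous in relative terms even when it is small.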

Why this matters

The failure cascade from an unbounded WAL is one of the most severe failure modes in NATS JetStream deployments. Here’s how it unfolds:

  1. Disk exhaustion. The WAL grows until it fills the available disk. Once disk is full, the server can’t write to any stream or WAL on that volume, affecting all JetStream assets on the node — not just the one with the bloated WAL.

  2. Memory spike on restart. When a server restarts, it replays the WAL to rebuild in-memory state. A 50 GiB WAL means loading 50 GiB of log entries into memory during recovery. If the server doesn’t have enough RAM, it OOMs during startup.

  3. Restart loop. The server OOMs, gets restarted by the process manager, tries to replay the same WAL, OOMs again. Without intervention, the node is permanently stuck.

  4. Quorum impact. If the node is part of an R3 stream, the remaining two replicas are now running as an R2 group. If a second node hits the same issue (which is likely if the root cause affects all replicas), the stream loses quorum entirely.

The WAL is separate from the stream’s message storage. A stream with max_bytes=1GB can have a WAL that’s 10x larger if compaction has stalled. This means WAL growth can exhaust disk even when individual stream limits are properly configured.

Common causes

  • Stalled follower preventing log truncation. The Raft leader can’t truncate the WAL past the oldest uncommitted entry across all followers. If one follower is disconnected, slow, or stuck, the leader’s WAL retains all entries since the follower last acknowledged. In severe cases — a follower down for days — this means the entire WAL from that point forward is retained.

  • No active consumers advancing the commit index. For some Raft group types, the commit index advances as consumers acknowledge messages. If a stream has no active consumers, the WAL accumulates entries that are never compacted because the state machine’s applied index doesn’t advance.

  • Raft group with no leader. A leaderless Raft group can’t perform compaction. Entries accumulate on all replicas but no compaction or snapshotting occurs. Check for leaderless groups (OPT_SYS_009) as a root cause.

  • Disk I/O contention slowing compaction. If the underlying storage is slow (saturated IOPS, degraded RAID array, noisy neighbor on shared storage), WAL writes succeed but compaction can’t keep up. The WAL grows faster than it shrinks.

  • Large message payloads. Streams receiving messages with large payloads (>100 KB each) generate proportionally larger WAL entries. Combined with any compaction delay, the WAL grows rapidly in absolute terms.

  • Snapshot failures. Raft periodically creates snapshots to allow WAL truncation. If snapshotting fails (due to disk space, I/O errors, or internal errors), the WAL can’t be truncated and grows indefinitely.

How to diagnose

Check WAL sizes across the cluster

Terminal window
# List Raft group sizes (requires server access)
nats server report jetstream --json | jq '.[] | {server: .name, storage_used: .stats.store, reserved_storage: .stats.reserved_storage}'

For direct filesystem inspection on the server node:

Terminal window
# Check WAL directory sizes
du -sh /path/to/jetstream/*/streams/*/raft/

WAL files are stored in the raft/ subdirectory of each stream. Compare the WAL size to the stream’s actual message storage.

Identify which Raft groups are affected

Terminal window
# Check stream health and look for peers catching up
nats stream report

Look for streams showing replicas in catching up state for extended periods — these are the most likely to have WAL accumulation on the leader.

Check for stalled replicas

Terminal window
nats stream info MY_STREAM --json | jq '.cluster.replicas[] | {name: .name, current: .current, lag: .lag, active: .active}'

A replica with current: false and a high lag value indicates a stalled follower that’s preventing WAL truncation on the leader.
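This check is easy to script. The JSON below mimics the shape of `nats stream info --json` output — the field names are real, but the replica values are invented for illustration:

```python
import json

# Sample data in the shape of `nats stream info MY_STREAM --json` output;
# the replica values here are made up for illustration.
stream_info = json.loads("""
{
  "cluster": {
    "name": "c1",
    "leader": "n1",
    "replicas": [
      {"name": "n2", "current": true,  "active": 150000,   "lag": 0},
      {"name": "n3", "current": false, "active": 86400000, "lag": 524288}
    ]
  }
}
""")

# A replica that is not current and carries a large lag is the classic
# signature of a stalled follower pinning the leader's WAL.
stalled = [
    r["name"]
    for r in stream_info["cluster"]["replicas"]
    if not r["current"] and r.get("lag", 0) > 10000
]
print(stalled)  # ['n3']
```

Piping the real CLI output through a script like this lets you scan every stream rather than inspecting them one at a time.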

Compare storage usage to configured limits

Terminal window
# Check JetStream storage usage
nats server report jetstream
# Compare reserved vs. used — WAL growth shows as used > expected

If storage used is significantly higher than the sum of all stream max_bytes settings, WAL growth is likely the cause.
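That comparison is simple arithmetic. The figures below are invented; in practice you would collect them from `nats server report jetstream --json` and each stream's configured max_bytes:

```python
# Invented example figures; in practice, collect storage_used from
# `nats server report jetstream --json` and max_bytes from each stream's config.
GIB = 1024**3
storage_used = 60 * GIB
stream_max_bytes = {"ORDERS": 10 * GIB, "EVENTS": 20 * GIB, "AUDIT": 5 * GIB}

configured_total = sum(stream_max_bytes.values())  # 35 GiB of configured limits
excess = storage_used - configured_total           # 25 GiB unaccounted for

if excess > 0:
    print(f"storage used exceeds configured stream limits by {excess / GIB:.0f} GiB"
          " -- suspect WAL growth")
```

Streams without a max_bytes limit make this estimate fuzzier; for those, compare against the stream's reported state bytes instead.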

Check server logs for compaction errors

Terminal window
grep -i "raft\|snapshot\|compact\|wal" /var/log/nats/nats-server.log | tail -50

Look for errors related to snapshot creation, compaction failures, or disk I/O issues.

How to fix it

Immediate: prevent disk exhaustion

If disk usage is critical, identify and address the largest WAL first:

Terminal window
# Find the largest WAL directories
find /path/to/jetstream -name "raft" -type d -exec du -sh {} \; | sort -rh | head -10

Do not manually delete WAL files. This will corrupt the Raft state and can cause data loss. WAL recovery must be handled through the NATS server’s own mechanisms.

Restore stalled followers

If a follower is stalled, restoring it allows the leader to truncate the WAL:

Terminal window
# Verify the peer is actually running. Note that `nats server ping` pings
# every reachable server and cannot target a single peer (its --id flag only
# toggles whether server IDs appear in the output), so check the peer directly:
nats server check connection --name <peer_name>
# If the peer is up but a stream replica is stalled, force a stream peer
# removal — the cluster will replace it according to placement policy
nats stream cluster peer-remove MY_STREAM <stalled_peer>

After removing the stalled peer, the leader can truncate the WAL entries that were waiting for that follower. Then re-add the peer:

Terminal window
nats stream edit MY_STREAM --replicas 3

The new replica will catch up via snapshot transfer rather than WAL replay, avoiding the accumulated WAL issue.

Force a Raft snapshot

If the WAL is large but the stream is otherwise healthy, a leader step-down can trigger snapshot creation:

Terminal window
nats stream cluster step-down MY_STREAM

The new leader will create a fresh snapshot as part of the leadership transition, allowing WAL truncation.

Handle the restart loop

If a server is caught in an OOM restart loop due to WAL replay:

  1. Increase available memory temporarily — if running in a container, increase the memory limit
  2. Set GOMEMLIMIT — this helps Go’s GC operate more efficiently during WAL replay:
    Terminal window
    GOMEMLIMIT=8GiB nats-server -c nats-server.conf
  3. Move the affected stream’s Raft directory — as a last resort, move the WAL files aside and let the node recover without that stream. The stream will be reconstructed from peers:
    Terminal window
    # CAUTION: only do this if other replicas are healthy
    mv /path/to/jetstream/streams/MY_STREAM/raft /tmp/MY_STREAM_raft_backup

Contact support for large WALs

If any WAL exceeds 50 GiB, contact Synadia support before attempting recovery. Large WAL recovery requires careful orchestration to avoid data loss, and the support team has tooling for safe WAL compaction.

Prevent future WAL growth

// Monitor WAL health programmatically by watching for stalled replicas
js, _ := nc.JetStream()
for _, name := range streamNames {
	info, err := js.StreamInfo(name)
	if err != nil {
		continue
	}
	if info.Cluster == nil {
		continue // R1 streams carry no cluster/replica info
	}
	for _, r := range info.Cluster.Replicas {
		if !r.Current && r.Lag > 10000 {
			log.Printf("WARNING: stream %s replica %s is stalled (lag: %d)",
				name, r.Name, r.Lag)
		}
	}
}
import nats

nc = await nats.connect()
js = nc.jetstream()

for name in stream_names:
    info = await js.stream_info(name)
    if info.cluster and info.cluster.replicas:
        for r in info.cluster.replicas:
            if not r.current and r.lag > 10000:
                print(f"WARNING: stream {name} replica {r.name} stalled (lag: {r.lag})")

Set up alerting on replica lag and offline peers to catch stalled followers before the WAL accumulates significantly.

Frequently asked questions

How large should a healthy WAL be?

A healthy WAL is typically a small fraction of the stream’s data size — usually under 1 GiB for most workloads. The exact size depends on write rate and compaction frequency. If the WAL is larger than the stream’s actual message storage, something is preventing compaction.

Can I increase max_bytes to accommodate WAL growth?

max_bytes on the stream config limits message storage, not WAL storage. The WAL is outside the stream’s configured limits. Increasing max_bytes won’t help with WAL growth. You need to address the root cause preventing WAL compaction.

Does WAL size affect publish latency?

Not directly during normal operation — the WAL is append-only, and writes are fast. But WAL growth is a symptom of compaction issues, which often correlate with stalled replicas or disk I/O problems. These underlying issues do affect publish latency through Raft commit delays.

Will the WAL shrink on its own once the stalled follower recovers?

Yes. Once all followers are caught up and acknowledge entries, the leader can truncate the WAL. Compaction will bring the WAL back to its normal size. If the stalled follower can’t recover (e.g., the node is permanently lost), removing it from the peer set allows truncation.

Does Insights alert before disk is full?

Yes. Synadia Insights monitors WAL size both as an absolute value and as a percentage of js_max_store. The check triggers well before disk exhaustion, giving you time to investigate and remediate. For critical WAL sizes (>50 GiB), the alert escalates to critical severity.

Proactive monitoring for excessive NATS Raft WAL size with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial