
NATS JetStream Storage vs Configured Limit: What It Means and How to Fix It

Severity: Warning
Category: Saturation
Applies to: Server
Check ID: SERVER_019
Detection threshold: JetStream actual disk usage approaching the configured max_store limit

Actual JetStream disk usage on this server is approaching the configured max_store limit — the hard filesystem ceiling for JetStream data. Unlike the reservation-based pressure tracked by SERVER_017, exceeding max_store causes Raft WAL write failures that make streams unavailable.

Why this matters

JetStream has two storage boundaries, and understanding the difference is critical. The reserved storage (js_reserved_storage) is the sum of all stream storage allocations — it tracks how much space has been promised to streams. The max_store limit is the hard cap on how much disk space JetStream is allowed to actually use. SERVER_017 warns when reservations are nearly full. This check (SERVER_019) warns when actual bytes on disk approach the physical limit.

The distinction matters because reservations and actual usage can diverge significantly. Streams may reserve 100GB but only use 30GB if they haven’t filled up yet. Conversely, operational overhead — Raft WAL files, consumer state, internal metadata — consumes disk space outside of stream reservations. The max_store limit accounts for all of it. When actual usage hits max_store, everything that writes to disk fails.
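A small worked example makes the divergence concrete. All numbers below are hypothetical, chosen only to illustrate how the two metrics can disagree:

```python
# Hypothetical numbers illustrating the two boundaries; not from a live server.
GB = 1024 ** 3

max_store = 500 * GB            # hard cap on actual JetStream disk usage
reserved_storage = 100 * GB     # sum of stream allocations (SERVER_017's metric)
stream_data_on_disk = 30 * GB   # streams reserved 100GB but only hold 30GB so far
overhead_on_disk = 12 * GB      # Raft WALs, consumer state, internal metadata

actual_usage = stream_data_on_disk + overhead_on_disk

reserved_pct = reserved_storage / max_store * 100   # what SERVER_017 watches
actual_pct = actual_usage / max_store * 100         # what SERVER_019 watches

print(f"reserved: {reserved_pct:.0f}% of max_store")
print(f"actual:   {actual_pct:.1f}% of max_store")
```

Note that the operational overhead counts toward actual usage but not toward reservations, which is why the two percentages drift apart.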

Raft consensus is the first casualty. Every JetStream operation that modifies state — publishing a message, acknowledging a delivery, updating a consumer cursor — writes to a Raft WAL file. When the WAL write fails due to ENOSPC (no space left on device), the Raft group cannot make progress. The affected stream or consumer becomes unavailable. If enough Raft groups on the server fail simultaneously, the server’s JetStream subsystem transitions to an unhealthy state, compounding the impact. This is not a graceful degradation — it’s a hard stop.
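Because the failure is a hard stop, a simple host-side guard that warns before the filesystem fills is worth having. A minimal sketch using only the Python standard library (the directory below is a stand-in so the sketch runs anywhere; in production, point it at your configured store_dir):

```python
import shutil
import tempfile

# Warn before the filesystem backing the JetStream store directory fills up.
# tempfile.gettempdir() is a placeholder path; use your real store_dir instead.
store_dir = tempfile.gettempdir()

usage = shutil.disk_usage(store_dir)
used_pct = usage.used / usage.total * 100

print(f"{store_dir}: {used_pct:.1f}% of filesystem used")
if used_pct > 85:
    print("WARNING: free space before Raft WAL writes fail with ENOSPC")
```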

Common causes

  • max_store set too low for actual workload. The initial max_store configuration was based on estimates that underestimated actual data volumes. As streams grow and new streams are added, actual usage exceeds what was provisioned.

  • Raft WAL accumulation. Raft WAL files grow between compaction cycles. Under heavy write load, WAL files can accumulate faster than they are compacted, consuming disk space beyond what stream data alone would suggest. This overhead is not counted against stream reservations.

  • Stream data exceeding reservations. While JetStream enforces per-stream max_bytes limits, there are edge cases during rapid ingestion where actual on-disk usage temporarily exceeds the configured limit before retention enforcement catches up.

  • Consumer state files growing. Consumers with large pending acknowledgment sets maintain state on disk. Hundreds of consumers with large ack pending windows accumulate significant state data outside of stream data reservations.

  • Disk space consumed by non-JetStream data. If max_store was set to match the total disk size, other processes writing to the same filesystem (system logs, other applications, temporary files) reduce the space available to JetStream without changing the max_store configuration.

  • Snapshots and compaction lag. Raft snapshots replace WAL files, but the snapshot process requires temporarily having both the snapshot and the WAL on disk. If disk space is already tight, the snapshot itself may fail, preventing WAL cleanup and creating a vicious cycle.

How to diagnose

Check actual storage usage vs limit

Terminal window
nats server report jetstream

Compare the STORE (used) and STORE LIMIT columns for each server. When used storage approaches the limit, this check fires; the percentage used is the key metric.

Check disk-level usage

Terminal window
# On the server host, check filesystem usage
df -h /path/to/jetstream/data
# Check the JetStream store directory specifically
du -sh /path/to/jetstream/data/*

If the filesystem is fuller than max_store suggests, other data on the same filesystem is consuming space.

Identify largest consumers of disk space

Terminal window
nats stream report

Sort by storage to find the largest streams. Then check for WAL and metadata overhead:

Terminal window
# On the server host, check Raft WAL sizes
find /path/to/jetstream/ -name "*.wal" -exec du -sh {} + | sort -rh | head -20
# Check consumer state file sizes
find /path/to/jetstream/ -path "*/obs/*" -exec du -sh {} + | sort -rh | head -20

Compare reserved vs actual usage

Terminal window
nats server report jetstream --json | jq '.[] | {
  name: .name,
  reserved_store: .reserved_storage,
  used_store: .used_store,
  max_store: .config.max_store
}'

If used_store is significantly higher than reserved_store, the difference is Raft overhead, consumer state, and metadata. This is the space that catches operators by surprise.
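To flag this in a script rather than by eye, a sketch like the following can consume that JSON report. The payload here is a made-up sample shaped like the fields in the jq query above, not real server output, and the 70% flag threshold is an arbitrary example:

```python
import json

# Made-up sample shaped like `nats server report jetstream --json` output.
report = json.loads("""
[
  {"name": "nats-1", "reserved_storage": 107374182400,
   "used_store": 161061273600, "config": {"max_store": 214748364800}},
  {"name": "nats-2", "reserved_storage": 53687091200,
   "used_store": 21474836480, "config": {"max_store": 214748364800}}
]
""")

GB = 1024 ** 3
for srv in report:
    used = srv["used_store"]
    limit = srv["config"]["max_store"]
    overhead = used - srv["reserved_storage"]  # Raft WAL + consumer state + metadata
    pct = used / limit * 100
    flag = " <-- approaching max_store" if pct > 70 else ""
    print(f"{srv['name']}: {pct:.0f}% of max_store, "
          f"{overhead / GB:+.0f}GB vs reservations{flag}")
```

A positive "vs reservations" number is exactly the surprise overhead described above; a negative one means streams have not yet filled their allocations.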

How to fix it

Immediate: prevent Raft failures

Free disk space now. The priority is to get below the danger threshold before Raft WAL writes start failing:

Terminal window
# Purge streams with expendable data
nats stream purge <stream_name>
# Delete unused streams
nats stream rm <unused_stream_name>
# For work-queue streams, ensure consumers are running and acknowledging
nats consumer info <stream_name> <consumer_name>

Check for non-JetStream disk consumers. If other processes are filling the disk, address those independently:

Terminal window
# On the server host
du -sh /var/log/*
du -sh /tmp/*

Clean up log files, temporary data, or other non-JetStream files consuming disk space on the JetStream partition.

Short-term: increase capacity or reduce usage

Increase max_store. If the disk has physical capacity, raise the limit. This requires a server restart:

nats-server.conf
jetstream: {
  store_dir: "/data/jetstream"
  max_store: 2TB
}
Terminal window
# Restart required for max_store changes
nats-server --signal stop
nats-server -c /path/to/nats-server.conf

Add retention limits to unbounded streams. Streams without max_bytes or max_age grow indefinitely. Adding limits ensures automatic cleanup:

Terminal window
nats stream edit <stream_name> --max-bytes 100GB --max-age 14d
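Rather than guessing a --max-bytes value, you can size it from the stream's measured ingest rate and the retention window. The rate and margin below are placeholder assumptions; substitute your own measurements:

```python
# Size --max-bytes from a measured ingest rate and the retention window.
# The ingest rate below is a placeholder assumption, not a measurement.
GB = 1024 ** 3

ingest_bytes_per_sec = 100 * 1024   # ~100KiB/s sustained publish rate (assumed)
retention_days = 14                 # matches --max-age 14d

window_secs = retention_days * 24 * 3600
raw_bytes = ingest_bytes_per_sec * window_secs
max_bytes = int(raw_bytes * 1.2)    # ~20% margin for bursts and metadata

print(f"suggested --max-bytes: {max_bytes / GB:.0f}GB "
      f"for {retention_days}d retention")
```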

Enable compression. S2 compression reduces disk usage significantly:

Terminal window
nats stream edit <stream_name> --compression s2
// Go - monitor JetStream storage usage relative to limits
nc, _ := nats.Connect(url)
js, _ := nc.JetStream()

info, _ := js.AccountInfo()
usedPct := float64(info.Store) / float64(info.Limits.MaxStore) * 100
log.Printf("JetStream storage: %.1f%% used (%d / %d bytes)",
    usedPct, info.Store, info.Limits.MaxStore)

if usedPct > 85 {
    log.Println("WARNING: approaching max_store limit")
}
# Python - check storage usage against limit
import nats

async def check_storage_limit():
    nc = await nats.connect(server_url)
    js = nc.jetstream()

    info = await js.account_info()
    if info.limits.max_storage > 0:
        used_pct = (info.storage / info.limits.max_storage) * 100
        print(f"Storage: {used_pct:.1f}% of limit")
        if used_pct > 85:
            print("WARNING: approaching max_store limit")

    await nc.close()

Long-term: prevent recurrence

Separate JetStream data onto a dedicated filesystem. If JetStream shares a filesystem with other applications, a dedicated mount prevents external disk consumers from affecting JetStream:

# nats-server.conf - dedicated store directory on its own mount
jetstream: {
  store_dir: "/mnt/jetstream-data"
  max_store: 1800GB # Leave 10% headroom on a 2TB drive
}

Set max_store below actual disk capacity. Always leave headroom — at least 10% of total disk size — for Raft overhead, compaction, and unexpected growth:

Disk size: 2TB
max_store: 1.8TB (90% of disk)
Alert threshold: 90% of max_store (1.62TB actual usage)

Implement proactive capacity alerting. Alert well before reaching the limit; warning at around 85% of max_store leaves time to free space or expand capacity before writes start failing.

Balance storage across cluster members. If some servers are near their limit while others have headroom, rebalance stream placement. The JetStream Storage Skew check (OPT_BALANCE_003) identifies this imbalance. Use stream placement tags or move replicas to less-loaded servers.

Establish disk capacity reviews. Quarterly review JetStream disk usage trends, growth rates, and upcoming workload changes. Proactively expand storage or add servers before usage reaches warning levels.

Frequently asked questions

How is this different from SERVER_017 JetStream Storage Pressure?

SERVER_017 fires when reserved storage (the sum of all stream allocations) reaches 90%. It tracks how much space has been promised. SERVER_019 fires when actual disk usage approaches max_store. It tracks how much space is really being consumed. You can have SERVER_017 clear (reservations are fine) while SERVER_019 fires (actual disk usage is high due to Raft overhead or consumer state). SERVER_019 is the more dangerous condition because exceeding max_store causes immediate Raft failures.

What happens when actual usage exceeds max_store?

Raft WAL writes fail with disk-full errors. Any stream or consumer whose Raft group needs to write to the WAL becomes unavailable. Publishes fail, consumer deliveries stop, and leader elections cannot complete. The server may report its JetStream subsystem as unhealthy (triggering SERVER_014). Recovery requires freeing disk space or increasing max_store and restarting.

Can I change max_store without a restart?

No. The max_store configuration is read at server startup and cannot be changed via a configuration reload. You must stop and restart the server. This is why it’s important to set max_store with sufficient headroom initially — changing it is not a zero-downtime operation (though in a cluster, you can rolling-restart one server at a time).

How much overhead should I expect beyond stream data?

Raft WAL files, consumer state, and internal metadata typically add 5-15% overhead beyond raw stream data. Under heavy write load or with many consumers, overhead can reach 20%. When sizing max_store, use: max_store = (expected stream data × 1.2) + 10% headroom. For a server expecting 1TB of stream data: max_store = 1.2TB + 120GB = ~1.32TB, and a 1.5TB disk provides comfortable margin.
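That rule of thumb is easy to script. A minimal sketch of the same arithmetic, using the 1TB example from above:

```python
# Sizing rule from the text: max_store = (expected stream data x 1.2) + 10% headroom.
expected_stream_data_tb = 1.0       # example from the text: 1TB of stream data

with_overhead_tb = expected_stream_data_tb * 1.2   # Raft WAL + consumer state margin
max_store_tb = with_overhead_tb * 1.1              # plus 10% headroom

print(f"max_store should be about {max_store_tb:.2f}TB")
```

For the 1TB case this lands at about 1.32TB, matching the worked figure above, with a 1.5TB disk providing comfortable margin.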

Why did actual usage exceed reserved storage?

Reserved storage tracks stream-level allocations. Raft WAL files (which exist per stream and per consumer Raft group), consumer state files, metadata, and compaction temporary files all consume disk space outside of stream reservations. A server with 200 streams and 500 consumers has 700 Raft groups, each maintaining its own WAL. This overhead adds up, especially under heavy write load when WALs grow between compaction cycles.

Proactive monitoring for NATS jetstream storage vs configured limit with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial