
NATS JetStream Storage vs Configured Limit Critical: What It Means and How to Fix It

Severity: Critical
Category: Saturation
Applies to: JetStream
Check ID: JETSTREAM_016
Detection threshold: JetStream storage usage critically exceeds the configured max_store limit

JetStream storage on this server has reached critical levels relative to the configured max_store limit. At this threshold, Raft WAL writes are at imminent risk of failure. When WAL writes fail, streams lose quorum and become unavailable — this is not a gradual degradation but a hard stop for all JetStream operations on the affected server.

Why this matters

The max_store configuration sets a hard ceiling on how much disk space JetStream is allowed to use. Unlike reservation-based pressure (SERVER_017), which tracks how much storage has been promised to streams, this check tracks how much storage is actually consumed on disk. The distinction matters because actual usage includes operational overhead that reservations don’t account for: Raft WAL files, stream metadata, consumer state files, compaction temporary files, and snapshot data.

When actual disk usage hits the max_store limit, every write operation fails with ENOSPC or an equivalent storage-full error. The most immediate casualty is Raft consensus. Every message publish, consumer acknowledgment, and stream metadata update writes to a Raft WAL file. When the WAL write fails, the Raft group cannot commit the operation. The stream transitions to a state where it cannot accept new messages and cannot process acknowledgments. If the server hosts leaders for multiple streams, all of them fail simultaneously.

The cascade doesn’t stop at individual streams. When enough Raft groups on a server fail, the server’s JetStream subsystem marks itself as unhealthy. The meta-leader may attempt to move stream leaders away from the unhealthy server, but this process itself generates Raft operations that may also fail. In the worst case, the entire JetStream cluster becomes destabilized as leadership elections and rebalancing operations pile up.

This check (JETSTREAM_016) escalates from SERVER_019, which warns when storage is approaching the limit. If you’re seeing JETSTREAM_016, you’ve passed the warning stage. Immediate action is required to prevent or recover from stream outages.

Common causes

  • Streams without retention limits. Streams configured with max_msgs: -1, max_bytes: -1, and max_age: 0 grow without bound. A single unbounded stream under sustained publish load can consume all available storage. This is the most common cause by far.

  • Raft WAL accumulation. Raft WAL files grow between snapshots. Under high write throughput, WAL files can consume significant disk space. If snapshot creation is delayed — due to CPU pressure, slow disk I/O, or a bug — WAL files accumulate faster than they’re compacted.

  • Multiple R3/R5 streams on the same server. Each replica on a server consumes disk space independently. A server hosting replicas for many streams multiplies storage consumption. If stream placement isn’t balanced across the cluster, one server may bear a disproportionate storage load.

  • Message size growth without corresponding limit adjustment. Publishers start sending larger payloads (e.g., embedding images, adding metadata fields). A stream configured with max_msgs: 1000000 suddenly consumes 10x more bytes because the average message size increased from 1KB to 10KB. A rising bytes-to-messages ratio is the telltale sign; see the sketch after this list.

  • Compaction backlog. The filestore compacts old message blocks to reclaim space from deleted or expired messages. If compaction falls behind — due to I/O contention or CPU starvation — disk usage reflects pre-compaction sizes even though messages have been logically deleted.
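
If you suspect payload growth, a quick check is each stream's bytes-to-messages ratio from its reported state. Below is a minimal sketch using the nats.go JetStream API; the stream names are hypothetical placeholders, so substitute the streams hosted on the affected server.

package main

import (
    "fmt"
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Hypothetical stream names; replace with your own.
    for _, name := range []string{"ORDERS", "EVENTS"} {
        si, err := js.StreamInfo(name)
        if err != nil {
            log.Printf("stream %s: %v", name, err)
            continue
        }
        if si.State.Msgs == 0 {
            continue
        }
        // Average message size = stored bytes / message count.
        avgKiB := float64(si.State.Bytes) / float64(si.State.Msgs) / 1024
        fmt.Printf("%s: %d msgs, %.1f KiB average\n", name, si.State.Msgs, avgKiB)
    }
}

Tracking this ratio over time makes a jump in payload size obvious before it turns into a storage emergency.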

How to diagnose

Confirm storage utilization

Terminal window
nats server report jetstream

Look for the storage columns. The affected server will show actual usage near or exceeding the configured limit:

╭───────────────────────────────────────────────────╮
│                 JetStream Summary                 │
├────────┬─────────┬──────────┬───────────┬─────────┤
│ Server │ Streams │ Store    │ Max Store │ Store % │
├────────┼─────────┼──────────┼───────────┼─────────┤
│ srv-1  │ 42      │ 98.7 GiB │ 100 GiB   │ 98.7% ⚠ │
│ srv-2  │ 38      │ 62.1 GiB │ 100 GiB   │ 62.1%   │
│ srv-3  │ 40      │ 71.3 GiB │ 100 GiB   │ 71.3%   │
╰────────┴─────────┴──────────┴───────────┴─────────╯

Identify the largest consumers of storage

Terminal window
# List streams on the affected server sorted by storage
nats stream list --server srv-1 --sort in-bytes

Check for streams without limits

Terminal window
nats stream list --json | jq '.[] | select(.config.max_bytes == -1 and .config.max_msgs == -1 and .config.max_age == 0) | .config.name'

Unbounded streams are the primary suspects for runaway storage growth.

Check Raft WAL sizes

Terminal window
# On the affected server, check WAL file sizes
du -sh /path/to/jetstream/$ACCOUNT/streams/*/raft/

WAL directories consuming more than a few hundred MB per stream may indicate compaction delays.

Programmatic detection

Go

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    // Requires a connection authorized for the system account,
    // since $SYS.REQ.* endpoints are only visible there.
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    // Every server answers the PING; Request returns the first reply.
    resp, err := nc.Request("$SYS.REQ.SERVER.PING.JSZ", nil, 2*time.Second)
    if err != nil {
        log.Fatal(err)
    }

    var jsInfo struct {
        Data struct {
            Store  uint64 `json:"storage"`
            Config struct {
                MaxStore int64 `json:"max_storage"`
            } `json:"config"`
        } `json:"data"`
        Server struct {
            Name string `json:"name"`
        } `json:"server"`
    }
    if err := json.Unmarshal(resp.Data, &jsInfo); err != nil {
        log.Fatal(err)
    }

    if jsInfo.Data.Config.MaxStore <= 0 {
        return // no max_store limit configured
    }
    usagePct := float64(jsInfo.Data.Store) / float64(jsInfo.Data.Config.MaxStore) * 100
    if usagePct > 95 {
        fmt.Printf("CRITICAL: server %s at %.1f%% storage\n",
            jsInfo.Server.Name, usagePct)
    }
}
Python

import asyncio
import json
import nats

async def check_storage_critical():
    nc = await nats.connect()

    resp = await nc.request("$SYS.REQ.SERVER.PING.JSZ", b"", timeout=2)
    data = json.loads(resp.data)

    store = data["data"]["storage"]
    max_store = data["data"]["config"]["max_storage"]

    if max_store > 0:
        pct = (store / max_store) * 100
        if pct > 95:
            print(
                f"CRITICAL: {data['server']['name']} at {pct:.1f}% "
                f"({store / (1024**3):.1f} GiB / {max_store / (1024**3):.1f} GiB)"
            )

    await nc.close()

asyncio.run(check_storage_critical())

How to fix it

Immediate: free storage to prevent Raft failures

Purge low-priority streams. Identify streams that hold non-critical, replayable, or expired data and purge them:

Terminal window
# Purge all messages from a stream (keeps the stream config)
nats stream purge LOW_PRIORITY_STREAM -f
# Purge all but the most recent 1000 messages
nats stream purge STREAM_NAME --keep 1000

Delete abandoned or unused streams. Check for streams with no recent publishes or consumer activity:

Terminal window
# Find inactive streams
nats stream list --json | jq '.[] | select(.state.messages == 0 or .state.last_ts < "2024-01-01") | .config.name'
nats stream delete ABANDONED_STREAM -f

Force Raft snapshots to reclaim WAL space. Triggering a leader step-down on streams with large WAL directories forces snapshot creation and WAL compaction:

Terminal window
nats stream cluster step-down STREAM_NAME

Short-term: increase capacity or reduce load

Increase max_store. If the underlying disk has available capacity beyond the configured limit, increase the JetStream configuration:

nats-server.conf
jetstream {
  max_mem: 4GB
  max_store: 200GB  # increased from 100GB
}

Reload the configuration:

Terminal window
nats server config reload <server-id>

Add retention limits to unbounded streams. Every stream should have at least one limit — max_age, max_bytes, or max_msgs:

Terminal window
nats stream edit STREAM_NAME --max-age 7d --max-bytes 10GB

Rebalance stream placement. If one server is overloaded while others have capacity, move stream replicas:

Terminal window
# Check storage distribution
nats server report jetstream
# Move a stream's leader to a server with more capacity
nats stream cluster step-down STREAM_NAME --preferred LESS_LOADED_SERVER

Long-term: prevent recurrence

Enforce stream limits as policy. Use account-level JetStream limits to cap per-account storage and require all streams to have retention limits:

accounts {
  APP {
    jetstream {
      max_store: 50GB
      max_streams: 20
    }
  }
}

Monitor actual usage, not just reservations. Alerting on reservation-based pressure (SERVER_017) is necessary but not sufficient. Also alert on actual disk usage approaching max_store.
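
One probe can cover both signals by reading the reserved and actual figures from the same JSZ response used in the detection examples above. This is a minimal sketch in Go; it assumes the reserved_store field is exposed alongside storage in the JSZ data, and the 90% thresholds are illustrative.

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    resp, err := nc.Request("$SYS.REQ.SERVER.PING.JSZ", nil, 2*time.Second)
    if err != nil {
        log.Fatal(err)
    }

    var jsz struct {
        Server struct {
            Name string `json:"name"`
        } `json:"server"`
        Data struct {
            Store         uint64 `json:"storage"`        // bytes actually on disk
            ReservedStore uint64 `json:"reserved_store"` // bytes promised to streams
            Config        struct {
                MaxStore int64 `json:"max_storage"`
            } `json:"config"`
        } `json:"data"`
    }
    if err := json.Unmarshal(resp.Data, &jsz); err != nil {
        log.Fatal(err)
    }
    if jsz.Data.Config.MaxStore <= 0 {
        return // no limit configured, nothing to compare against
    }

    limit := float64(jsz.Data.Config.MaxStore)
    usedPct := float64(jsz.Data.Store) / limit * 100
    reservedPct := float64(jsz.Data.ReservedStore) / limit * 100
    fmt.Printf("%s: %.1f%% used, %.1f%% reserved of max_store\n",
        jsz.Server.Name, usedPct, reservedPct)

    // Alert on either signal: reservations (SERVER_017-style) or actual usage (this check).
    if usedPct > 90 || reservedPct > 90 {
        fmt.Println("WARNING: JetStream storage pressure")
    }
}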

Capacity plan for Raft overhead. Budget 15-20% overhead beyond stream data for Raft WAL files, consumer state, and compaction scratch space. A server with 100GB max_store should plan on roughly 80-85GB of usable stream data.

Implement automated purge policies. For streams that hold transient data (logs, events, metrics), configure max_age to automatically expire old messages rather than relying on manual purges.
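
If streams are created programmatically, retention can be baked into the stream definition so transient data never depends on a manual purge. A minimal sketch with the nats.go JetStream API; the stream name, subjects, and limits are illustrative assumptions.

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Illustrative stream: transient event data expires after 24h and is
    // capped at 10 GiB, whichever limit is reached first.
    _, err = js.AddStream(&nats.StreamConfig{
        Name:     "EVENTS",
        Subjects: []string{"events.>"},
        MaxAge:   24 * time.Hour,
        MaxBytes: 10 * 1024 * 1024 * 1024,
        Storage:  nats.FileStorage,
    })
    if err != nil {
        log.Fatal(err)
    }
}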

Frequently asked questions

How is this different from SERVER_019?

SERVER_019 warns when storage is approaching the max_store limit. JETSTREAM_016 fires when storage has reached critical levels — typically above 95% — where Raft WAL write failures are imminent or already occurring. SERVER_019 is an early warning; JETSTREAM_016 is an active emergency.

Can I increase max_store without restarting the server?

Yes. Changing max_store in the configuration and sending a reload signal (nats server config reload <server-id> or kill -HUP <pid>) applies the new limit without a restart. No streams are disrupted.

Why does actual usage exceed what my streams report?

Stream storage (nats stream list --sort in-bytes) only shows message data. Actual JetStream disk usage also includes Raft WAL files, consumer state, stream metadata, snapshot files, and compaction temporary files. These operational overheads can add 10-20% to the total, more under high write throughput.

What happens to in-flight publishes when storage is full?

Publishes to streams led by the affected server receive an error response instead of a publish acknowledgment (PubAck). What happens next depends on the publisher's retry logic: if the publisher retries and the storage issue is resolved quickly, no messages are lost; if the publisher drops the message, it's gone. The sketch below shows one way to handle this on the client side.
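
For publishers that must not drop messages, a simple retry loop around the JetStream publish call rides out short-lived storage outages. This is a minimal sketch with the nats.go JetStream API; the subject, retry count, and backoff are illustrative assumptions.

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    payload := []byte("example payload") // illustrative
    var lastErr error
    for attempt := 1; attempt <= 5; attempt++ {
        // Publish returns an error instead of a PubAck when the stream
        // cannot commit the message (e.g. storage full, lost quorum).
        if _, lastErr = js.Publish("ORDERS.created", payload); lastErr == nil {
            return // acknowledged by the stream
        }
        log.Printf("publish attempt %d failed: %v", attempt, lastErr)
        time.Sleep(time.Duration(attempt) * time.Second) // simple linear backoff
    }
    log.Fatalf("giving up after retries: %v", lastErr)
}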

Should I add disk capacity or redistribute streams?

Both, depending on your situation. If one server is consistently over-utilized while others have headroom, redistribution is the faster fix. If all servers are approaching limits, you need more disk capacity or more aggressive retention policies across the cluster.

Proactive monitoring for NATS JetStream storage vs configured limit with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial