JetStream storage on this server has reached critical levels relative to the configured max_store limit. At this threshold, Raft WAL writes are at imminent risk of failure. When WAL writes fail, streams lose quorum and become unavailable — this is not a gradual degradation but a hard stop for all JetStream operations on the affected server.
The max_store configuration sets a hard ceiling on how much disk space JetStream is allowed to use. Unlike reservation-based pressure (SERVER_017), which tracks how much storage has been promised to streams, this check tracks how much storage is actually consumed on disk. The distinction matters because actual usage includes operational overhead that reservations don’t account for: Raft WAL files, stream metadata, consumer state files, compaction temporary files, and snapshot data.
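The reserved-vs-actual distinction is easy to make concrete with a small calculation. In this sketch, reserved bytes stand in for the sum of stream `max_bytes` reservations and used bytes for what is actually on disk; the function and field names are illustrative, not part of any NATS API:

```python
GIB = 1024**3

def storage_pressure(reserved_bytes: int, used_bytes: int, max_store: int) -> dict:
    """Compare reservation-based pressure (SERVER_017-style) with
    actual-usage pressure (this check) against the same max_store."""
    return {
        "reserved_pct": reserved_bytes / max_store * 100,
        "used_pct": used_bytes / max_store * 100,
    }

# A server can look safe by reservations while actual usage is critical,
# because WAL files, consumer state, and snapshots count against the disk:
p = storage_pressure(reserved_bytes=60 * GIB, used_bytes=97 * GIB, max_store=100 * GIB)
print(f"reserved {p['reserved_pct']:.0f}%, used {p['used_pct']:.0f}%")
```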
When actual disk usage hits the max_store limit, every write operation fails with ENOSPC or an equivalent storage-full error. The most immediate casualty is Raft consensus. Every message publish, consumer acknowledgment, and stream metadata update writes to a Raft WAL file. When the WAL write fails, the Raft group cannot commit the operation. The stream transitions to a state where it cannot accept new messages and cannot process acknowledgments. If the server hosts leaders for multiple streams, all of them fail simultaneously.
The cascade doesn’t stop at individual streams. When enough Raft groups on a server fail, the server’s JetStream subsystem marks itself as unhealthy. The meta-leader may attempt to move stream leaders away from the unhealthy server, but this process itself generates Raft operations that may also fail. In the worst case, the entire JetStream cluster becomes destabilized as leadership elections and rebalancing operations pile up.
This check (JETSTREAM_016) escalates from SERVER_019, which warns when storage is approaching the limit. If you’re seeing JETSTREAM_016, you’ve passed the warning stage. Immediate action is required to prevent or recover from stream outages.
Streams without retention limits. Streams configured with max_msgs: -1, max_bytes: -1, and max_age: 0 grow without bound. A single unbounded stream under sustained publish load can consume all available storage. This is the most common cause by far.
Raft WAL accumulation. Raft WAL files grow between snapshots. Under high write throughput, WAL files can consume significant disk space. If snapshot creation is delayed — due to CPU pressure, slow disk I/O, or a bug — WAL files accumulate faster than they’re compacted.
Multiple R3/R5 streams on the same server. Each replica on a server consumes disk space independently. A server hosting replicas for many streams multiplies storage consumption. If stream placement isn’t balanced across the cluster, one server may bear a disproportionate storage load.
Message size growth without corresponding limit adjustment. Publishers start sending larger payloads (e.g., embedding images, adding metadata fields). A stream configured with max_msgs: 1000000 suddenly consumes 10x more bytes because average message size increased from 1KB to 10KB.
Compaction backlog. The filestore compacts old message blocks to reclaim space from deleted or expired messages. If compaction falls behind — due to I/O contention or CPU starvation — disk usage reflects pre-compaction sizes even though messages have been logically deleted.
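The message-size-growth cause above is worth sanity-checking with quick arithmetic: a 10x jump in average payload size means a 10x jump in bytes under the same `max_msgs` cap. A minimal sketch (the helper is illustrative):

```python
def stream_bytes(max_msgs: int, avg_msg_size: int) -> int:
    """Approximate on-disk message data for a stream capped only by count."""
    return max_msgs * avg_msg_size

# 1M messages at 1 KiB vs 10 KiB average payload:
before = stream_bytes(1_000_000, 1024)
after = stream_bytes(1_000_000, 10 * 1024)
print(f"before: {before / 1024**3:.2f} GiB, after: {after / 1024**3:.2f} GiB")
# before: 0.95 GiB, after: 9.54 GiB
```

A count-only limit silently converts payload growth into disk growth, which is why pairing `max_msgs` with `max_bytes` is safer.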
```shell
nats server report jetstream
```

Look for the storage columns. The affected server will show actual usage near or exceeding the configured limit:
```
╭─────────────────────────────────────────────────────────╮
│                   JetStream Summary                     │
├─────────┬─────────┬───────────┬───────────┬─────────────┤
│ Server  │ Streams │ Store     │ Max Store │ Store %     │
├─────────┼─────────┼───────────┼───────────┼─────────────┤
│ srv-1   │ 42      │ 98.7 GiB  │ 100 GiB   │ 98.7% ⚠     │
│ srv-2   │ 38      │ 62.1 GiB  │ 100 GiB   │ 62.1%       │
│ srv-3   │ 40      │ 71.3 GiB  │ 100 GiB   │ 71.3%       │
╰─────────┴─────────┴───────────┴───────────┴─────────────╯
```

```shell
# List streams on the affected server sorted by storage
nats stream list --server srv-1 --sort in-bytes

# Find streams with no retention limits at all
nats stream list --json | jq '.[] | select(.config.max_bytes == -1 and .config.max_msgs == -1 and .config.max_age == 0) | .config.name'
```

Unbounded streams are the primary suspects for runaway storage growth.
```shell
# On the affected server, check WAL file sizes
du -sh /path/to/jetstream/$ACCOUNT/streams/*/raft/
```

WAL directories consuming more than a few hundred MB per stream may indicate compaction delays.
```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	resp, err := nc.Request("$SYS.REQ.SERVER.PING.JSZ", nil, 2*time.Second)
	if err != nil {
		log.Fatal(err)
	}

	var jsInfo struct {
		Data struct {
			Store  uint64 `json:"storage"`
			Config struct {
				MaxStore int64 `json:"max_storage"`
			} `json:"config"`
		} `json:"data"`
		Server struct {
			Name string `json:"name"`
		} `json:"server"`
	}
	if err := json.Unmarshal(resp.Data, &jsInfo); err != nil {
		log.Fatal(err)
	}

	if jsInfo.Data.Config.MaxStore > 0 {
		usagePct := float64(jsInfo.Data.Store) / float64(jsInfo.Data.Config.MaxStore) * 100
		if usagePct > 95 {
			fmt.Printf("CRITICAL: server %s at %.1f%% storage\n",
				jsInfo.Server.Name, usagePct)
		}
	}
}
```

```python
import asyncio
import json

import nats


async def check_storage_critical():
    nc = await nats.connect()

    resp = await nc.request("$SYS.REQ.SERVER.PING.JSZ", b"", timeout=2)
    data = json.loads(resp.data)

    store = data["data"]["storage"]
    max_store = data["data"]["config"]["max_storage"]

    if max_store > 0:
        pct = (store / max_store) * 100
        if pct > 95:
            print(
                f"CRITICAL: {data['server']['name']} at {pct:.1f}% "
                f"({store / (1024**3):.1f} GiB / {max_store / (1024**3):.1f} GiB)"
            )

    await nc.close()


asyncio.run(check_storage_critical())
```

Purge low-priority streams. Identify streams that hold non-critical, replayable, or expired data and purge them:
```shell
# Purge all messages from a stream (keeps the stream config)
nats stream purge LOW_PRIORITY_STREAM -f

# Purge all but the most recent 1000 messages
nats stream purge STREAM_NAME --keep 1000
```

Delete abandoned or unused streams. Check for streams with no recent publishes or consumer activity:
```shell
# Find inactive streams
nats stream list --json | jq '.[] | select(.state.messages == 0 or .state.last_ts < "2024-01-01") | .config.name'
```
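Where jq isn't available, the same filter can be applied in Python to the JSON emitted by `nats stream list --json`. The field names below mirror the jq query above; the sample data is fabricated for illustration:

```python
import json

def inactive_streams(stream_list_json: str, cutoff: str = "2024-01-01") -> list[str]:
    """Names of streams that are empty or haven't received a message since
    `cutoff` (RFC 3339 timestamps in a uniform format compare as strings)."""
    streams = json.loads(stream_list_json)
    return [
        s["config"]["name"]
        for s in streams
        if s["state"]["messages"] == 0 or s["state"]["last_ts"] < cutoff
    ]

# Example against a captured `nats stream list --json` snippet:
sample = json.dumps([
    {"config": {"name": "OLD_LOGS"}, "state": {"messages": 12, "last_ts": "2023-06-01T00:00:00Z"}},
    {"config": {"name": "ORDERS"}, "state": {"messages": 9000, "last_ts": "2025-01-15T09:30:00Z"}},
])
print(inactive_streams(sample))  # ['OLD_LOGS']
```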
```shell
nats stream delete ABANDONED_STREAM -f
```

Force Raft snapshots to reclaim WAL space. Triggering a leader step-down on streams with large WAL directories forces snapshot creation and WAL compaction:
```shell
nats stream cluster step-down STREAM_NAME
```

Increase max_store. If the underlying disk has available capacity beyond the configured limit, increase the JetStream configuration:
```
jetstream {
  max_mem: 4GB
  max_store: 200GB  # increased from 100GB
}
```

Reload the configuration:
```shell
nats server config reload <server-id>
```
```shell
nats stream edit STREAM_NAME --max-age 7d --max-bytes 10GB
```

Rebalance stream placement. If one server is overloaded while others have capacity, move stream replicas:
```shell
# Check storage distribution
nats server report jetstream

# Move a stream's leader to a server with more capacity
nats stream cluster step-down STREAM_NAME --preferred LESS_LOADED_SERVER
```

Enforce stream limits as policy. Use account-level JetStream limits to cap per-account storage and require all streams to have retention limits:
```
accounts {
  APP {
    jetstream {
      max_store: 50GB
      max_streams: 20
    }
  }
}
```

Monitor actual usage, not just reservations. Alerting on reservation-based pressure (SERVER_017) is necessary but not sufficient. Also alert on actual disk usage approaching max_store.
Capacity plan for Raft overhead. Budget 15-20% overhead beyond stream data for Raft WAL files, consumer state, and compaction scratch space. A server with a 100GB max_store should be planned around only 80-85GB of usable stream data.
Implement automated purge policies. For streams that hold transient data (logs, events, metrics), configure max_age to automatically expire old messages rather than relying on manual purges.
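The Raft-overhead budget from the capacity-planning item above can be captured in a small helper. A sketch, with the 15% lower bound as the default assumption:

```python
GIB = 1024**3

def usable_stream_capacity(max_store_bytes: int, raft_overhead: float = 0.15) -> int:
    """Bytes available for stream message data after reserving headroom for
    Raft WAL files, consumer state, and compaction scratch space."""
    return int(max_store_bytes * (1 - raft_overhead))

# A 100 GiB max_store leaves roughly 80-85 GiB for stream data:
print(usable_stream_capacity(100 * GIB) / GIB)   # 85.0 at 15% overhead
print(usable_stream_capacity(100 * GIB, 0.20) / GIB)  # 80.0 at 20% overhead
```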
SERVER_019 warns when storage is approaching the max_store limit. JETSTREAM_016 fires when storage has reached critical levels — typically above 95% — where Raft WAL write failures are imminent or already occurring. SERVER_019 is an early warning; JETSTREAM_016 is an active emergency.
Yes. Changing max_store in the configuration and sending a reload signal (nats server config reload <server-id> or kill -HUP <pid>) applies the new limit without a restart. No streams are disrupted.
Stream storage (nats stream list --sort in-bytes) only shows message data. Actual JetStream disk usage also includes Raft WAL files, consumer state, stream metadata, snapshot files, and compaction temporary files. These operational overheads can add 10-20% to the total, more under high write throughput.
Publishes to streams whose Raft group is on the affected server will receive -ERR or a negative acknowledgment (NAK) indicating the publish failed. The publisher’s retry logic determines what happens next — if the publisher retries and the storage issue is resolved quickly, no messages are lost. If the publisher drops the message, it’s gone.
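That retry behavior can be sketched as a simple backoff loop. Here `publish` is a stand-in for your client's publish call (for example, a JetStream publish that raises on a storage-full NAK); the helper and error handling are illustrative, not part of any NATS client API:

```python
import time

def publish_with_retry(publish, msg, attempts: int = 5, base_delay: float = 0.5) -> bool:
    """Retry a failing publish with exponential backoff.
    Returns True on success, False if the message was ultimately dropped."""
    for attempt in range(attempts):
        try:
            publish(msg)
            return True
        except Exception:
            # Storage-full NAKs are retryable: back off and try again.
            time.sleep(base_delay * 2 ** attempt)
    return False  # caller must decide: buffer, spill to disk, or drop

# Demo with a stub publisher that fails twice, then succeeds:
calls = {"n": 0}
def flaky_publish(msg):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("storage full")

print(publish_with_retry(flaky_publish, b"order", base_delay=0.001))  # True
```

Whether the final `False` branch loses data is a product decision; durable publishers typically spill to a local queue rather than drop.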
Both, depending on your situation. If one server is consistently over-utilized while others have headroom, redistribution is the faster fix. If all servers are approaching limits, you need more disk capacity or more aggressive retention policies across the cluster.