
NATS JetStream Storage Utilization Critical: Preventing Storage Exhaustion

Severity: Critical
Category: Saturation
Applies to: JetStream
Check ID: JETSTREAM_012
Detection threshold: server-level JetStream storage usage exceeds the critical threshold

JetStream stores stream and consumer data on disk (or in memory). Each NATS server has a configured maximum storage capacity for JetStream (max_store). When actual storage usage on a server exceeds the critical threshold, the server is at imminent risk of storage exhaustion. Stream writes will fail with I/O errors, and the server may become unable to participate in Raft consensus for replicated streams.

Why this matters

Storage exhaustion is one of the most severe failure modes in a NATS deployment. Unlike memory pressure, which the OS can partially manage through swapping, disk exhaustion is a hard wall. When the filesystem fills up, every write operation fails — not just JetStream stream appends, but also Raft log entries, consumer acknowledgment state, and meta-group operations.

The failure cascade is rapid. First, stream writes fail. Producers receive errors or, in fire-and-forget mode, messages are silently lost. Then consumer state updates fail — acknowledgments can’t be persisted, causing redelivery loops when the server restarts. Raft log entries can’t be written, so the server falls behind on consensus and may lose leadership of streams it was leading. In the worst case, the server’s JetStream subsystem marks itself unhealthy and stops processing all JetStream operations.

Replicated streams amplify the problem. If a server holds replicas of 50 streams and runs out of storage, all 50 streams lose a replica simultaneously. If the cluster has R3 streams and two servers exhaust storage at the same time, those streams lose quorum and become unavailable for writes — even though the data may be safe on the remaining server.

The critical threshold exists to provide an action window. At this point, the server is not yet out of space, but the trajectory is clear. Without intervention — freeing storage, adding capacity, or reducing write rates — exhaustion is imminent. Every hour of inaction narrows the margin until it’s gone.

Recovery from actual storage exhaustion is painful. The server may need manual intervention to start — clearing Raft state, rebuilding stream indices, or even re-syncing replicas from healthy peers. Prevention is dramatically cheaper than recovery.

Common causes

  • Streams without max_bytes limits. Streams with no byte limit grow until the server’s storage is full. A single unlimited stream receiving steady traffic can consume all available capacity.

  • Retention periods longer than storage can sustain. A stream with max_age: 30d on a subject producing 1 GB/day needs 30 GB. If the server’s max_store doesn’t account for all such streams, the aggregate exceeds capacity.

  • Uncompressed large streams. Streams storing text-based messages (JSON, XML, logs) without S2 compression use 2–5x more storage than necessary.

  • Storage skew across cluster members. If stream placement or leader distribution is uneven, some servers hold disproportionately more data than others. One server exhausts storage while others have capacity to spare.

  • Raft snapshot and WAL accumulation. Raft write-ahead logs and snapshots consume storage alongside stream data. Under high write rates or when snapshot compaction falls behind, Raft overhead can be significant.

  • max_store set too low. The server’s JetStream storage limit doesn’t reflect the actual disk capacity or the expected workload. This is common when servers are provisioned with default or copied configurations.

  • Burst traffic without capacity planning. A data migration, backfill, or traffic spike writes more data than the system was provisioned for in a short time window.

How to diagnose

Check server-level storage utilization

Terminal window
nats server report jetstream

This shows each server’s JetStream storage usage, configured limit, and percentage used. Servers at or above the critical threshold are immediately visible.

For a specific server:

Terminal window
nats server report jetstream --host <server_name>

Identify the largest streams on the affected server

Terminal window
nats stream report

This lists all streams with their current byte usage, making it easy to identify which streams consume the most storage. Sort by bytes to find the top consumers.

For server-specific stream placement:

Terminal window
nats server report jetstream --json | jq '.[] | select(.name == "<server_name>") | .streams'

Check disk-level storage

The JetStream storage directory is typically under the server’s store_dir. Check actual filesystem usage:

Terminal window
# Check the JetStream data directory
du -sh /path/to/nats/jetstream/
# Check filesystem capacity
df -h /path/to/nats/jetstream/

If the filesystem is fuller than the JetStream utilization percentage suggests, non-JetStream data (logs, other applications) may be competing for the same disk.

Check for streams without limits

Terminal window
nats stream list --json | jq '[.[] | select(.config.max_bytes == -1)] | .[] | .config.name'

Unlimited streams are the primary risk factor for storage exhaustion because their growth is constrained only by the server’s total capacity.
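
The same filter can run in code; a fragment assuming the js handle and ctx from the audit program shown later in this section:

// Flag streams with no byte limit (-1 = unlimited) and report their size.
lister := js.ListStreams(ctx)
for si := range lister.Info() {
    if si.Config.MaxBytes == -1 {
        fmt.Printf("UNLIMITED: %s (currently %d bytes)\n", si.Config.Name, si.State.Bytes)
    }
}
if err := lister.Err(); err != nil {
    log.Fatal(err)
}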

Check the monitoring endpoint

Terminal window
# Check storage usage over the monitoring endpoint
curl -s http://localhost:8222/jsz | jq '{
  store_used: .storage,
  store_reserved: .reserved_storage,
  store_limit: .config.max_storage
}'

Track store_used over time to determine the growth rate and estimate time-to-exhaustion.
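
As a rough sketch of that estimate — assuming the monitoring endpoint on localhost:8222, linear growth, and an arbitrarily chosen one-minute sample window — the following Go program samples /jsz twice and projects time-to-exhaustion:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "time"
)

// jsz holds only the /jsz fields this sketch needs.
type jsz struct {
    Storage int64 `json:"storage"`
    Config  struct {
        MaxStorage int64 `json:"max_storage"`
    } `json:"config"`
}

func sample(url string) (jsz, error) {
    var j jsz
    resp, err := http.Get(url)
    if err != nil {
        return j, err
    }
    defer resp.Body.Close()
    err = json.NewDecoder(resp.Body).Decode(&j)
    return j, err
}

func main() {
    const url = "http://localhost:8222/jsz"
    const interval = time.Minute

    first, err := sample(url)
    if err != nil {
        log.Fatal(err)
    }
    time.Sleep(interval)
    second, err := sample(url)
    if err != nil {
        log.Fatal(err)
    }

    growth := second.Storage - first.Storage // bytes added during the window
    if growth <= 0 {
        fmt.Println("storage is flat or shrinking over the sample window")
        return
    }
    remaining := second.Config.MaxStorage - second.Storage
    eta := time.Duration(float64(remaining) / float64(growth) * float64(interval))
    fmt.Printf("used %d / %d bytes, growing %d bytes/min, ~%s to exhaustion\n",
        second.Storage, second.Config.MaxStorage, growth, eta.Round(time.Minute))
}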

Audit programmatically

package main

import (
    "context"
    "fmt"
    "log"
    "sort"

    "github.com/nats-io/nats.go"
    "github.com/nats-io/nats.go/jetstream"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222")
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := jetstream.New(nc)
    if err != nil {
        log.Fatal(err)
    }

    ctx := context.Background()
    info, err := js.AccountInfo(ctx)
    if err != nil {
        log.Fatal(err)
    }

    storePct := float64(info.Store) / float64(info.Limits.MaxStore) * 100
    fmt.Printf("JetStream storage: %d / %d bytes (%.1f%%)\n",
        info.Store, info.Limits.MaxStore, storePct)

    // Collect every stream with its current byte usage.
    lister := js.ListStreams(ctx)
    type streamSize struct {
        name  string
        bytes uint64
    }
    var streams []streamSize
    for si := range lister.Info() {
        streams = append(streams, streamSize{si.Config.Name, si.State.Bytes})
    }
    if err := lister.Err(); err != nil {
        log.Fatal(err)
    }

    // Sort descending by size so the largest streams print first.
    sort.Slice(streams, func(i, j int) bool {
        return streams[i].bytes > streams[j].bytes
    })

    fmt.Println("\nTop streams by storage:")
    for _, s := range streams {
        fmt.Printf("  %-30s %d bytes\n", s.name, s.bytes)
    }
}

How to fix it

Immediate: free storage now

Purge or delete low-priority streams. Identify streams that hold non-critical data and purge them:

Terminal window
# Purge all messages from a stream (keeps the stream config)
nats stream purge <stream_name>
# Delete the stream entirely (frees all storage and reservations)
nats stream delete <stream_name>

Prioritize streams that are large, unused, or hold easily reproducible data (logs, metrics, test data).

Purge messages beyond retention needs. For streams that must continue operating but hold more history than necessary:

Terminal window
# Keep only the latest 10,000 messages
nats stream purge <stream_name> --keep 10000
# Purge messages older than a specific sequence
nats stream purge <stream_name> --seq <sequence_number>
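
If purges need to be automated, the nats.go jetstream API exposes the same operations. A minimal sketch, reusing the connection boilerplate from the audit program above and a placeholder stream name:

// "ORDERS" is a placeholder; js and ctx come from the usual setup.
stream, err := js.Stream(ctx, "ORDERS")
if err != nil {
    log.Fatal(err)
}
// Keep only the newest 10,000 messages, discarding everything older.
if err := stream.Purge(ctx, jetstream.WithPurgeKeep(10_000)); err != nil {
    log.Fatal(err)
}
// WithPurgeSequence(seq) instead purges all messages below a given sequence.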

Truncate Raft WAL if oversized. In extreme cases, Raft write-ahead logs can consume significant space. The server compacts these automatically, but if compaction is falling behind, a server restart triggers immediate compaction.

Short-term: reduce ongoing storage consumption

Set max_bytes on all unlimited streams. This is the most critical configuration change to prevent recurrence:

Terminal window
nats stream edit <stream_name> --max-bytes 10GB

Set limits based on actual usage plus a reasonable growth buffer. Every stream should have a max_bytes value.

Enable S2 compression. For streams with compressible data (JSON, text, logs):

Terminal window
nats stream edit <stream_name> --compression s2

Compression typically reduces storage by 40–70%, providing immediate relief.

Add or shorten max_age retention. Reduce how long messages are retained:

Terminal window
nats stream edit <stream_name> --max-age 3d

Shorter retention windows reduce steady-state storage requirements proportionally.

Reduce replica counts where appropriate. Streams that don’t need R3 durability can be reduced to R1:

Terminal window
nats stream edit <stream_name> --replicas 1

This frees storage on two servers per stream. Only do this for streams where the durability trade-off is acceptable.
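
All four of the edits above can also be applied programmatically with a read-modify-update of the stream config. A sketch assuming the same connection setup as the audit program, with illustrative values and a placeholder stream name:

// Fetch the current config, adjust limits, and push the update.
stream, err := js.Stream(ctx, "ORDERS") // placeholder stream name
if err != nil {
    log.Fatal(err)
}
cfg := stream.CachedInfo().Config
cfg.MaxBytes = 10 * 1024 * 1024 * 1024    // 10 GiB byte cap
cfg.Compression = jetstream.S2Compression // enable S2 compression
cfg.MaxAge = 3 * 24 * time.Hour           // 3-day retention
cfg.Replicas = 1                          // only where the durability trade-off is acceptable
if _, err := js.UpdateStream(ctx, cfg); err != nil {
    log.Fatal(err)
}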

Long-term: prevent storage exhaustion

Increase max_store or add disk capacity. If the workload legitimately needs more storage:

nats-server.conf
jetstream {
  store_dir: /data/nats
  max_store: 500GB
}

Reload the server configuration:

Terminal window
nats server config reload <server-id>

Or provision larger disks and migrate the store directory.

Implement capacity planning. Calculate required storage: sum(max_bytes × replicas) across all streams, plus 20% overhead for Raft state and operational margin. Ensure max_store and physical disk exceed this sum.
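
A sketch of that arithmetic in Go — same connection boilerplate as above, treating each stream's max_bytes as a full per-replica reservation:

// Sum max_bytes × replicas across all streams, then add 20% overhead.
var required int64
lister := js.ListStreams(ctx)
for si := range lister.Info() {
    if si.Config.MaxBytes == -1 {
        log.Printf("stream %s has no max_bytes; its growth is unbounded", si.Config.Name)
        continue
    }
    required += si.Config.MaxBytes * int64(si.Config.Replicas)
}
if err := lister.Err(); err != nil {
    log.Fatal(err)
}
required += required / 5 // 20% margin for Raft state, consumer state, compaction
fmt.Printf("provision at least %d bytes of JetStream storage cluster-wide\n", required)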

Monitor and alert aggressively. Storage exhaustion is a preventable failure. Alert at 70% (warning) and 85% (critical).

Synadia Insights evaluates server-level JetStream storage utilization at every collection interval and alerts at the critical threshold, giving operators time to respond before writes fail.

Balance stream placement across servers. Use placement tags or preferred server placement to distribute streams evenly across the cluster, preventing individual servers from bearing disproportionate storage load.
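
For example, a stream can be pinned to tagged servers at creation time. A sketch assuming the target servers carry a matching server_tags entry in their configuration; the tag and stream names are illustrative:

// Replicas land only on servers tagged "storage-large" (illustrative tag).
_, err = js.CreateStream(ctx, jetstream.StreamConfig{
    Name:      "ORDERS", // placeholder
    Subjects:  []string{"orders.>"},
    Replicas:  3,
    MaxBytes:  10 * 1024 * 1024 * 1024,
    Placement: &jetstream.Placement{Tags: []string{"storage-large"}},
})
if err != nil {
    log.Fatal(err)
}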

Frequently asked questions

What’s the difference between storage pressure and storage utilization critical?

Storage pressure (JETSTREAM_008) is an early warning — storage is elevated but not yet dangerous. Storage utilization critical (this check) indicates the server is near the point of exhaustion where writes will fail. Think of pressure as the yellow light and critical as the red light.

Do purged messages free storage immediately?

Yes. Purging removes messages from disk immediately. The storage is reclaimed by the filesystem and the server’s JetStream storage usage counter decreases accordingly. There is no delayed garbage collection — the space is available for new writes immediately.

Can I move streams to a different server to free space?

Not directly. NATS doesn’t support live stream migration between servers. However, you can reduce a stream’s replica count (freeing space on removed replicas) and then increase it again with placement hints to add replicas on servers with more capacity. For R1 streams, you would need to recreate the stream on a different server.

What happens to in-flight publishes when storage is exhausted?

Publishes to streams fail with a JetStream API error indicating insufficient storage. For publishers using PublishAsync, the publish acknowledgment returns an error. For synchronous publishes, the error is returned directly. Core NATS publishes to subjects not consumed by JetStream are unaffected — storage exhaustion only impacts JetStream operations.
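
A sketch of both error paths, assuming an established js/ctx setup, a payload []byte, and a placeholder subject:

// Synchronous publish: a storage failure is returned directly.
if _, err := js.Publish(ctx, "orders.new", payload); err != nil {
    log.Printf("publish rejected: %v", err)
}

// Async publish: the failure arrives on the ack future instead.
ack, err := js.PublishAsync("orders.new", payload)
if err != nil {
    log.Fatal(err)
}
select {
case <-ack.Ok():
    // message persisted
case err := <-ack.Err():
    log.Printf("async publish rejected: %v", err)
}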

How much overhead should I reserve beyond stream data?

Reserve at least 15–20% of max_store beyond the sum of all stream max_bytes reservations. This accounts for Raft write-ahead logs, consumer state, meta-group operations, and temporary storage during compaction and snapshots. In write-heavy deployments, 25% overhead is safer.

Proactive monitoring for NATS JetStream storage utilization with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial