Checks/JETSTREAM_010

NATS Stream Byte Limit: Diagnosing and Resolving Storage Saturation

Severity
Warning
Category
Saturation
Applies to
JetStream
Check ID
JETSTREAM_010
Detection threshold
stream byte usage ≥ 90% of the configured max_bytes limit

A JetStream stream’s max_bytes setting caps how much storage the stream can use. When a stream reaches 90% or more of its byte limit, it is close to triggering its discard policy: either silently removing the oldest messages (DiscardOld) or rejecting new publishes (DiscardNew). This check flags streams approaching that threshold.

Why this matters

When a stream hits its byte limit, the behavior depends on the stream’s discard policy. With DiscardOld (the default), the server silently removes the oldest messages to make room for new ones. This works well for time-series data or logs where old data is expendable, but it’s dangerous for work queues or event sourcing where every message matters. Old messages vanish without any client notification — consumers that haven’t processed them yet lose data permanently.

With DiscardNew, the server rejects new publishes when the stream is full. Producers receive a maximum bytes exceeded error. This protects existing data but creates backpressure that propagates upstream — publishers retry, queues back up, and the system’s throughput drops to whatever rate consumers can drain the stream.

Neither outcome is desirable in production. Silent data loss causes correctness issues that surface hours or days later when someone notices missing events. Publish rejections cause availability issues that surface immediately as errors and timeouts.

The 90% threshold is the critical window. The stream isn’t broken yet, but at current write rates it will be soon. This is the time to act — right-size the limit, clean up stale data, or add retention policies — before the discard policy kicks in and the consequences become real.

In replicated streams, the byte limit applies to each replica independently, but the stream’s reported size is the leader’s perspective. A stream at 90% on the leader is at roughly 90% on every follower, meaning the problem affects the entire replica set simultaneously.

Common causes

  • No max_age retention. The stream retains messages indefinitely, growing until max_bytes is reached. This is the most common cause — the stream has a byte limit but no time-based expiry to naturally cycle out old data.

  • Publish rate exceeds consumption rate. Producers write faster than consumers can acknowledge and drain the stream, so the unconsumed backlog grows steadily toward the limit.

  • Large message payloads. At the same message rate, a stream receiving 100 KB messages fills its byte budget 100x faster than one receiving 1 KB messages. Payload size often increases without the byte limit being adjusted to match.

  • Burst traffic without headroom. The stream was sized for steady-state throughput but a batch import, backfill, or incident-driven surge pushes storage usage past the margin.

  • Consumer lag allowing message accumulation. Consumers are running but behind — processing messages at a rate slower than the publish rate. The stream accumulates a growing backlog that approaches the byte limit.

  • Byte limit set too low. The original max_bytes was a guess or a copy from another stream and doesn’t reflect the actual data volume for this workload.

How to diagnose

Check stream storage usage

nats stream info <stream_name>

Look at the Bytes field under State and compare it to Max Bytes in the configuration. Calculate the utilization percentage.

For all streams at once:

nats stream report

This shows each stream’s storage usage, message count, and configured limits. Streams near their byte limits are immediately visible.

Check the stream’s discard and retention policy

nats stream info <stream_name> --json | jq '{
max_bytes: .config.max_bytes,
current_bytes: .state.bytes,
retention: .config.retention,
discard: .config.discard,
max_age: .config.max_age,
compression: .config.compression
}'

The combination of retention, discard, and max_age determines what happens when the stream fills up. Understanding the current policy is essential before making changes.

Check publish vs. consume rate

# Sample the stream's current size
nats stream info <stream_name> --json | jq '{messages: .state.messages, bytes: .state.bytes}'

Run this repeatedly to see whether the stream is growing, stable, or draining. Alternatively, use the consumer report:

nats consumer report <stream_name>

A consumer with growing num_pending messages indicates the consumption rate is lower than the publish rate.

Inspect message sizes

nats stream view <stream_name> 10 | head -50

Review recent messages to understand payload sizes. If messages are larger than expected, payload compression or schema changes may be warranted.

Audit programmatically

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	// Walk every stream and flag those at or above 80% of max_bytes.
	ctx := context.Background()
	lister := js.ListStreams(ctx)
	for info := range lister.Info() {
		if info.Config.MaxBytes > 0 {
			pct := float64(info.State.Bytes) / float64(info.Config.MaxBytes) * 100
			if pct >= 80 {
				fmt.Printf("⚠️ %-30s %.1f%% full (%d / %d bytes)\n",
					info.Config.Name, pct, info.State.Bytes, info.Config.MaxBytes)
			}
		}
	}
	if err := lister.Err(); err != nil {
		log.Fatal(err)
	}
}

How to fix it

Immediate: free space now

Purge stale data. If the stream contains historical data that is no longer needed:

# Purge all messages
nats stream purge <stream_name>
# Purge messages older than a specific sequence
nats stream purge <stream_name> --seq 1000000
# Keep only the latest N messages
nats stream purge <stream_name> --keep 10000

Increase max_bytes. If the stream legitimately needs more space:

nats stream edit <stream_name> --max-bytes 50GB

Ensure the account has sufficient storage quota and the server has enough disk capacity before increasing.

Short-term: optimize storage efficiency

Enable S2 compression. NATS supports S2 compression (an extension of Snappy) for stream data, which typically reduces storage by 40–70% for text-based messages:

nats stream edit <stream_name> --compression s2

This reduces actual bytes stored without changing the logical message content. Compression is transparent to consumers.

Add max_age retention. Time-based expiry prevents unbounded growth by automatically removing messages older than the retention window:

nats stream edit <stream_name> --max-age 7d

Choose a retention period that balances data availability with storage constraints. For most operational streams, 3–14 days provides sufficient replay capability.

Reduce message payload sizes. If messages contain verbose formats (JSON with long field names, base64-encoded binary), consider switching to more compact encodings (Protocol Buffers, MessagePack, CBOR) at the application layer.

Long-term: design for sustainable growth

Set max_age on every stream. Even streams with generous byte limits should have a time-based retention policy. This ensures natural data cycling and prevents the “everything is retained forever” default from causing saturation.

Monitor byte utilization proactively. Alert at 80% to give the team time to respond before 90%.

Synadia Insights evaluates stream byte utilization automatically at every collection interval, alerting before the discard policy activates.

Right-size byte limits during stream creation. Estimate storage needs based on message size, publish rate, and retention period: max_bytes = avg_msg_size × msgs_per_second × retention_seconds × safety_factor. A 2x safety factor accommodates traffic bursts without triggering the limit.

Implement consumer health monitoring. Consumer lag is an early warning for stream saturation. If consumers fall behind and messages accumulate, the stream fills faster. Monitor num_pending per consumer and alert on sustained growth.

Frequently asked questions

What happens when a stream reaches max_bytes with DiscardOld?

The server automatically deletes the oldest messages to make room for new publishes. No error is returned to the publisher. Consumers that haven’t processed the deleted messages lose access to them permanently. There is no notification — the messages simply disappear from the stream.

What happens with DiscardNew instead?

New publishes are rejected with a maximum bytes exceeded error. Existing messages remain intact. This is safer for work queues where every message must be processed, but it means producers must handle publish failures and implement retry or buffering logic.

Does compression change the byte limit behavior?

Yes. With S2 compression enabled, the max_bytes limit applies to the compressed on-disk size. This means you can store more logical data within the same byte limit. A stream at 90% utilization without compression might drop to 50% after enabling S2, depending on message compressibility.

How does max_bytes interact with max_age?

Both limits are enforced simultaneously. A message is removed when it exceeds max_age OR when the stream exceeds max_bytes (whichever comes first). Setting both provides a double layer of protection: time-based expiry for normal operation, and byte-based limits as a safety cap for traffic spikes.

Should I use max_bytes or max_msgs or both?

Use max_bytes as the primary storage constraint — it directly maps to disk capacity. max_msgs is useful when the number of messages matters more than their total size (e.g., a bounded event log). Using both provides defense in depth: the stream is constrained by whichever limit is reached first.

Proactive monitoring for NATS stream byte limit with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial
Cancel anytime.