
NATS JetStream Memory Utilization Critical: What It Means and How to Fix It

Severity: Critical
Category: Saturation
Applies to: JetStream
Check ID: JETSTREAM_007
Detection threshold: JetStream memory usage exceeds critical threshold (default: 95% of max_mem)

JetStream Memory Utilization Critical means a server’s JetStream memory usage has exceeded the critical threshold — memory-backed stream writes are failing or will fail imminently. This is the escalation beyond JetStream Memory Pressure (SERVER_005): the warning window has passed, and immediate action is required.

Why this matters

When JetStream memory utilization crosses the critical threshold, the server is at or past its max_mem reservation. Every publish to a memory-backed stream on this server receives a “no space” error. There is no graceful degradation — the transition is immediate and affects every memory-backed stream hosted on the server simultaneously.

The blast radius extends beyond the obvious. Publishers receiving errors may retry aggressively, increasing CPU and network load on the server without any possibility of success. Request-reply chains that depend on memory-backed streams for intermediate state break entirely. If the server hosts one replica of an R3 stream, the stream may still accept writes through the healthy replicas; but if two of the three replicas sit on memory-exhausted servers, the stream loses quorum for writes.

This condition does not self-resolve. Unlike file-backed storage pressure, where the OS page cache provides a buffer, memory-backed streams hold all data in RAM with no eviction mechanism. Until messages are explicitly removed — via consumer acknowledgment (workqueue retention), retention policy expiration (max_age), or manual purge — the memory remains consumed. Traffic continues arriving, errors continue accumulating, and the situation only worsens without intervention.
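One publisher-side mitigation for the retry storms described above is exponential backoff. The sketch below is illustrative and library-agnostic: `publish` stands in for whatever call your NATS client makes (it is not a real client API), and the timing parameters are assumptions to tune for your workload.

```python
import time

def publish_with_backoff(publish, msg, base=0.5, cap=30.0, max_attempts=6):
    """Retry a failed publish with exponential backoff instead of
    hammering a memory-exhausted server with immediate retries.
    `publish` is any callable that raises on failure."""
    delay = base
    for attempt in range(1, max_attempts + 1):
        try:
            return publish(msg)
        except Exception:
            if attempt == max_attempts:
                raise  # give up; surface the error to the caller
            time.sleep(delay)
            delay = min(delay * 2, cap)  # 0.5s, 1s, 2s, ... capped at 30s
```

Backoff does not create memory on the server; it only keeps the publisher from amplifying the incident while operators free memory.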

Common causes

  • Memory-backed streams without retention limits. Streams created with storage: memory and no max_bytes, max_age, or max_msgs limit grow without bound. This is the most common path to critical memory utilization — and the most preventable.

  • Ignored SERVER_005 warning. JetStream Memory Pressure (SERVER_005) fires at 90% utilization as a warning. If the warning is not addressed, memory continues climbing to the critical threshold. The gap between 90% and critical can close in minutes during traffic spikes.

  • Consumer outage on workqueue streams. Memory-backed workqueue streams retain messages until consumers acknowledge them. If the consuming application is down or stalled, messages accumulate in memory with no removal path. A 15-minute consumer outage at high publish rates can push memory from comfortable to critical.

  • Sudden traffic spike. A burst of publishes to memory-backed streams — batch jobs, backfills, incident-driven traffic — fills memory faster than any configured retention policy can clear. Retention policies operate on time (max_age) or count (max_msgs) boundaries that were tuned for steady-state, not peak.

  • Multiple memory-backed streams competing. Several memory-backed streams share the same server’s max_mem reservation. Each stream is individually within reasonable bounds, but their aggregate usage exceeds the reservation. No single stream is “the problem,” making it harder to identify the root cause.
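To get a feel for how quickly the warning-to-critical gap can close, a back-of-the-envelope headroom estimate helps. The function below is a simplification under stated assumptions (steady publish rate, fixed average message size), and the example numbers are hypothetical:

```python
def minutes_until_full(used_bytes, max_mem_bytes, publish_msgs_per_s,
                       avg_msg_bytes, removal_msgs_per_s=0.0):
    """Estimate minutes until the max_mem reservation fills, assuming
    steady-state publish and removal rates."""
    net_bytes_per_s = (publish_msgs_per_s - removal_msgs_per_s) * avg_msg_bytes
    if net_bytes_per_s <= 0:
        return float("inf")  # removal keeps up; memory is not growing
    return (max_mem_bytes - used_bytes) / net_bytes_per_s / 60

# Hypothetical: 3.6 GiB used of a 4 GiB reservation, 5,000 msg/s at
# 2 KiB each, consumers down (no removal): under a minute of headroom.
print(minutes_until_full(int(3.6 * 1024**3), 4 * 1024**3, 5000, 2048))
```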

How to diagnose

Confirm memory utilization levels

Terminal window
nats server report jetstream

The MEM and MEM MAX columns show current usage and the reservation. Calculate the percentage — anything above the critical threshold (typically 95%) triggers this check.

Identify the largest memory-backed streams

Terminal window
nats stream report

Focus on streams with storage: memory. Sort by bytes to find the largest consumers of JetStream memory.

Check for stalled consumers on workqueue streams

Terminal window
nats consumer report <stream_name>

For each memory-backed stream with workqueue retention, check UNPROCESSED and ACK PENDING. High values indicate consumers aren’t removing messages, preventing memory reclamation.

Verify retention limits are configured

Terminal window
nats stream info <stream_name>

Check Max Age, Max Msgs, and Max Bytes. If all are unlimited (-1), the stream has no automatic removal — messages only leave when explicitly purged or consumed (for workqueue streams).
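A quick script can flag streams that fall into this no-limits state. The dictionary shape below is illustrative (hand-built from `nats stream info` output, using the CLI's -1-means-unlimited convention), not a real client API:

```python
def unbounded_memory_streams(streams):
    """Return names of memory-backed streams with no retention limit
    configured (-1 meaning unlimited, per the CLI output above)."""
    return [
        s["name"]
        for s in streams
        if s["storage"] == "memory"
        and s.get("max_age", -1) == -1
        and s.get("max_msgs", -1) == -1
        and s.get("max_bytes", -1) == -1
    ]

streams = [
    {"name": "PRICES", "storage": "memory",
     "max_age": -1, "max_msgs": -1, "max_bytes": -1},    # flagged
    {"name": "ORDERS", "storage": "memory",
     "max_age": 3600, "max_msgs": -1, "max_bytes": -1},  # has a limit
    {"name": "AUDIT", "storage": "file",
     "max_age": -1, "max_msgs": -1, "max_bytes": -1},    # file-backed
]
print(unbounded_memory_streams(streams))  # → ['PRICES']
```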

Monitor the /jsz endpoint for real-time data

Terminal window
curl http://localhost:8222/jsz | jq '{memory: .memory, reserved_memory: .reserved_memory}'
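The fields returned by that filter can feed a simple utilization calculation, for example in a cron-driven alert. A sketch, assuming `reserved_memory` is the capacity figure you compare against (on some deployments `config.max_memory` from the same endpoint is the authoritative limit):

```python
def memory_utilization_pct(jsz):
    """Percent of the memory reservation in use, from parsed /jsz JSON."""
    limit = jsz.get("reserved_memory", 0)
    if limit <= 0:
        return 0.0  # no reservation reported; nothing to compare against
    return 100.0 * jsz.get("memory", 0) / limit

sample = {"memory": 3_900_000_000, "reserved_memory": 4_294_967_296}
print(round(memory_utilization_pct(sample), 1))  # → 90.8 (warning zone)
```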

How to fix it

Immediate: free memory now

This is a critical-severity check. Act first, optimize later.

Purge the largest non-essential memory-backed stream:

Terminal window
# Identify the largest memory-backed stream
nats stream report
# Purge it
nats stream purge <stream_name> --force

Add max_age to streams without retention limits:

Terminal window
# This immediately starts expiring old messages
nats stream edit <stream_name> --max-age 1h

Convert the largest memory-backed stream to file storage. This requires creating a new stream and migrating data, but it permanently removes the memory pressure for that workload:

Terminal window
# Create a file-backed replacement that sources the existing stream.
# Omit --subjects for now: two streams in the same account cannot
# listen on overlapping subjects.
nats stream add <stream_name>-file \
--storage file \
--max-bytes 10GiB \
--max-age 24h \
--source <stream_name>
# Once the data has been copied, delete the old stream and take over
# its subjects
nats stream rm <stream_name>
nats stream edit <stream_name>-file --subjects "<same_subjects>"
# Then update publishers and consumers to use the new stream name

Short-term: prevent recurrence

Set max_bytes on every memory-backed stream. No memory-backed stream should be unbounded:

Go
js, _ := nc.JetStream()
_, err := js.AddStream(&nats.StreamConfig{
    Name:     "REALTIME_PRICES",
    Subjects: []string{"prices.>"},
    Storage:  nats.MemoryStorage,
    MaxBytes: 256 * 1024 * 1024, // 256 MiB hard cap
    MaxAge:   5 * time.Minute,
    Discard:  nats.DiscardOld,
})

Python
from nats.js.api import DiscardPolicy, StorageType, StreamConfig

await js.add_stream(StreamConfig(
    name="REALTIME_PRICES",
    subjects=["prices.>"],
    storage=StorageType.MEMORY,
    max_bytes=256 * 1024 * 1024,  # 256 MiB hard cap
    max_age=300,  # 5 minutes (seconds)
    discard=DiscardPolicy.OLD,
))

Increase max_mem if the server has available RAM. Update the server configuration:

jetstream {
    max_mem: 8GiB  # increased from 4GiB
}

Reload without restart:

Terminal window
nats-server --signal reload

Verify the reservation took effect:

Terminal window
nats server report jetstream

Restart stalled consumers. If workqueue stream consumers are down, restart them immediately. Every minute they’re offline, more messages accumulate in memory with no removal path.

Long-term: establish memory governance

Default to file-backed storage. Make file storage the team default. Memory-backed streams should require explicit justification: a documented latency requirement that file-backed storage cannot meet. In practice, SSD-backed file streams provide single-digit millisecond latency, which satisfies the vast majority of workloads.

Use account-level JetStream limits. Cap the total memory each account can consume, preventing any single tenant or team from exhausting server memory:

accounts {
    TEAM_A {
        jetstream {
            max_mem: 512MiB
            max_disk: 50GiB
        }
    }
}

Synadia Insights evaluates JetStream memory utilization every epoch — SERVER_005 fires at 90% as a warning, and this check (JETSTREAM_007) fires at the critical threshold to ensure the escalation is visible.

Frequently asked questions

What’s the difference between SERVER_005 and JETSTREAM_007?

SERVER_005 (JetStream Memory Pressure) fires at 90% of max_mem as a warning — “you’re approaching the limit, investigate and take action.” JETSTREAM_007 (JetStream Memory Utilization Critical) fires at the critical threshold — “writes are failing or about to fail, act immediately.” They form a warning-to-critical escalation pair for the same underlying resource.

Can I increase max_mem without restarting the server?

Yes. Update the max_mem value in the server configuration file and send a reload signal:

Terminal window
nats-server --signal reload

The server increases the reservation without dropping connections or interrupting traffic. However, you cannot decrease max_mem below current usage — the server will reject the reload.

Why doesn’t the server evict old messages automatically?

It does — if retention limits are configured. Streams with max_age, max_msgs, or max_bytes automatically remove messages that exceed those limits. The problem occurs when streams have no limits (all set to -1/unlimited), which means the only message removal paths are manual purge or consumer acknowledgment (workqueue retention). Setting at least one retention limit on every stream prevents unbounded growth.

How much physical RAM should I reserve for NATS beyond max_mem?

Plan for at least 2x max_mem in total available RAM. The NATS server process uses additional memory for connection buffers, subscription routing tables, Raft state, internal data structures, and Go runtime overhead. A server with max_mem: 4GiB may use 6–8 GiB of total process memory under load. Monitor actual process RSS alongside JetStream memory to right-size your servers.

Can critical memory utilization cause data loss?

Not for data already written to the stream — persisted messages remain intact and readable even when the server is at memory capacity. The risk is for new publishes: with discard: old, the oldest messages are silently removed, which is data loss if consumers haven’t processed them. With discard: new, publishes are rejected, which is data loss at the publisher side if the publisher doesn’t handle the error and retry elsewhere. Neither scenario loses messages that are already acknowledged by consumers.
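The two discard policies can be modeled in a few lines. This is a simplified illustration of the semantics described above, not server code:

```python
def publish_at_limit(messages, new_msg, max_msgs, discard):
    """Model what happens to a publish when a stream is at its limit.
    Returns the resulting message list and whether the publish succeeded."""
    if len(messages) < max_msgs:
        return messages + [new_msg], True
    if discard == "old":
        # The oldest message is silently dropped to admit the new one.
        return messages[1:] + [new_msg], True
    # discard == "new": the publish is rejected; the publisher must
    # handle the error (retry, reroute, or accept the loss).
    return messages, False

print(publish_at_limit(["m1", "m2", "m3"], "m4", 3, "old"))  # (['m2', 'm3', 'm4'], True)
print(publish_at_limit(["m1", "m2", "m3"], "m4", 3, "new"))  # (['m1', 'm2', 'm3'], False)
```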

Proactive monitoring for NATS JetStream Memory Utilization Critical with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial