
NATS Memory Usage Outlier: What It Means and How to Fix It

Severity: Warning
Category: Saturation
Applies to: Cluster
Check ID: CLUSTER_001
Detection threshold: server memory exceeds 1.5× the cluster average

A memory usage outlier is a NATS server whose memory usage exceeds 1.5 times the cluster average. This imbalance signals uneven workload distribution, resource leaks, or configuration drift that, left unchecked, puts the outlier server at risk of OOM termination and degrades cluster resilience.

Why this matters

NATS clusters are designed to distribute load across all members. When one server consumes substantially more memory than its peers, the cluster is no longer balanced. That server becomes the weak link — the first to hit OS memory limits, the first to trigger OOM kills, the first to destabilize under a traffic spike. A server that’s already running at 1.5× the cluster average has far less headroom than its peers to absorb sudden load increases.

The failure mode is abrupt. When the OS kills a NATS server process for exceeding memory limits, every client connected to that server disconnects simultaneously. JetStream Raft groups lose a member, potentially losing quorum if another replica was already degraded. Route connections from the dead server drop, fragmenting the cluster until the process restarts. Because the outlier was handling a disproportionate share of work (that’s why it was using more memory), the impact of its loss is also disproportionate.

Memory outliers are also a leading indicator of deeper problems. A server that’s slowly diverging from its peers often has an underlying issue — a stream with runaway subject cardinality, a connection hotspot funneling too many clients to one node, or a JetStream cache that’s grown because of skewed access patterns. Catching the outlier early, before it crashes, gives you time to investigate and rebalance without impacting production traffic.

Common causes

  • Uneven stream placement. One server hosts more R1 streams or acts as leader for more R3 streams than its peers. Each stream’s message store, subject index, and metadata consume memory. If stream placement isn’t balanced — either through manual assignment or because placement tags aren’t configured — one server accumulates more state than others.

  • Connection hotspot. A disproportionate number of clients are connected to one server. Each connection consumes memory for the connection state, subscription interest, and pending message buffers. If clients are configured with a single server URL instead of the full cluster list, or a load balancer isn’t distributing evenly, one server bears the connection load.

  • Large pending buffers for slow consumers. When a client reads slowly, the server buffers messages in memory for that connection. A handful of slow consumers on one server can push its memory well above the cluster average, especially on high-throughput subjects. This often co-occurs with slow consumer warnings (SERVER_004).

  • JetStream cache divergence. NATS servers cache recently accessed stream data in memory. If one server handles most reads for a large stream (common with R1 streams or when consumer leaders are concentrated), its cache grows larger than peers that aren’t serving the same read load.

  • High subject cardinality on hosted streams. Streams with millions of unique subjects maintain per-subject indexes in memory. If one server hosts a stream with significantly higher subject cardinality than streams on other servers, its memory usage diverges.

  • Memory-backed streams concentrated on one server. Memory-type JetStream streams store all data in RAM. If several memory-backed streams happen to be placed on the same server — or their leaders are all on the same node — that server’s memory usage will be much higher than peers using file-backed storage. The sketch after this list shows a quick way to check for this.
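
Several of these causes come down to where streams and their leaders live. The following sketch inventories each stream’s storage type and current Raft leader; it assumes the nats CLI is configured with a context that can see the relevant streams and that jq is installed:

Terminal window
# For every stream: name, storage type (memory/file), and current leader
for s in $(nats stream ls --names); do
  nats stream info "$s" --json | jq -r '[.config.name, .config.storage, .cluster.leader] | @tsv'
done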

How to diagnose

Confirm the memory imbalance

List all servers and compare memory usage across the cluster:

Terminal window
nats server list

Look at the Mem column. Compare the outlier’s value against other servers in the same cluster. A server using 1.5× or more of the average warrants investigation.
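
To put numbers on the comparison, you can pull each server’s resident memory from its monitoring endpoint and compute the average yourself. A minimal sketch, assuming the default monitoring port 8222 is enabled, jq is installed, and the hosts are substituted for your own:

Terminal window
# Print each server's resident memory in bytes, then the cluster average
for h in s1:8222 s2:8222 s3:8222; do
  printf '%s\t%s\n' "$h" "$(curl -s "http://$h/varz" | jq '.mem')"
done | awk '{print; sum+=$2; n++} END {printf "average\t%.0f\n", sum/n}'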

For more detail:

Terminal window
nats server report jetstream

This shows per-server JetStream memory and storage usage, which often accounts for the bulk of the difference.

Check stream placement and leader distribution

Terminal window
nats stream report

Look at the Cluster column and the leader assignments. If the outlier server appears as leader for significantly more streams than its peers, that’s likely contributing to the imbalance.

To see all streams across accounts:

Terminal window
nats stream ls -a
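
To turn the report into a per-server leader count, you can aggregate leaders across all streams. A sketch, assuming jq and a context that can see the relevant streams:

Terminal window
# Count how many stream leaders each server currently holds
for s in $(nats stream ls --names); do
  nats stream info "$s" --json | jq -r '.cluster.leader'
done | sort | uniq -c | sort -rn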

Check connection distribution

Terminal window
nats server report connections

Compare connection counts across servers. If the outlier has 2–3× more connections than its peers, the connection overhead alone may explain the memory difference.

Check for slow consumers

Terminal window
curl -s 'http://localhost:8222/connz?sort=pending_bytes&limit=20' | jq '.connections[]'

Connections with high pending bytes are buffering messages in server memory. If these are concentrated on the outlier server, they’re contributing to the memory spike.
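
To make the output easier to scan, you can select just the relevant fields from each connection (same endpoint and default monitoring port as above):

Terminal window
curl -s 'http://localhost:8222/connz?sort=pending_bytes&limit=20' \
  | jq '.connections[] | {cid, name, ip, pending_bytes, in_msgs, out_msgs}'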

Inspect server-level memory breakdown

For a specific server, use the monitoring endpoint:

Terminal window
curl -s http://<outlier-host>:8222/varz | jq '{mem, connections, subscriptions, slow_consumers, jetstream: .jetstream}'

Compare the output against a healthy peer to identify which category is driving the difference.
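
One way to do that comparison side by side, assuming both monitoring endpoints are reachable from a bash shell (a sketch; substitute real hostnames):

Terminal window
# Diff the key /varz fields between the outlier and a healthy peer
diff \
  <(curl -s http://<outlier-host>:8222/varz | jq '{mem, connections, subscriptions, slow_consumers}') \
  <(curl -s http://<healthy-host>:8222/varz | jq '{mem, connections, subscriptions, slow_consumers}')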

How to fix it

Immediate: reduce memory pressure

If the outlier is approaching its memory limit and you need to act fast, redistribute stream leaders away from the overloaded server:

Terminal window
# Step down leaders on the outlier to trigger re-election on other servers
nats stream cluster step-down <stream-name>

Repeat for streams where the outlier is leader. This is non-disruptive — consumers reconnect to the new leader automatically.
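
If the outlier leads many streams, a loop saves repetition. A sketch, assuming jq is installed and that <outlier-server-name> is the server name shown in nats server list:

Terminal window
# Step down every stream whose current leader is the outlier
for s in $(nats stream ls --names); do
  leader=$(nats stream info "$s" --json | jq -r '.cluster.leader')
  [ "$leader" = "<outlier-server-name>" ] && nats stream cluster step-down "$s"
done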

If slow consumers are contributing, identify and address the worst offenders:

Terminal window
# Find connections with highest pending bytes
curl -s 'http://localhost:8222/connz?sort=pending_bytes&limit=20' | jq '.connections[]'

Consider temporarily disconnecting clients that are buffering excessive data on the outlier.

Short-term: rebalance workload

Redistribute connections. Ensure all clients are configured with the full cluster URL list so they connect to different servers on reconnect:

// Go client — provide all cluster servers
nc, err := nats.Connect("nats://s1:4222,nats://s2:4222,nats://s3:4222")

# Python (nats.py)
nc = await nats.connect(servers=["nats://s1:4222", "nats://s2:4222", "nats://s3:4222"])

Use placement tags to distribute streams. If streams are concentrating on one server, configure placement tags to spread them:

# In each server's configuration: tag the server, then tell JetStream to require
# distinct values of that tag prefix when placing replicas
server_tags: ["az:az1"]   # az:az2, az:az3 on the other servers

jetstream {
  unique_tag: "az:"
}

With unique_tag set, newly created replicated streams are spread across servers with distinct az: values automatically. To pin an individual stream to a particular group of tagged servers, add placement tags when creating or editing it (every tag listed must match the target servers’ tags):

Terminal window
nats stream edit <stream-name> --tag "az:az1"

Reduce memory-backed stream concentration. A stream’s storage type can’t be changed after creation, so where low-latency access isn’t critical, replace memory-type streams with file-backed ones and migrate publishers and consumers over:

Terminal window
# Storage type is not editable in place; create a file-backed replacement stream
nats stream add <new-stream-name> --storage file

Long-term: prevent recurrence

Monitor per-server memory continuously. Export the mem field from /varz to your monitoring stack and alert on divergence before Insights catches it.
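
A minimal cron-style sketch of that kind of alert, assuming bash, curl, and jq and a hard-coded host list (adapt the threshold and the notification mechanism to your stack):

Terminal window
#!/usr/bin/env bash
# Flag any server whose resident memory exceeds 1.5x the cluster average
hosts="s1:8222 s2:8222 s3:8222"
for h in $hosts; do
  curl -s "http://$h/varz" | jq -r --arg h "$h" '"\($h) \(.mem)"'
done | awk '{host[NR]=$1; mem[NR]=$2; sum+=$2}
  END {avg=sum/NR; for (i=1; i<=NR; i++) if (mem[i] > 1.5*avg)
    printf "memory outlier: %s uses %d bytes (cluster average %.0f)\n", host[i], mem[i], avg}'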

Automate leader rebalancing. Run periodic leader step-down operations to prevent leaders from accumulating on one server over time. The nats server report jetstream command shows leader counts per server — if they drift, redistribute.

Standardize stream placement. Use placement tags and organizational policies to ensure streams are created with balanced placement from the start. Document when R1 (single replica) is acceptable versus when R3 is required, and enforce it through account-level JetStream limits.
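
Account-level JetStream limits are set in the server configuration. A sketch of the documented form; the account name and values here are placeholders, not recommendations:

accounts {
  APP {
    jetstream {
      max_mem: 1G        # cap memory-backed stream usage for this account
      max_file: 50G      # cap file-backed storage
      max_streams: 100
      max_consumers: 1000
    }
  }
}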

Frequently asked questions

How much memory variance between cluster servers is normal?

Some variance is expected — servers handle different connections and cache different stream data. A difference of 10–20% is typical in a healthy cluster. The default threshold of 1.5× (50% above average) is intentionally generous to avoid false positives. If you consistently see one server 30%+ above its peers, investigate even if it hasn’t triggered the check.

Can a memory outlier cause other servers to fail?

Not directly — each server manages its own memory independently. However, if the outlier crashes from OOM, the sudden loss of its connections, routes, and Raft group memberships creates cascading work for the remaining servers. They inherit the reconnecting clients, participate in leader elections, and may themselves become memory-stressed if the workload was already uneven.

Does restarting the outlier server fix the problem?

Temporarily, yes — a restart clears JetStream caches and pending buffers. But if the underlying cause is uneven placement or connection distribution, the server will accumulate excess memory again. Restart only buys time; you need to address the root cause (rebalance streams, fix connection distribution, resolve slow consumers).

How do placement tags prevent memory outliers?

Placement tags let you constrain where stream replicas are placed. By assigning different tags to different servers (e.g., availability zones) and requiring streams to span multiple tags, you prevent the cluster from concentrating all stream leaders on one node. This distributes the memory load of stream state and caching more evenly.

Should I give the outlier server more memory instead of rebalancing?

Giving one server more memory than its peers masks the problem and creates a permanent imbalance. If that server goes down, its peers — with less memory — must absorb its workload. They’re less equipped to handle it. Rebalancing is almost always the better approach. The exception is if the server legitimately handles a different workload class (e.g., it’s the designated node for memory-backed streams), in which case the check threshold should be adjusted.

Proactive monitoring for NATS memory usage outliers with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial