
NATS JetStream Storage Saturation with Skew: Resolving Unbalanced Capacity Exhaustion

Severity: Warning
Category: Saturation
Applies to: JetStream
Check ID: OPT_BALANCE_008
Detection threshold: server near JetStream storage capacity while cluster exhibits significant storage skew between nodes

A JetStream storage saturation with skew alert fires when a server is approaching its JetStream storage capacity and the cluster simultaneously exhibits significant storage utilization imbalance between nodes. This combination is dangerous: the saturated server cannot accommodate new streams or growth, but the cluster has available capacity on other servers that isn’t being used. The problem is solvable — but only if you actively redistribute the load.

Why this matters

Storage saturation on its own is serious. Storage saturation combined with skew means the cluster is failing to use its resources effectively, and the failure mode is worse than it appears.

The saturated server will reject new work. When a JetStream server reaches its max_storage limit, it cannot accept new stream replicas or allow existing streams to grow. Any publish to a stream whose leader is on the saturated server will fail with an “insufficient storage” error. This is a hard stop, not a gradual degradation.
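
Client code sees this failure on the publish acknowledgment. Below is a minimal sketch of catching it with the nats.go jetstream client; the connection URL, the subject, and the assumption that the error text contains “insufficient storage” are illustrative and may vary by client and server version.

package main

import (
    "context"
    "fmt"
    "strings"
    "time"

    "github.com/nats-io/nats.go"
    "github.com/nats-io/nats.go/jetstream"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        panic(err)
    }
    defer nc.Drain()

    js, err := jetstream.New(nc)
    if err != nil {
        panic(err)
    }

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // Publish waits for the PubAck; a saturated server rejects the write
    // outright instead of degrading gradually.
    _, err = js.Publish(ctx, "orders.new", []byte("payload"))
    if err != nil && strings.Contains(err.Error(), "insufficient storage") {
        // Assumed error text -- treat this as an alert rather than retrying blindly.
        fmt.Println("stream storage exhausted:", err)
    }
}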

Skew means the cluster has capacity but can’t use it. If all servers were equally utilized, saturation would mean the cluster genuinely needs more storage. But skew reveals that some servers have significant free capacity. The problem isn’t insufficient cluster-wide storage — it’s that the storage is in the wrong place.

Automatic rebalancing doesn’t exist. NATS does not automatically migrate stream replicas from full servers to empty ones. Streams stay where they were placed. Without manual intervention, the saturated server continues rejecting work while peers sit partially empty. This makes the problem permanent unless an operator acts.

The next failure is amplified. If the saturated server crashes (and servers under resource pressure are more crash-prone), the streams it hosted need to be recovered. If the crash corrupts storage or the server can’t restart, stream replicas that were on that server must be rebuilt on peers — peers that hopefully have the storage capacity to absorb them. With skew, the peers likely do have capacity, but only if the streams are actively migrated before the crisis.

Common causes

  • Initial stream placement without capacity awareness. Streams were created without placement tags or preferences, and the JetStream placement algorithm happened to favor one server — often the current meta leader or the server with the lowest latency at creation time. Over time, that server accumulated more streams than its peers.

  • Uneven stream growth rates. Different streams grow at different rates. A server that initially had a fair share of streams may become saturated if the streams it hosts are the highest-volume ones. Meanwhile, servers hosting lower-volume streams remain underutilized.

  • Server added to cluster without stream redistribution. A new server was added to increase cluster capacity, but existing streams weren’t rebalanced to use it. The new server sits empty while old servers remain full.

  • Replicas not spread evenly. R3 streams place replicas on 3 servers. If the cluster has 5 servers but replica placement consistently picks the same 3, those 3 servers fill up while the other 2 remain underutilized.

  • Heterogeneous storage configurations. Servers have different max_storage limits. A server with a smaller limit saturates faster even if it hosts the same number of streams as peers with larger limits. The skew is a side effect of mismatched provisioning.

How to diagnose

Check per-server storage utilization

Terminal window
# Show JetStream resource usage per server
nats server report jetstream

Look for servers where storage utilization is above 80% while other servers in the same cluster are below 50%. The gap between the highest and lowest utilization is your skew.

Identify which streams are on the saturated server

Terminal window
# Show streams with their placement and sizes
nats stream report
# Filter to streams whose leader is a specific server (replicas may also live elsewhere)
nats stream report --json | jq '.streams[] | select(.cluster.leader == "server-1") | {name: .name, bytes: .state.bytes}'

Quantify the skew

Terminal window
# Get storage stats for all servers
nats server report jetstream --json | jq '[.servers[] | {
  name: .name,
  used_gb: (.storage / 1073741824),
  reserved_gb: (.reserved_storage / 1073741824),
  pct: ((.storage / .reserved_storage) * 100)
}] | sort_by(-.pct)'

Monitor programmatically

Go

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "sort"
)

// JSZ holds the fields we need from each server's /jsz monitoring endpoint.
type JSZ struct {
    Streams  int   `json:"streams"`
    Store    int64 `json:"store"`
    Reserved int64 `json:"reserved_store"`
}

func getJSZ(host string) (*JSZ, error) {
    resp, err := http.Get(fmt.Sprintf("http://%s:8222/jsz", host))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    var jsz JSZ
    if err := json.Unmarshal(body, &jsz); err != nil {
        return nil, err
    }
    return &jsz, nil
}

func main() {
    servers := []string{"server-1", "server-2", "server-3"}

    type stat struct {
        name string
        pct  float64
        used int64
        free int64
    }

    var stats []stat
    for _, s := range servers {
        jsz, err := getJSZ(s)
        if err != nil {
            continue
        }
        pct := float64(jsz.Store) / float64(jsz.Reserved) * 100
        stats = append(stats, stat{
            name: s, pct: pct, used: jsz.Store,
            free: jsz.Reserved - jsz.Store,
        })
    }

    // Sort from most to least utilized.
    sort.Slice(stats, func(i, j int) bool {
        return stats[i].pct > stats[j].pct
    })

    for _, s := range stats {
        fmt.Printf("%-12s %5.1f%% used, %d GB free\n",
            s.name, s.pct, s.free/1024/1024/1024)
    }

    // Skew is the gap between the most and least utilized servers.
    if len(stats) >= 2 {
        skew := stats[0].pct - stats[len(stats)-1].pct
        fmt.Printf("\nSkew: %.1f pp between highest and lowest\n", skew)
    }
}
Python

import asyncio

import aiohttp


async def check_skew():
    servers = ["server-1", "server-2", "server-3"]
    stats = []

    async with aiohttp.ClientSession() as session:
        for server in servers:
            try:
                # Query each server's /jsz monitoring endpoint.
                async with session.get(
                    f"http://{server}:8222/jsz"
                ) as resp:
                    data = await resp.json()
                    used = data.get("store", 0)
                    reserved = data.get("reserved_store", 1)
                    pct = used / reserved * 100
                    stats.append({
                        "name": server,
                        "pct": pct,
                        "used_gb": used / 1024**3,
                        "free_gb": (reserved - used) / 1024**3,
                    })
            except Exception as e:
                print(f"Error reaching {server}: {e}")

    # Sort from most to least utilized.
    stats.sort(key=lambda s: -s["pct"])
    for s in stats:
        print(f"{s['name']:12s} {s['pct']:5.1f}% used, "
              f"{s['free_gb']:.1f} GB free")

    # Skew is the gap between the most and least utilized servers.
    if len(stats) >= 2:
        skew = stats[0]["pct"] - stats[-1]["pct"]
        print(f"\nSkew: {skew:.1f} pp between highest and lowest")


asyncio.run(check_skew())

How to fix it

Immediate: prevent the saturated server from running out of space

Stop new stream creation on the saturated server. Use placement tags to steer new streams to servers with available capacity:

Terminal window
# Create new streams on underutilized servers only
nats stream add NEW_STREAM \
  --subjects "new.>" \
  --replicas 3 \
  --tag role:storage-available

Purge or trim large streams on the saturated server. If any streams have expired or unnecessary data, reclaim space immediately:

Terminal window
# Check for streams with high storage that could be trimmed
nats stream report --json | jq '.streams[] | select(.state.bytes > 10737418240) | {name: .name, gb: (.state.bytes / 1073741824)}'
# Purge old data if retention allows
nats stream purge LARGE_STREAM --keep 1000000

Short-term: migrate streams to underutilized peers

Move stream replicas from the saturated server. Edit stream placement to include underutilized servers:

Terminal window
# Edit stream placement to prefer underutilized servers
nats stream edit ORDERS --tag region:us-east
# Or use peer-remove to move a replica off the saturated server
nats stream cluster peer-remove ORDERS server-1

After removing a peer, the stream will attempt to place a new replica on another server in the cluster. Verify it lands on an underutilized server:

Terminal window
nats stream info ORDERS

Migrate in batches, not all at once. Moving many streams simultaneously creates a replication storm that can overload the cluster. Move 1–2 streams at a time, wait for them to fully sync, then move the next batch:

Terminal window
# Move one stream, wait for sync, repeat
nats stream cluster peer-remove ORDERS server-1
# Wait for "current" status on all replicas
watch "nats stream info ORDERS --json | jq '.cluster.replicas[].current'"

Long-term: prevent skew from recurring

Use placement tags for capacity-aware distribution. Tag servers by their storage tier and use those tags when creating streams:

nats-server.conf
server_tags: ["storage:large", "region:us-east"]
Terminal window
# Create streams with placement preferences
nats stream add EVENTS \
  --subjects "events.>" \
  --replicas 3 \
  --tag storage:large

Implement periodic rebalancing. Schedule a recurring job that checks storage utilization across servers and migrates replicas when skew exceeds a threshold. This prevents gradual drift from turning into an emergency.
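
NATS itself has no scheduler for this, so the job has to live outside the servers. A minimal sketch in Go of what such a job could check, reusing the /jsz polling pattern from the diagnosis section; the server list, the 30 pp threshold, and the choice to only print a suggested peer-remove command (rather than execute it) are assumptions to adapt to your environment.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// jsz mirrors the fields used earlier from the /jsz monitoring endpoint.
type jsz struct {
    Store    int64 `json:"store"`
    Reserved int64 `json:"reserved_store"`
}

func utilization(host string) (float64, error) {
    resp, err := http.Get(fmt.Sprintf("http://%s:8222/jsz", host))
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()
    var z jsz
    if err := json.NewDecoder(resp.Body).Decode(&z); err != nil {
        return 0, err
    }
    if z.Reserved == 0 {
        return 0, fmt.Errorf("no reserved storage reported by %s", host)
    }
    return float64(z.Store) / float64(z.Reserved) * 100, nil
}

func main() {
    servers := []string{"server-1", "server-2", "server-3"} // assumed hosts
    const skewThreshold = 30.0                              // percentage points

    high, low := "", ""
    highPct, lowPct := -1.0, 101.0
    for _, s := range servers {
        pct, err := utilization(s)
        if err != nil {
            continue
        }
        if pct > highPct {
            high, highPct = s, pct
        }
        if pct < lowPct {
            low, lowPct = s, pct
        }
    }

    if skew := highPct - lowPct; skew > skewThreshold {
        // Suggest (not execute) a migration; an operator or a wrapper script
        // should move one stream at a time and wait for replicas to sync.
        fmt.Printf("skew %.1f pp (%s=%.1f%%, %s=%.1f%%) exceeds %.0f pp\n",
            skew, high, highPct, low, lowPct, skewThreshold)
        fmt.Printf("suggested: nats stream cluster peer-remove <STREAM> %s\n", high)
    }
}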

Standardize server storage configurations. Ensure all servers in the cluster have the same max_storage limit (or at least similar capacities). Heterogeneous configurations are a primary source of skew.
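
One way to keep the ceilings identical is to set the same JetStream limits in every server's configuration; the store directory and sizes below are illustrative values, not recommendations:

nats-server.conf
jetstream {
  store_dir: "/data/jetstream"
  max_file_store: 500GB
  max_memory_store: 2GB
}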

Monitor with Synadia Insights. Insights detects the combination of saturation and skew automatically, alerting you when the conditions exist — before the saturated server actually runs out of space. This gives you a window to act proactively rather than reactively.

Frequently asked questions

How much skew is too much?

As a rule of thumb, if the highest-utilized server is more than 30 percentage points above the lowest, the skew is significant enough to warrant rebalancing. For example, one server at 85% and another at 40% is a 45 pp skew — that’s a clear candidate for stream migration.

Can I automate stream migration?

There’s no built-in auto-rebalancer in NATS, but you can script it using nats stream cluster peer-remove and placement tags. The key is to migrate one stream at a time and wait for the new replica to fully sync before moving the next. Automating this with appropriate safety checks (verify sync status, check target server capacity) is a good operational investment.
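
As one shape for that sync-status safety check, the sketch below shells out to the nats CLI and blocks until every replica of a stream reports current before the next migration proceeds; the stream name, peer name, and timeout are placeholders.

package main

import (
    "encoding/json"
    "fmt"
    "os/exec"
    "time"
)

// streamInfo captures just the replica state from `nats stream info --json`.
type streamInfo struct {
    Cluster struct {
        Replicas []struct {
            Name    string `json:"name"`
            Current bool   `json:"current"`
        } `json:"replicas"`
    } `json:"cluster"`
}

// waitForSync polls the stream until every replica reports current,
// the safety check to run between migrations.
func waitForSync(stream string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        out, err := exec.Command("nats", "stream", "info", stream, "--json").Output()
        if err != nil {
            return err
        }
        var info streamInfo
        if err := json.Unmarshal(out, &info); err != nil {
            return err
        }
        synced := len(info.Cluster.Replicas) > 0
        for _, r := range info.Cluster.Replicas {
            if !r.Current {
                synced = false
            }
        }
        if synced {
            return nil
        }
        time.Sleep(5 * time.Second)
    }
    return fmt.Errorf("stream %s did not sync within %s", stream, timeout)
}

func main() {
    // Example migration step: move one replica, then block until healthy again.
    if err := exec.Command("nats", "stream", "cluster", "peer-remove", "ORDERS", "server-1").Run(); err != nil {
        panic(err)
    }
    if err := waitForSync("ORDERS", 10*time.Minute); err != nil {
        panic(err)
    }
    fmt.Println("ORDERS fully synced; safe to migrate the next stream")
}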

Will adding more storage to the saturated server fix this?

It fixes the immediate saturation risk but not the skew. If the underlying cause is uneven stream placement, the same server will fill up again — it’ll just take longer. Adding storage is a valid short-term measure, but you should also redistribute streams to prevent recurrence.

Does this check apply to R1 streams?

R1 streams have a single copy, so there are no replicas to redistribute. However, R1 streams can still be migrated by recreating them on a different server (with downtime) or by converting them to R3, letting the replicas sync, then converting back to R1 with the desired placement. For R1 workloads, the fix is primarily about steering future stream creation to underutilized servers.

What if all servers are saturated with no skew?

That’s a different problem — the cluster genuinely needs more storage capacity. Add servers, increase max_storage limits, or reduce data retention. This check specifically flags the combination of saturation plus skew, where the cluster has enough total capacity but it’s in the wrong place.

Proactive monitoring for NATS JetStream storage saturation with skew using Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial