A JetStream storage saturation with skew alert fires when a server is approaching its JetStream storage capacity and the cluster simultaneously exhibits significant storage utilization imbalance between nodes. This combination is dangerous: the saturated server cannot accommodate new streams or growth, but the cluster has available capacity on other servers that isn’t being used. The problem is solvable — but only if you actively redistribute the load.
Storage saturation on its own is serious. Storage saturation combined with skew means the cluster is failing to use its resources effectively, and the failure mode is worse than it appears.
The saturated server will reject new work. When a JetStream server reaches its max_storage limit, it cannot accept new stream replicas or allow existing streams to grow. Any publish to a stream whose leader is on the saturated server will fail with an “insufficient storage” error. This is a hard stop, not a gradual degradation.
Skew means the cluster has capacity but can’t use it. If all servers were equally utilized, saturation would mean the cluster genuinely needs more storage. But skew reveals that some servers have significant free capacity. The problem isn’t insufficient cluster-wide storage — it’s that the storage is in the wrong place.
Automatic rebalancing doesn’t exist. NATS does not automatically migrate stream replicas from full servers to empty ones. Streams stay where they were placed. Without manual intervention, the saturated server continues rejecting work while peers sit partially empty. This makes the problem permanent unless an operator acts.
The next failure is amplified. If the saturated server crashes (and servers under resource pressure are more crash-prone), the streams it hosted need to be recovered. If the crash corrupts storage or the server can’t restart, stream replicas that were on that server must be rebuilt on peers — peers that hopefully have the storage capacity to absorb them. With skew, the peers likely do have capacity, but only if the streams are actively migrated before the crisis.
Initial stream placement without capacity awareness. Streams were created without placement tags or preferences, and the JetStream placement algorithm happened to favor one server — often the current meta leader or the server with the lowest latency at creation time. Over time, that server accumulated more streams than its peers.
Uneven stream growth rates. Different streams grow at different rates. A server that initially had a fair share of streams may become saturated if the streams it hosts are the highest-volume ones. Meanwhile, servers hosting lower-volume streams remain underutilized.
Server added to cluster without stream redistribution. A new server was added to increase cluster capacity, but existing streams weren’t rebalanced to use it. The new server sits empty while old servers remain full.
Replicas not spread evenly. R3 streams place replicas on 3 servers. If the cluster has 5 servers but replica placement consistently picks the same 3, those 3 servers fill up while the other 2 remain underutilized.
Heterogeneous storage configurations. Servers have different max_storage limits. A server with a smaller limit saturates faster even if it hosts the same number of streams as peers with larger limits. The skew is a side effect of mismatched provisioning.
```shell
# Show JetStream resource usage per server
nats server report jetstream
```

Look for servers where storage utilization is above 80% while other servers in the same cluster are below 50%. The gap between the highest and lowest utilization is your skew.
```shell
# Show streams with their placement and sizes
nats stream report
```
```shell
# Filter to see streams on a specific server
nats stream report --json | jq '.streams[] | select(.cluster.leader == "server-1") | {name: .name, bytes: .state.bytes}'

# Get storage stats for all servers
nats server report jetstream --json | jq '[.servers[] | {
  name: .name,
  used_gb: (.storage / 1073741824),
  reserved_gb: (.reserved_storage / 1073741824),
  pct: ((.storage / .reserved_storage) * 100)
}] | sort_by(-.pct)'
```

The same check can be scripted against the /jsz monitoring endpoint. In Go:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"sort"
)

type JSZ struct {
	Streams  int   `json:"streams"`
	Store    int64 `json:"store"`
	Reserved int64 `json:"reserved_store"`
}

func getJSZ(host string) (*JSZ, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s:8222/jsz", host))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	var jsz JSZ
	if err := json.Unmarshal(body, &jsz); err != nil {
		return nil, err
	}
	return &jsz, nil
}

func main() {
	servers := []string{"server-1", "server-2", "server-3"}
	type stat struct {
		name string
		pct  float64
		used int64
		free int64
	}

	var stats []stat
	for _, s := range servers {
		jsz, err := getJSZ(s)
		if err != nil {
			continue
		}
		pct := float64(jsz.Store) / float64(jsz.Reserved) * 100
		stats = append(stats, stat{
			name: s, pct: pct, used: jsz.Store,
			free: jsz.Reserved - jsz.Store,
		})
	}

	// Highest utilization first
	sort.Slice(stats, func(i, j int) bool {
		return stats[i].pct > stats[j].pct
	})

	for _, s := range stats {
		fmt.Printf("%-12s %5.1f%% used, %d GB free\n",
			s.name, s.pct, s.free/1024/1024/1024)
	}

	if len(stats) >= 2 {
		skew := stats[0].pct - stats[len(stats)-1].pct
		fmt.Printf("\nSkew: %.1f pp between highest and lowest\n", skew)
	}
}
```

And the same check in Python:

```python
import asyncio
import aiohttp

async def check_skew():
    servers = ["server-1", "server-2", "server-3"]
    stats = []

    async with aiohttp.ClientSession() as session:
        for server in servers:
            try:
                async with session.get(
                    f"http://{server}:8222/jsz"
                ) as resp:
                    data = await resp.json()
                    used = data.get("store", 0)
                    reserved = data.get("reserved_store", 1)
                    pct = used / reserved * 100
                    stats.append({
                        "name": server,
                        "pct": pct,
                        "used_gb": used / 1024**3,
                        "free_gb": (reserved - used) / 1024**3,
                    })
            except Exception as e:
                print(f"Error reaching {server}: {e}")

    stats.sort(key=lambda s: -s["pct"])
    for s in stats:
        print(f"{s['name']:12s} {s['pct']:5.1f}% used, "
              f"{s['free_gb']:.1f} GB free")

    if len(stats) >= 2:
        skew = stats[0]["pct"] - stats[-1]["pct"]
        print(f"\nSkew: {skew:.1f} pp between highest and lowest")

asyncio.run(check_skew())
```

Stop new stream creation on the saturated server. Use placement tags to steer new streams to servers with available capacity:
```shell
# Create new streams on underutilized servers only
nats stream add NEW_STREAM \
  --subjects "new.>" \
  --replicas 3 \
  --tag role:storage-available
```

Purge or trim large streams on the saturated server. If any streams hold expired or unnecessary data, reclaim space immediately:
```shell
# Check for streams with high storage that could be trimmed
nats stream report --json | jq '.streams[] | select(.state.bytes > 10737418240) | {name: .name, gb: (.state.bytes / 1073741824)}'
```
```shell
# Purge old data if retention allows
nats stream purge LARGE_STREAM --keep 1000000
```

Move stream replicas from the saturated server. Edit stream placement to include underutilized servers:
```shell
# Edit stream placement to prefer underutilized servers
nats stream edit ORDERS --tag region:us-east
```
```shell
# Or use peer-remove to move a replica off the saturated server
nats stream cluster peer-remove ORDERS server-1
```

After removing a peer, the stream will attempt to place a new replica on another server in the cluster. Verify it lands on an underutilized server:
```shell
nats stream info ORDERS
```

Migrate in batches, not all at once. Moving many streams simultaneously creates a replication storm that can overload the cluster. Move 1–2 streams at a time, wait for them to fully sync, then move the next batch:
```shell
# Move one stream, wait for sync, repeat
nats stream cluster peer-remove ORDERS server-1

# Wait for "current" status on all replicas
watch 'nats stream info ORDERS --json | jq ".cluster.replicas[].current"'
```

Use placement tags for capacity-aware distribution. Tag servers by their storage tier and use those tags when creating streams:
```
# In the nats-server configuration file
server_tags: ["storage:large", "region:us-east"]
```

```shell
# Create streams with placement preferences
nats stream add EVENTS \
  --subjects "events.>" \
  --replicas 3 \
  --tag storage:large
```

Implement periodic rebalancing. Schedule a recurring job that checks storage utilization across servers and migrates replicas when skew exceeds a threshold. This prevents gradual drift from turning into an emergency.
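As a sketch, the threshold decision such a job would make can be a few lines of Python. The helper name and the 30 pp default are illustrative assumptions, not part of the NATS tooling:

```python
# Minimal sketch: decide whether utilization skew warrants rebalancing.
# Function name and 30 pp default threshold are illustrative choices.

def needs_rebalance(utilization_pct: dict, threshold_pp: float = 30.0) -> bool:
    """Return True when the gap between the most- and least-utilized
    servers exceeds the threshold, measured in percentage points."""
    if len(utilization_pct) < 2:
        return False
    values = list(utilization_pct.values())
    return max(values) - min(values) > threshold_pp

# Example: 85% vs 40% is a 45 pp skew, so rebalancing is warranted
print(needs_rebalance({"server-1": 85.0, "server-2": 62.0, "server-3": 40.0}))
```

The inputs would come from the same /jsz utilization percentages computed by the diagnostic scripts above.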
Standardize server storage configurations. Ensure all servers in the cluster have the same max_storage limit (or at least similar capacities). Heterogeneous configurations are a primary source of skew.
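A minimal sketch of what a standardized configuration looks like; the 500GB and 8GB figures are illustrative, and in the nats-server configuration file the per-server limits live in the jetstream block:

```
# Identical on every server in the cluster (sizes are illustrative)
jetstream {
  store_dir: "/data/jetstream"
  max_file_store: 500GB
  max_memory_store: 8GB
}
```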
Monitor with Synadia Insights. Insights detects the combination of saturation and skew automatically, alerting you when the conditions exist — before the saturated server actually runs out of space. This gives you a window to act proactively rather than reactively.
As a rule of thumb, if the highest-utilized server is more than 30 percentage points above the lowest, the skew is significant enough to warrant rebalancing. For example, one server at 85% and another at 40% is a 45 pp skew — that’s a clear candidate for stream migration.
There’s no built-in auto-rebalancer in NATS, but you can script it using nats stream cluster peer-remove and placement tags. The key is to migrate one stream at a time and wait for the new replica to fully sync before moving the next. Automating this with appropriate safety checks (verify sync status, check target server capacity) is a good operational investment.
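As a sketch of the safety check such a script needs, assuming the JSON shape produced by nats stream info --json (helper names are illustrative; the migration step shells out to the nats CLI and is not run here):

```python
import json
import subprocess
import time

def replicas_current(stream_info: dict) -> bool:
    """True when every replica reported by 'nats stream info --json'
    is caught up ("current": true), i.e. safe to move the next stream."""
    replicas = stream_info.get("cluster", {}).get("replicas") or []
    return bool(replicas) and all(r.get("current", False) for r in replicas)

def migrate_one(stream: str, peer: str) -> None:
    """Illustrative: remove one peer, then poll until replicas sync.
    Requires the nats CLI and a reachable cluster."""
    subprocess.run(
        ["nats", "stream", "cluster", "peer-remove", stream, peer],
        check=True)
    while True:
        out = subprocess.run(
            ["nats", "stream", "info", stream, "--json"],
            capture_output=True, text=True, check=True).stdout
        if replicas_current(json.loads(out)):
            break
        time.sleep(5)
```

A batch rebalancer would call migrate_one for each stream in turn, checking target-server capacity (via /jsz, as in the diagnostic scripts above) before each move.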
Adding more storage to the saturated server fixes the immediate saturation risk but not the skew. If the underlying cause is uneven stream placement, the same server will fill up again; it will just take longer. Adding storage is a valid short-term measure, but you should also redistribute streams to prevent recurrence.
R1 streams have a single copy, so there are no replicas to redistribute. However, R1 streams can still be migrated by recreating them on a different server (with downtime) or by converting them to R3, letting the replicas sync, then converting back to R1 with the desired placement. For R1 workloads, the fix is primarily about steering future stream creation to underutilized servers.
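A sketch of the scale-up/scale-down approach for moving an R1 stream; the stream name EVENTS_R1 is illustrative, and each step should be verified complete before running the next:

```shell
# Temporarily raise the replica count so copies exist on other servers
nats stream edit EVENTS_R1 --replicas 3

# Wait until all replicas report "current"
nats stream info EVENTS_R1

# Remove the replica on the saturated server
nats stream cluster peer-remove EVENTS_R1 server-1

# Scale back down; the remaining replica now lives elsewhere
nats stream edit EVENTS_R1 --replicas 1
```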
If every server is near capacity and there is no skew, that’s a different problem: the cluster genuinely needs more storage. Add servers, increase max_storage limits, or reduce data retention. This check specifically flags the combination of saturation plus skew, where the cluster has enough total capacity but it’s in the wrong place.