JetStream storage skew means one server in your NATS cluster uses more than double the cluster average disk storage. This imbalance signals that data is concentrated on specific nodes rather than distributed evenly, creating a single point of storage pressure that can fill before other servers are even moderately utilized.
Disk capacity is finite, and the server that fills first determines your cluster’s effective storage ceiling. If one server holds 80 GiB of JetStream data while its peers hold 20 GiB each, that server hits its storage reservation limit while 75% of the cluster’s aggregate disk capacity remains unused. New streams can’t be placed on the full server, and existing streams on it start rejecting writes when storage is exhausted — regardless of how much space is free elsewhere.
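The 2× threshold can be checked mechanically once you have per-server byte counts (for example, from `nats server report jetstream`). A minimal Go sketch — the server names and sizes below are illustrative, mirroring the 80 GiB vs 20 GiB example above:

```go
package main

import "fmt"

// isSkewed reports whether any server's storage exceeds twice the
// cluster average, the skew threshold used in this check.
func isSkewed(bytesPerServer map[string]uint64) (string, bool) {
	if len(bytesPerServer) == 0 {
		return "", false
	}
	var total uint64
	for _, b := range bytesPerServer {
		total += b
	}
	avg := float64(total) / float64(len(bytesPerServer))
	for name, b := range bytesPerServer {
		if float64(b) > 2*avg {
			return name, true
		}
	}
	return "", false
}

func main() {
	gib := uint64(1) << 30
	// One server at 80 GiB, three peers at 20 GiB each:
	// average is 35 GiB, so 80 GiB crosses the 2x line.
	usage := map[string]uint64{
		"nats-1": 80 * gib,
		"nats-2": 20 * gib,
		"nats-3": 20 * gib,
		"nats-4": 20 * gib,
	}
	if name, ok := isSkewed(usage); ok {
		fmt.Printf("storage skew: %s exceeds 2x cluster average\n", name)
	}
}
```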
Storage skew also affects I/O performance. The server with the most data handles the most disk reads and writes: compaction, snapshot creation, and message retrieval all scale with data volume. During periods of high write throughput, the overloaded server’s disk becomes the bottleneck. Raft log replication to that server slows because its disk queue is deeper, which in turn increases replication lag for every stream it hosts. The cluster appears healthy, but one server is quietly degrading.
The risk becomes acute during recovery scenarios. If the storage-heavy server restarts, it must replay or restore more data than any other server, extending its recovery time. Other servers that depend on it for Raft quorum wait longer for it to catch up. A server that should be interchangeable with its peers instead becomes the critical path in every failure and recovery scenario.
A few large streams placed on the same server. One or two high-volume streams with large retention windows can dominate storage on whatever server hosts their leader or replicas. If these streams were created without placement constraints, they may all land on the same node.
R1 streams concentrated on one server. Unreplicated (R1) streams exist only on their leader’s server. If many R1 streams were created while one server happened to be the meta leader’s preferred target, all their storage accumulates on that single node.
Uneven retention policies. Streams with max_age of 30 days accumulate far more data than streams with 24-hour retention. If long-retention streams cluster on one server, that server’s storage grows disproportionately — even if replica counts are balanced.
Streams with different message sizes. A stream receiving 1 KB messages and another receiving 100 KB messages have very different storage footprints even at the same message rate. If the large-message streams land on the same server, storage skews quickly.
No storage rebalancing after cluster changes. Adding a new server to the cluster doesn’t redistribute existing data. The new server starts empty while existing servers retain their accumulated storage. Without explicit migration, the skew persists indefinitely.
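The retention and message-size effects above compound multiplicatively: a limits-retention stream's steady-state footprint is roughly message rate × message size × retention window. A small sketch with illustrative numbers (not taken from the source):

```go
package main

import "fmt"

// steadyStateBytes estimates a stream's footprint once it has aged
// past its max_age window: msgs/sec x bytes/msg x retention seconds.
func steadyStateBytes(msgsPerSec, msgBytes, retentionSecs float64) float64 {
	return msgsPerSec * msgBytes * retentionSecs
}

func main() {
	day := 86400.0
	// Same 100 msg/s rate, very different footprints:
	a := steadyStateBytes(100, 1024, 1*day)    // 1 KB messages, 24 h retention
	b := steadyStateBytes(100, 102400, 30*day) // 100 KB messages, 30 d retention
	fmt.Printf("stream A: %.1f GiB\n", a/(1<<30))
	fmt.Printf("stream B: %.1f GiB\n", b/(1<<30))
}
```

Two streams at the same message rate can differ by three orders of magnitude in storage; if the heavy one's replicas cluster on one server, that server skews fast.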
```
nats server report jetstream
```

The output shows File and Memory storage per server. Compare the File column across all servers. A server using more than 2× the cluster average is skewed.
```
nats stream report
```

This lists all streams with their byte sizes and cluster placement. Sort mentally or pipe through jq to find the largest streams and note which servers host them.
```
# Direct monitoring endpoint for a specific server
curl -s http://localhost:8222/jsz | jq '{memory: .memory, storage: .storage, streams: .streams, consumers: .consumers}'
```

Check this endpoint on each server to get exact byte counts. The delta between the highest and lowest server storage reveals the skew magnitude.
```
nats stream info <large_stream_name>
```

For the largest streams identified above, check which servers host their replicas. If the same server appears repeatedly as a replica host for large streams, that explains the storage skew.
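To make "appears repeatedly" concrete, you can tally bytes per server across the replica lists. A hedged Go sketch — the stream sizes and placements are illustrative, not real output:

```go
package main

import (
	"fmt"
	"sort"
)

// streamInfo is the subset of `nats stream info` data needed here:
// the stream's stored bytes and the servers hosting its replicas.
type streamInfo struct {
	name     string
	bytes    uint64
	replicas []string
}

// tally attributes each stream's full size to every server that
// hosts one of its replicas, returning bytes per server.
func tally(streams []streamInfo) map[string]uint64 {
	perServer := map[string]uint64{}
	for _, s := range streams {
		for _, r := range s.replicas {
			perServer[r] += s.bytes
		}
	}
	return perServer
}

func main() {
	streams := []streamInfo{
		{"EVENTS", 50 << 30, []string{"nats-1", "nats-2", "nats-3"}},
		{"METRICS", 25 << 30, []string{"nats-1", "nats-4", "nats-5"}},
		{"AUDIT", 5 << 30, []string{"nats-1", "nats-2", "nats-4"}},
	}
	perServer := tally(streams)

	// Print servers from most to least loaded; nats-1 hosts a
	// replica of every large stream and tops the list.
	names := make([]string, 0, len(perServer))
	for n := range perServer {
		names = append(names, n)
	}
	sort.Slice(names, func(i, j int) bool { return perServer[names[i]] > perServer[names[j]] })
	for _, n := range names {
		fmt.Printf("%s: %d GiB\n", n, perServer[n]>>30)
	}
}
```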
```
nats server report jetstream
```

Compare each server’s used storage against its reserved storage. The skewed server may be approaching its reservation limit while others have abundant headroom. If it’s above 90%, check SERVER_005 (JetStream Resource Pressure) as well.
If the skewed server is approaching its storage limit, purge data from its largest streams or apply stricter retention:
```
# Purge all messages from a specific stream
nats stream purge <stream_name>
```
```
# Set a retention limit to cap storage
nats stream edit <stream_name> --max-bytes 10GiB
```

This buys time but doesn’t address the placement problem.
If the cluster is genuinely under-provisioned for the workload — every server is filling, not just the skewed one — add storage capacity. Provisioning a larger volume on the skewed server (or expanding the existing one) raises that server’s reservation ceiling and lets it stay above water while you redistribute. Re-balance after capacity is added; otherwise the new headroom just delays the next pressure event.
Move the largest streams off the overloaded server by adjusting placement:
```go
// Go — move a stream to a different server group via placement tags
js, _ := nc.JetStream()

_, err := js.UpdateStream(&nats.StreamConfig{
	Name:     "EVENTS",
	Subjects: []string{"events.>"},
	Replicas: 3,
	Placement: &nats.Placement{
		Tags: []string{"storage:high"},
	},
})
```

```typescript
// TypeScript (nats.js)
const jsm = await nc.jetstreamManager();

await jsm.streams.update("EVENTS", {
  subjects: ["events.>"],
  num_replicas: 3,
  placement: { tags: ["storage:high"] },
});
```

For R1 streams, moving them requires scaling to R3 (which creates replicas on other servers), then scaling back to R1 — the new leader may land on a different server:
```
# Scale up to create replicas on other servers
nats stream edit <stream_name> --replicas 3

# Wait for sync to complete
nats stream info <stream_name>

# Step down to potentially move leadership
nats stream cluster step-down <stream_name>

# Scale back to R1
nats stream edit <stream_name> --replicas 1
```

Establish per-server storage budgets and placement policies:
Tag servers by storage tier. Use server tags like storage:high for servers with large disks and storage:standard for others. Direct high-volume streams to appropriate tiers.
Set stream retention limits at creation time. Every stream should have at least one of max_bytes, max_age, or max_msgs to prevent unbounded growth (see OPT_SYS_001).
Monitor storage distribution continuously. Track per-server JetStream storage as a Prometheus metric and alert when any server exceeds 1.5× the cluster average.
Review stream placement quarterly. As workloads evolve, streams grow at different rates. Periodic audits catch skew before it becomes a capacity problem.
No. NATS distributes stream replicas at creation time based on available resources, but it does not rebalance after initial placement. As streams grow at different rates and new servers are added, storage naturally drifts out of balance. Rebalancing requires manual intervention — moving streams via placement tags or cycling replica counts.
For replicated streams (R3+), yes. Update the stream’s placement tags to target different servers, and NATS will migrate replicas. During migration, the stream remains available because quorum is maintained. For R1 streams, there’s a brief window of reduced availability when cycling through R3 and back — the stream is always available, but the R1→R3→R1 transition involves temporary replication overhead.
Storage skew (this check) is about relative distribution — one server has more than its fair share. Resource pressure (SERVER_005) is about absolute utilization — a server is approaching its storage reservation limit. They often co-occur: the skewed server is typically the first to hit resource pressure. Fixing the skew (redistributing data) directly reduces pressure on the overloaded server.
The server stops accepting new writes to streams it hosts. Existing consumers can still read, but publishers get errors. If the stream is replicated, a new leader can be elected on another server — but only if the other replicas are current. If the skewed server was also the leader for its streams, write availability depends entirely on replica health on other servers.
Uniform disk sizes simplify capacity planning and make skew easier to detect. When all servers have identical storage reservations, any imbalance in usage is clearly a placement problem, not a capacity mismatch. If you must use heterogeneous storage, configure JetStream reservations proportionally and adjust your monitoring thresholds accordingly.
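Reservations are set per server in its configuration. A sketch of a nats-server config fragment (the path and sizes are illustrative) where a larger-disk server reserves proportionally more:

```
# nats-server.conf on the large-disk server
jetstream {
    store_dir: /data/jetstream
    max_file_store: 500GB    # standard-tier servers might reserve 250GB
    max_memory_store: 1GB
}
```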