
NATS Meta Snapshot Slow: What It Means and How to Fix It

Severity: Warning
Category: Performance
Applies to: Meta Cluster
Check ID: META_004
Detection threshold: meta snapshot duration exceeds the warning (5s) or critical (30s) threshold

A slow meta snapshot means the JetStream meta cluster is taking longer than expected to write a Raft snapshot to disk. When the snapshot duration exceeds the warning threshold (5 seconds) or the critical threshold (30 seconds), JetStream API operations can stall, leader elections may time out, and the entire cluster’s ability to manage streams and consumers degrades.

Why this matters

The meta cluster is the control plane for all of JetStream. It tracks every stream, every consumer, and every replica placement across your entire NATS deployment. Raft periodically snapshots this state to compact the log — without snapshots, the Raft log grows unbounded and recovery after restart takes progressively longer. When a snapshot takes too long, everything downstream is affected.

During a snapshot, the meta group leader serializes the entire JetStream asset catalog to disk. While this happens, new Raft proposals (stream creation, consumer updates, leader elections for asset groups) queue up. A snapshot that takes 5 seconds means 5 seconds where JetStream API calls — nats stream create, nats consumer add, even nats stream info in some cases — block or return errors. At 30 seconds, you’re in territory where Raft election timeouts fire, followers assume the leader is dead, and the meta group starts unnecessary leader elections. This compounds the problem: the new leader also needs to snapshot eventually, and if the underlying issue isn’t resolved, it hits the same wall.

The blast radius is cluster-wide. Unlike a slow stream Raft group that affects one stream, a slow meta snapshot affects every JetStream operation in the deployment. If you run multi-tenant workloads, every account’s JetStream API calls stall simultaneously. Operators often mistake this for network issues or server crashes because the symptoms — timeouts, failed API calls, leader changes — look similar.

Common causes

  • Slow disk I/O. The most common cause. HDDs, network-attached storage, or shared volumes introduce latency during the sequential write of the snapshot file. Meta snapshots are write-intensive bursts — even if your disks handle steady-state stream writes fine, the snapshot’s large sequential write can saturate I/O bandwidth.

  • Large meta state from excessive JetStream assets. Every stream and consumer replica adds to the meta state. A cluster with 5,000+ Raft groups produces a snapshot that’s significantly larger than one with 500. The serialization time scales with asset count, and so does the write time.

  • Disk contention from stream writes. JetStream stream data and meta snapshots share the same store_dir by default. During high-throughput stream writes, the disk is already busy. The meta snapshot competes for I/O bandwidth, making both slower.

  • Memory pressure causing swap. If the server is swapping, the snapshot serialization step — which walks in-memory data structures — becomes orders of magnitude slower as pages are faulted in from disk.

  • Overloaded server CPU. Snapshot serialization is CPU-bound before it becomes I/O-bound. On a server already running at high CPU from message routing or TLS termination, the snapshot process competes for CPU time.

How to diagnose

Check meta group health

```sh
nats server report jetstream
```

This shows the meta group status including which server is the leader, replica lag, and whether any peers are offline. Look for high lag values on followers — this often correlates with slow snapshots on the leader.

Query the meta leader directly

```sh
nats server req jetstream --leader
```

This queries the current meta leader for JetStream state, including asset counts. A large number of total streams and consumers directly correlates with snapshot size.

Check disk I/O performance

On the meta leader server, check disk latency during snapshot windows:

```sh
# Check disk I/O stats (Linux)
iostat -xz 1 10
# Look for high await (average I/O wait time) and low throughput on the JetStream volume
```

Sustained await values above 10ms on SSD or 50ms on HDD during snapshot windows confirm disk I/O as the bottleneck.
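That rule of thumb can be captured in a tiny helper. A sketch, assuming you feed it `await` values scraped from `iostat` output; the function name and the SSD/HDD flag are hypothetical:

```go
package main

import "fmt"

// awaitSuspicious applies the rough thresholds from the text: sustained
// await above 10ms on SSD, or above 50ms on HDD, points at disk I/O
// as the snapshot bottleneck.
func awaitSuspicious(awaitMs float64, isSSD bool) bool {
	if isSSD {
		return awaitMs > 10
	}
	return awaitMs > 50
}

func main() {
	fmt.Println(awaitSuspicious(12.5, true)) // SSD above 10ms: suspicious
	fmt.Println(awaitSuspicious(30, false))  // HDD below 50ms: fine
}
```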

Check JetStream asset count

```sh
# Count total streams across the cluster
nats stream list -a --json | jq length
# Count total consumers
nats consumer report --all
```

If the total Raft group count (streams × replicas + consumers × replicas + meta group) exceeds 5,000, snapshot size is likely contributing.
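The formula in the text can be checked with simple arithmetic. A sketch, assuming a uniform replica count across assets (real deployments mix R1 and R3, so treat this as an upper-bound estimate):

```go
package main

import "fmt"

// totalRaftLoad applies the rule of thumb from the text:
// streams x replicas + consumers x replicas + the meta group itself.
func totalRaftLoad(streams, consumers, replicas int) int {
	return streams*replicas + consumers*replicas + 1
}

func main() {
	// Hypothetical cluster: 800 R3 streams and 2,500 R3 consumers.
	n := totalRaftLoad(800, 2500, 3)
	fmt.Printf("estimated Raft load: %d (investigate above 5000)\n", n)
}
```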

Monitor via the /jsz endpoint

```sh
curl -s http://localhost:8222/jsz | jq '{meta_leader: .meta.leader, meta_cluster_size: .meta.cluster_size, streams: .streams, consumers: .consumers}'
```

How to fix it

Immediate: reduce snapshot pressure

Move JetStream storage to fast SSDs. If you’re on HDDs or network-attached storage, this is the single highest-impact change. Meta snapshots are write-intensive — SSD latency (sub-millisecond) vs HDD latency (5-15ms) makes the difference between a 1-second snapshot and a 30-second snapshot:

nats-server.conf
```
jetstream {
  store_dir: "/fast-ssd/nats/jetstream"
}
```

Separate JetStream storage from OS and log volumes. Disk contention between stream writes, OS operations, and meta snapshots is a common amplifier. Dedicate a volume for store_dir.

Short-term: reduce meta state size

Remove unused streams and consumers. Every stream and consumer adds to the meta state. Audit for inactive assets:

```sh
# Find streams with no recent messages
nats stream list -a
# Find consumers with no recent deliveries
nats consumer report --all
```

Delete any streams and consumers that are no longer needed. Each removal directly reduces snapshot size.

Reduce replica counts where appropriate. R3 streams create three Raft groups tracked by meta. If some streams don’t need high availability (development streams, temporary imports), reduce them to R1:

```sh
nats stream edit <stream-name> --replicas=1
```

Long-term: architect for meta health

Implement JetStream asset governance. Set organizational limits on stream and consumer creation. Use account-level JetStream limits to prevent runaway asset creation:

```go
// Go: set account limits via JWT (github.com/nats-io/jwt/v2)
claims := jwt.NewAccountClaims(accountPub) // accountPub is the account's public key
claims.Limits.JetStreamLimits.Streams = 50   // max streams in this account
claims.Limits.JetStreamLimits.Consumer = 200 // max consumers in this account
```

Monitor disk I/O continuously. Set up Prometheus alerts on disk latency for the JetStream volume. Alert before snapshots become slow.

Consider dedicated JetStream servers. In large deployments, separating JetStream-heavy workloads from core NATS routing reduces CPU and I/O contention that amplifies snapshot latency.

Frequently asked questions

How long should a NATS meta snapshot take?

On properly provisioned hardware (SSD storage, adequate CPU), meta snapshots for clusters with under 1,000 Raft groups should complete in under 1 second. Clusters with 1,000-5,000 groups typically snapshot in 1-3 seconds. Anything over 5 seconds warrants investigation, and over 30 seconds indicates a serious infrastructure or sizing problem.

Does a slow meta snapshot cause message loss?

No — slow meta snapshots don’t directly cause message loss. Stream Raft groups handle message replication independently of the meta group. However, prolonged meta stalls can trigger cascading effects: meta leader elections temporarily halt JetStream API operations, which can cause client-side timeouts on JetStream publishes (bounded by the nats.MaxWait option). The risk is indirect but real.

What’s the difference between META_004 and META_005?

META_004 (Meta Snapshot Slow) measures how long each snapshot takes to write — it’s a performance issue. META_005 (Meta State Growth) measures how many Raft groups exist — it’s a capacity issue. They’re closely related: large meta state (META_005) is one of the primary causes of slow snapshots (META_004). Fixing META_005 often resolves META_004.

Can I disable meta snapshots?

No. Raft snapshots are an integral part of the consensus protocol in NATS and cannot be disabled. Without snapshots, the Raft log grows indefinitely, and server recovery after restart becomes progressively slower. The fix is always to make snapshots faster, not to skip them.

How does meta snapshot performance affect JetStream API latency?

During a snapshot, the meta leader continues processing Raft proposals but may do so more slowly due to I/O contention. JetStream API calls that require meta consensus — creating streams, adding consumers, updating placements — can see elevated latency proportional to the snapshot duration. Read-only operations like nats stream info that hit the leader also compete for the same I/O path.

Proactive monitoring for NATS meta snapshot slow with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.
