
NATS Meta Pending High: What It Means and How to Fix It

Severity: Warning
Category: Performance
Applies to: Meta Cluster
Check ID: META_008
Detection threshold: pending Raft operations on the meta cluster leader exceed the configured maximum (default: 500)

Meta pending high means the NATS meta cluster leader has more queued Raft operations than it can apply — the meta group is falling behind on consensus. The meta cluster is the Raft group that manages all JetStream assets — streams, consumers, and their placement — across the cluster. When pending operations accumulate, every JetStream API call (create stream, add consumer, update configuration) slows down or times out, affecting the entire deployment.

Why this matters

The meta cluster is the control plane for JetStream. Every stream creation, consumer assignment, leader election, and placement decision flows through the meta group’s Raft log. When the meta leader can’t apply operations fast enough, the pending queue grows. Clients experience this as increasing latency on JetStream API calls — a stream creation that normally takes 50ms might take seconds or fail with a timeout.

The problem compounds under load. When JetStream API calls time out, clients often retry, adding more operations to the already-overloaded queue. Automated systems — Kubernetes operators, deployment scripts, monitoring tools polling $JS.API.INFO — can generate hundreds of requests per second. Each retry increases pressure on the meta leader, turning a transient slowdown into a sustained backlog. In severe cases, the meta leader steps down under pressure, triggering a leader election that briefly halts all JetStream API operations cluster-wide.
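
If you control the clients, exponential backoff with jitter breaks this retry amplification. A minimal Go sketch using the nats.go client; the "ORDERS" stream config is illustrative:

// Sketch: exponential backoff with jitter around a JetStream API call,
// so timeouts against an overloaded meta leader don't become a retry storm.
package main

import (
    "log"
    "math/rand"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Illustrative stream config; substitute your own.
    cfg := nats.StreamConfig{Name: "ORDERS", Subjects: []string{"orders.>"}}
    backoff := 250 * time.Millisecond
    for attempt := 1; attempt <= 5; attempt++ {
        if _, err = js.AddStream(&cfg); err == nil {
            break
        }
        log.Printf("AddStream attempt %d failed: %v", attempt, err)
        // Full jitter: sleep a random duration up to the current backoff cap,
        // then double the cap for the next attempt.
        time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
        backoff *= 2
    }
}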

High meta pending is also an early warning for META_006 (Meta Quorum Lost). If the pending queue grows because followers can’t acknowledge fast enough — due to network partitions or resource exhaustion — the meta group’s health is degrading. Catching high pending early lets you intervene before the meta group loses quorum entirely, which would stall all JetStream operations.

Common causes

  • High rate of JetStream API operations. Bulk stream or consumer creation — during deployment, migration, or automated provisioning — floods the meta group with proposals. Each operation is a Raft proposal that must be committed by a majority and then applied. At hundreds of operations per second, even a healthy meta group can fall behind.

  • Slow disk I/O on the meta leader. Raft commits require durable writes to the WAL (write-ahead log). If the meta leader’s disk is slow — spinning disk, shared storage, I/O contention from co-located JetStream streams — each commit takes longer, and the pending queue grows. This is the most common hardware-related cause.

  • Large meta state. The meta group tracks every stream and consumer replica in the cluster. Deployments with thousands of Raft groups (META_005 threshold: 5,000) have proportionally larger snapshots and more state to manage. Periodic snapshots block the apply loop, causing transient pending spikes.

  • Network latency between meta group members. Raft requires a majority acknowledgment for each commit. If network round-trip time between the leader and followers is high — cross-region deployments, congested links — commit latency increases and pending operations accumulate while waiting for follower responses.

  • CPU pressure on the meta leader. The meta leader must serialize proposals, manage the Raft log, apply committed entries, and handle snapshots. If the server also handles heavy message routing or hosts many stream leaders, CPU contention slows the Raft apply loop. This is especially common on undersized servers handling both data plane and control plane work.

How to diagnose

Check meta group status

View the meta cluster state including pending operations:

nats server report jetstream

The meta group section shows the leader, followers, and their current state. Look for the pending/lag column — values above 500 (the default threshold) indicate the leader is falling behind.

For more detailed Raft state:

nats server request jetstream --leader
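
The same data is available programmatically from the server's HTTP monitoring endpoint. A minimal Go sketch that polls /jsz and reads the meta cluster's pending count; it assumes the monitoring port is enabled on 8222, and the JSON field names shown are assumptions to verify against your server version:

// Sketch: poll the /jsz monitoring endpoint and read the meta cluster's
// pending count. Assumes the HTTP monitoring port is enabled on 8222;
// the JSON field names below are assumptions to verify for your version.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

type jszResponse struct {
    Meta struct {
        Leader  string `json:"leader"`
        Pending int    `json:"pending"`
    } `json:"meta_cluster"`
}

func main() {
    resp, err := http.Get("http://localhost:8222/jsz")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var jsz jszResponse
    if err := json.NewDecoder(resp.Body).Decode(&jsz); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("meta leader: %s, pending: %d\n", jsz.Meta.Leader, jsz.Meta.Pending)
}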

Check JetStream API latency

If meta pending is high, JetStream API calls will be slow. Measure current API responsiveness:

# Time a simple JetStream info request
time nats account info

If this takes more than a second, the meta group is under pressure.
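
To probe this from application code instead of the shell, time a JetStream account info request with the client. A minimal Go sketch using nats.go:

// Sketch: time a JetStream AccountInfo request as a rough probe of
// meta group responsiveness.
package main

import (
    "fmt"
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    start := time.Now()
    if _, err := js.AccountInfo(); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("AccountInfo took %v\n", time.Since(start))
}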

Check the meta leader’s resource usage

Identify which server is the meta leader and inspect its resource usage:

# Find the meta leader
nats server report jetstream
# Check CPU and memory per server
nats server list

Also check disk I/O on the meta leader’s host — this is often the bottleneck but isn’t visible through the NATS CLI. Use OS-level tools (iostat, iotop) on the server host.

Check the total Raft group count

A large number of Raft groups increases meta state size and snapshot time:

nats server report jetstream --streams --consumers

Count the total number of stream and consumer replicas. If this exceeds several thousand, meta state size is likely a contributing factor.
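
For a rough programmatic count, walk streams and consumers with the client API; each stream and each consumer is tracked by the meta cluster, so counting one group per asset is a reasonable approximation. A minimal Go sketch using nats.go:

// Sketch: approximate the total Raft group count by walking all streams
// and consumers. Treating each stream and each consumer as one group is
// an approximation; R1 assets place less load on the meta cluster.
package main

import (
    "fmt"
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    groups := 0
    for info := range js.StreamsInfo() {
        groups++ // one Raft group per stream
        for range js.ConsumersInfo(info.Config.Name) {
            groups++ // one Raft group per consumer
        }
    }
    fmt.Printf("approximate Raft groups: %d\n", groups)
}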

Check network latency between meta group members

nats server list

Look specifically at route connections between the servers that form the meta group. RTT above 10ms within the same datacenter or above 50ms across regions adds commit latency.
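
A client can only measure client-to-server round-trip time, not server-to-server, but running a quick RTT probe from a host near each meta group member still gives a rough picture. A minimal Go sketch using the client's RTT helper; the server address is illustrative:

// Sketch: measure client-to-server round-trip time with the nats.go RTT
// helper. Run it from a host near each meta group member for a rough
// picture; it does not measure server-to-server RTT directly.
package main

import (
    "fmt"
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://server-1:4222") // illustrative address
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    rtt, err := nc.RTT()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("RTT to %s: %v\n", nc.ConnectedUrl(), rtt)
}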

How to fix it

Immediate: reduce API pressure

Throttle or pause automated JetStream operations. If a deployment script or operator is bulk-creating streams and consumers, slow it down. Add delays between operations to let the meta group drain its pending queue:

// Go — add delay between bulk JetStream operations
js, _ := nc.JetStream()
for _, cfg := range streamConfigs {
    _, err := js.AddStream(&cfg)
    if err != nil {
        log.Printf("Failed to create stream %s: %v", cfg.Name, err)
    }
    time.Sleep(100 * time.Millisecond) // Let meta group catch up
}

# Python — throttle bulk JetStream API calls
import asyncio
import nats

async def main():
    nc = await nats.connect()
    js = nc.jetstream()

    for cfg in stream_configs:
        await js.add_stream(cfg)
        await asyncio.sleep(0.1)  # Avoid overwhelming meta group

Check server CPU, disk I/O, and network latency on the meta leader. These are the three most common bottlenecks. Use OS-level tools (iostat, top, ping) on the meta leader host to identify which resource is saturated.

If a monitoring tool or scheduled job is also generating high JetStream API volume, throttle it as well so the meta group can drain its pending queue.

Step down the meta leader to a faster server. If the current leader is on a resource-constrained node, forcing a leader election to a server with faster disks or more CPU can immediately reduce pending:

nats server raft step-down

Short-term: address the bottleneck

Ensure fast storage on all meta group servers. NVMe SSDs are strongly recommended for servers participating in the meta group. Raft WAL writes are the critical path — every millisecond of disk latency adds directly to commit time. If the meta leader is on shared or slow storage, migrate it to dedicated fast storage.

Reduce the total Raft group count. Each stream replica and consumer replica is a separate Raft group tracked by the meta cluster. Consolidate small streams, remove unused consumers, and reduce replica counts where R3 isn’t necessary:

# Find inactive streams that could be removed
nats stream list --all
# Review consumers and their unprocessed message counts
nats consumer report <stream_name>

Reduce noisy API polling. Monitoring tools that call $JS.API.INFO or $JS.API.STREAM.LIST on tight intervals add meta group load. Increase polling intervals or use advisories instead of polling where possible.
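
For example, instead of polling stream state, a monitor can subscribe to the advisories the server publishes under the $JS.EVENT.ADVISORY.> subject space. A minimal Go sketch:

// Sketch: consume JetStream advisories instead of polling the JS API.
// The server publishes events (stream/consumer actions, leader elections,
// and so on) under the $JS.EVENT.ADVISORY.> subject space.
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    _, err = nc.Subscribe("$JS.EVENT.ADVISORY.>", func(m *nats.Msg) {
        log.Printf("advisory on %s: %s", m.Subject, string(m.Data))
    })
    if err != nil {
        log.Fatal(err)
    }
    select {} // block forever; advisories arrive on the callback
}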

Long-term: scale the control plane

Separate control plane from data plane. Dedicate specific servers to the meta group that don’t also handle heavy message routing. In large deployments, the meta leader should have ample CPU, fast NVMe storage, and low-latency network connections to other meta group members.

Batch JetStream provisioning. Design your provisioning workflows to create streams and consumers in controlled batches rather than all at once. This is especially important in CI/CD pipelines and Kubernetes operators that may create many JetStream assets during deployment.
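
A minimal Go sketch of the batching pattern, reusing the js context and the illustrative streamConfigs slice from the earlier example; batch size and pause are assumptions to tune against your observed pending count:

// Sketch: create streams in controlled batches with a pause between
// batches, rather than all at once. Batch size and pause are
// illustrative; tune them against the meta group's pending count.
const batchSize = 10

for i := 0; i < len(streamConfigs); i += batchSize {
    end := i + batchSize
    if end > len(streamConfigs) {
        end = len(streamConfigs)
    }
    for _, cfg := range streamConfigs[i:end] {
        if _, err := js.AddStream(&cfg); err != nil {
            log.Printf("create %s: %v", cfg.Name, err)
        }
    }
    time.Sleep(2 * time.Second) // let the meta group drain between batches
}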

Monitor meta pending as a leading indicator. Set up alerting on meta pending count before it reaches the threshold. A gradual upward trend in pending — even below 500 — indicates the meta group is approaching its capacity. Address it proactively rather than waiting for API timeouts.

Frequently asked questions

What happens if meta pending keeps growing?

If the meta leader can’t drain its pending queue, JetStream API operations start timing out. Clients receive errors when creating or modifying streams and consumers. In extreme cases, the meta leader may step down due to resource pressure, triggering a leader election. During election, all JetStream API operations are briefly unavailable. If the underlying cause (slow disk, high API rate) isn’t addressed, the new leader will experience the same problem.

Is meta pending high the same as JS API pending high (JETSTREAM_005)?

They’re related but different. JETSTREAM_005 measures the number of inflight JetStream API requests waiting for responses — this is the client-facing queue. META_008 measures the Raft-level pending operations on the meta leader — this is the internal consensus queue. High JS API pending can cause high meta pending (more API requests mean more Raft proposals), but meta pending can also be high due to disk or network issues even at moderate API request rates.

How many Raft groups is too many?

The META_005 check uses a default threshold of 5,000 total Raft groups. In practice, the meta group handles a few thousand groups comfortably on modern hardware with NVMe storage. Beyond 5,000, snapshot times increase and the meta leader needs more CPU and I/O capacity. If you’re seeing meta pending high with fewer than 5,000 groups, the bottleneck is more likely disk I/O or network latency than state size.

Can I increase the meta pending threshold?

Yes — the threshold is configurable in Synadia Insights. But increasing the threshold only suppresses the alert; it doesn’t fix the underlying performance issue. A higher threshold is appropriate if your workload has periodic bursts of JetStream API activity (e.g., scheduled deployments) that temporarily spike pending but drain quickly. If pending is sustained above the threshold, address the root cause instead.

How do I prevent meta pending spikes during deployments?

Stagger stream and consumer creation. Instead of creating 200 streams simultaneously, batch them in groups of 10-20 with short pauses between batches. Most Kubernetes operators and infrastructure-as-code tools support rate limiting or parallelism controls. Also avoid scheduling multiple deployments that create JetStream assets at the same time.

Proactive monitoring for NATS meta pending high with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial