
NATS Waiting Critical: Causes and Remediation

Severity: Critical
Category: Health
Applies to: Consumer
Check ID: CONSUMER_007
Detection threshold: num_waiting exceeds the operator-defined io.nats.monitor.waiting-critical threshold

Waiting critical means a JetStream pull consumer has more outstanding pull requests than the operator-defined threshold. Each pull request represents a consumer instance asking the server for messages that aren’t available yet. A high num_waiting count indicates that consumer demand far exceeds the message supply — many consumers are parked, waiting for work that isn’t arriving.

Why this matters

In a pull-based consumer model, clients send pull requests to the server, which responds with available messages or holds the request until messages arrive (long polling). The num_waiting metric counts how many pull requests are currently queued and waiting for messages.

A high waiting count signals a fundamental demand-supply mismatch. More consumer instances are polling for messages than the stream can feed. Each waiting pull request consumes server memory and a slot in the consumer’s max_waiting limit (default 512). When max_waiting is exhausted, new pull requests are rejected with a “max waiting exceeded” error, causing client-side retry loops that add network overhead without delivering any messages.

The operational cost is real even though no data is at risk. Over-provisioned consumers waste compute resources — each consumer instance uses CPU, memory, and network connections while doing no useful work. In cloud environments, this translates directly to unnecessary cost. In latency-sensitive systems, the burst of pull requests when messages finally arrive can cause a thundering herd: hundreds of waiting consumers all receive messages simultaneously, creating a processing spike that strains downstream dependencies.

Beyond resource waste, a high num_waiting count can also point to a different problem. If messages should be flowing but aren’t, the waiting count is a symptom of an upstream publishing failure, a subject filter mismatch, or a stream configuration error. Operators who dismiss the high waiting count as “just consumers being eager” may miss the fact that the data pipeline is broken.

Common causes

  • Over-provisioned consumer instances. More consumer instances are running than the message rate requires. Common in auto-scaling environments where the consumer pool scaled up during a traffic spike and didn’t scale back down. Ten instances pulling from a consumer that receives one message per minute means nine instances are perpetually waiting.

  • Message rate dropped below consumer capacity. The stream’s publish rate decreased — perhaps the upstream producer slowed down, a batch job completed, or traffic naturally declined — but the consumer pool size wasn’t adjusted. The consumers keep polling, but there’s nothing to fetch.

  • Subject filter mismatch. The consumer has a subject filter that doesn’t match what’s actually being published. Messages are flowing into the stream on orders.us.> but the consumer is filtered to orders.eu.>. The consumer keeps pulling, the server keeps returning empty results, and num_waiting climbs.

  • Aggressive pull request batching with short expiry. Clients configured with very short pull request timeouts (e.g., a 1-second expiry) send frequent pull requests that stack up in the waiting queue. Even moderate consumer counts can push num_waiting high when each client sends 10+ pull requests per second.

  • Consumer configured on an inactive stream. The stream exists but hasn’t received messages in hours or days. The consumer was deployed in anticipation of traffic that hasn’t materialized, or the upstream publisher was decommissioned without cleaning up downstream consumers.

  • max_waiting set too high. A very large max_waiting value (e.g., 10,000) allows an unreasonable number of pull requests to queue. The default of 512 is already generous for most workloads. Setting it higher masks over-provisioning rather than addressing it.
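
The pull-stacking arithmetic behind the causes above follows Little's law: the number of requests parked server-side is roughly the request rate times how long each request waits before expiring. A minimal illustration — the function name and the numbers are ours, not part of any NATS API:

```python
# Rough steady-state estimate of num_waiting (a sketch; names are illustrative).
# By Little's law, requests outstanding ≈ arrival rate × time each request waits.
# With short expiries each expired pull is immediately replaced, so a client
# keeps roughly (pulls_per_second × expiry_seconds) requests queued.

def estimate_num_waiting(clients: int, pulls_per_second: float,
                         expiry_seconds: float) -> float:
    """Approximate steady-state pull requests queued server-side."""
    return clients * pulls_per_second * expiry_seconds

# 20 idle clients, each sending 10 one-second pulls per second:
print(estimate_num_waiting(20, 10, 1.0))     # 200 requests parked
# The same 20 clients long-polling with a 30-second expiry:
print(estimate_num_waiting(20, 1 / 30, 30))  # ≈ 20, about one per instance
```

Note how the same client count lands either near max_waiting or at a healthy one-request-per-instance baseline depending purely on expiry.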

How to diagnose

Check the consumer’s waiting count

nats consumer info ORDERS my-consumer --json | jq '{
num_waiting: .num_waiting,
num_pending: .num_pending,
num_ack_pending: .num_ack_pending,
config_max_waiting: .config.max_waiting
}'

If num_waiting is near max_waiting, pull requests are likely being rejected.

Confirm whether messages are flowing

# Check if the stream is receiving messages
nats stream info ORDERS --json | jq '{
messages: .state.messages,
last_seq: .state.last_seq,
first_ts: .state.first_ts,
last_ts: .state.last_ts
}'

If last_ts is old (hours or days ago), the stream isn’t receiving new messages — the consumer is waiting for nothing.

Check the subject filter

# Compare consumer filter against stream subjects
nats consumer info ORDERS my-consumer --json | jq '.config.filter_subject'
nats stream info ORDERS --json | jq '.config.subjects'

Ensure the consumer’s filter subject is a subset of or matches the stream’s configured subjects.

Monitor waiting count over time

watch -n 5 'nats consumer info ORDERS my-consumer --json | jq "{waiting: .num_waiting, pending: .num_pending, ack_pending: .num_ack_pending}"'

A stable high num_waiting with zero num_pending and zero num_ack_pending confirms the demand-supply mismatch.

Programmatic detection across consumers

Go

import (
    "fmt"

    "github.com/nats-io/nats.go"
)

func checkWaitingCritical(js nats.JetStreamContext, streamName string, threshold int) error {
    for consumer := range js.ConsumerNames(streamName) {
        info, err := js.ConsumerInfo(streamName, consumer)
        if err != nil {
            continue
        }
        if info.NumWaiting > threshold {
            pct := float64(info.NumWaiting) / float64(info.Config.MaxWaiting) * 100
            fmt.Printf("CRITICAL: stream=%s consumer=%s waiting=%d max=%d (%.1f%%) pending=%d\n",
                streamName, consumer, info.NumWaiting,
                info.Config.MaxWaiting, pct, info.NumPending)
        }
    }
    return nil
}
Python

import asyncio

import nats

async def check_waiting_critical(stream_name: str, threshold: int):
    nc = await nats.connect()
    js = nc.jetstream()

    async for consumer_name in js.consumer_names(stream_name):
        info = await js.consumer_info(stream_name, consumer_name)
        if info.num_waiting > threshold:
            pct = (info.num_waiting / info.config.max_waiting) * 100
            print(f"CRITICAL: stream={stream_name} consumer={consumer_name} "
                  f"waiting={info.num_waiting} max={info.config.max_waiting} "
                  f"({pct:.1f}%) pending={info.num_pending}")

    await nc.close()

asyncio.run(check_waiting_critical("ORDERS", 100))

How to fix it

Immediate: reduce consumer instances

If the root cause is over-provisioning, scale down the consumer pool:

# For Kubernetes deployments
kubectl scale deployment order-consumer --replicas=2

Match the consumer instance count to the actual message rate. A good heuristic: each instance should process at least one message per pull request cycle. If an instance is idle more than 80% of the time, you have too many instances.
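
That heuristic can be turned into rough pool bounds, assuming you know each instance's sustainable throughput. A sketch — `pool_bounds` and its numbers are illustrative, not a NATS utility:

```python
import math

# Rough sizing for a pull-consumer pool (illustrative, not a NATS utility).
# min bound: enough capacity to keep up with the stream's publish rate.
# max bound: past this, each instance is idle more than 80% of the time.

def pool_bounds(msgs_per_second: float, per_instance_capacity: float,
                min_utilization: float = 0.2) -> tuple:
    lo = max(1, math.ceil(msgs_per_second / per_instance_capacity))
    hi = max(lo, math.floor(msgs_per_second / (per_instance_capacity * min_utilization)))
    return (lo, hi)

# Stream publishes 2 msg/s; each instance can process 1 msg/s:
print(pool_bounds(2, 1))  # (2, 10): run at least 2 instances, no more than 10
```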

Fix the subject filter

If the consumer’s filter doesn’t match published subjects, update it:

nats consumer edit ORDERS my-consumer --filter "orders.us.>"

Or, if the consumer should receive all messages:

nats consumer edit ORDERS my-consumer --filter ""

Tune pull request behavior

Increase the pull request expiry. Longer expiry reduces the frequency of new pull requests, lowering the waiting count:

Go

// Instead of short, aggressive pulls
msgs, _ := sub.Fetch(10, nats.MaxWait(1*time.Second)) // creates many waiting requests

// Use longer poll intervals
msgs, _ := sub.Fetch(10, nats.MaxWait(30*time.Second)) // fewer waiting requests

Python

# Longer wait reduces request frequency
msgs = await sub.fetch(10, timeout=30)

Use heartbeat-based pulls. Modern NATS client libraries support idle heartbeats on pull requests, which keep a single long-lived pull request alive rather than creating many short-lived ones:

sub, _ := js.PullSubscribe("orders.>", "my-consumer")
msgs, _ := sub.Fetch(100,
    nats.MaxWait(60*time.Second),
    nats.PullHeartbeat(5*time.Second),
)

Adjust max_waiting

If the current max_waiting is unnecessarily high, reduce it to match your actual consumer count:

nats consumer edit ORDERS my-consumer --max-waiting 50

Set max_waiting to roughly 2x the expected number of concurrent consumer instances. This provides headroom for transient pull request overlap without allowing unbounded queue growth.
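
As a sketch of that rule of thumb (the helper name and the 25-instance figure are hypothetical):

```python
# Sizing max_waiting from the expected pool size (illustrative helper).
def recommended_max_waiting(expected_instances: int, headroom: float = 2.0,
                            server_default: int = 512) -> int:
    """Roughly 2x headroom over concurrent instances, staying at or below
    the server default unless there is a concrete reason to raise it."""
    return min(server_default, max(1, round(expected_instances * headroom)))

print(recommended_max_waiting(25))  # 50
```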

Remove unused consumers

If the consumer is no longer needed (upstream publisher decommissioned, workload migrated), remove it:

nats consumer rm ORDERS my-consumer -f

Synadia Insights flags consumers that combine high num_waiting with zero throughput over extended periods, helping you identify candidates for removal.

Frequently asked questions

Is num_waiting dangerous or just wasteful?

Primarily wasteful. High num_waiting doesn’t cause data loss or message corruption. The risks are resource waste (server memory, client compute), potential thundering herd when messages do arrive, and masking upstream problems. It’s an efficiency and operational clarity issue.

How do I set the threshold for this check?

Set the io.nats.monitor.waiting-critical metadata key on the stream or consumer configuration. Choose a value based on your expected consumer instance count: if you run 5 instances, a num_waiting of 50 means each instance has 10 queued pulls, which is likely excessive. A threshold of 2-3x your instance count is a reasonable starting point.

What happens when max_waiting is exceeded?

New pull requests are rejected by the server with a “max waiting requests exceeded” error. Well-behaved clients retry with backoff, but poorly configured clients may retry aggressively, creating a tight loop of rejected requests that wastes network and CPU. If you’re seeing this error, either reduce the number of consumer instances or increase max_waiting.
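
A well-behaved retry loop can be sketched as exponential backoff with full jitter, so a herd of rejected clients doesn't retry in lockstep. This is an illustrative schedule generator, not a NATS client API:

```python
import random

def backoff_schedule(base: float = 0.25, cap: float = 10.0, attempts: int = 6,
                     rng=None) -> list:
    """Delays for retrying rejected pull requests: the ceiling doubles each
    attempt up to `cap`, and each actual delay is drawn uniformly below the
    ceiling (full jitter) to spread retries out."""
    rng = rng or random.Random()
    delays = []
    ceiling = base
    for _ in range(attempts):
        delays.append(rng.uniform(0, ceiling))
        ceiling = min(cap, ceiling * 2)
    return delays

# Six retries with ceilings 0.25, 0.5, 1, 2, 4, 8 seconds:
print(backoff_schedule(rng=random.Random(42)))
```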

Does this check apply to push consumers?

No. Push consumers don’t use pull requests — the server pushes messages directly to the client. num_waiting is only relevant for pull consumers. Push consumers have different health indicators like num_ack_pending (CONSUMER_006) and num_pending (CONSUMER_008).

Can I set num_waiting to zero by having consumers only pull when they know messages exist?

In theory, yes — you could only pull after receiving an advisory or notification that messages are available. In practice, long-polling (pull with a reasonable timeout) is the standard pattern. A small num_waiting (1 per consumer instance) is normal and expected. The concern is when num_waiting grows far beyond the number of active instances.

Proactive monitoring for NATS waiting critical with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.
