
NATS Unprocessed Critical: Causes and Remediation

Severity: Critical
Category: Health
Applies to: Consumer
Check ID: CONSUMER_008
Detection threshold: num_pending exceeds the operator-defined io.nats.monitor.unprocessed-critical threshold

Unprocessed critical means a JetStream consumer has more undelivered messages waiting in its backlog than the operator-defined threshold. The num_pending metric represents messages in the stream that match the consumer’s filter but haven’t been delivered yet — the consumer’s to-do list. When this count crosses the critical threshold, the consumer is falling behind and the gap between published messages and processed messages is growing.

Why this matters

A growing num_pending backlog is the most direct indicator that a consumer can’t keep up with its workload. Unlike num_ack_pending (messages delivered but unacknowledged) or num_waiting (idle pull requests), num_pending measures the fundamental throughput deficit: messages are arriving faster than the consumer processes them.
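
As a minimal illustration, here is a sketch that reads all three gauges for one consumer. It assumes an existing nats.JetStreamContext named js; the stream and consumer names are placeholders:

info, err := js.ConsumerInfo("ORDERS", "my-consumer")
if err != nil {
    log.Fatal(err)
}
fmt.Printf("num_pending=%d (undelivered backlog)\n", info.NumPending)
fmt.Printf("num_ack_pending=%d (delivered, awaiting ack)\n", info.NumAckPending)
fmt.Printf("num_waiting=%d (idle pull requests)\n", info.NumWaiting)
// Total unprocessed work is the sum of the first two gauges.
fmt.Printf("total unprocessed=%d\n", info.NumPending+uint64(info.NumAckPending))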

The consequences depend on the stream’s retention configuration. With a limits-based retention policy and max_msgs or max_bytes constraints, the oldest messages in the stream are evicted as new ones arrive. If the consumer’s backlog grows large enough, unprocessed messages may be evicted before the consumer ever delivers them — permanent, silent data loss. The consumer’s num_pending count drops (because the messages no longer exist), which can paradoxically make the situation look like it’s improving when messages are actually being discarded.
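
Whether this failure mode applies depends on the stream's configuration. A quick sketch under the same assumptions (an existing js context, a placeholder stream name) checks for a limits retention policy with caps set:

si, err := js.StreamInfo("ORDERS")
if err != nil {
    log.Fatal(err)
}
cfg := si.Config
if cfg.Retention == nats.LimitsPolicy && (cfg.MaxMsgs > 0 || cfg.MaxBytes > 0 || cfg.MaxAge > 0) {
    // A large enough backlog can be evicted before delivery: silent data loss.
    fmt.Println("limits retention with caps set: backlog growth risks eviction of unread messages")
}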

Even without retention limits, a large backlog creates operational pressure. Downstream systems that depend on timely processing — dashboards, search indexes, notification pipelines, billing systems — become increasingly stale. The age of the oldest unprocessed message grows from seconds to minutes to hours. In some systems, messages older than a certain age are worthless (real-time pricing, fraud detection, session tracking), so a backlog effectively equals data loss even if the messages technically still exist.

At critical levels, the backlog also affects recovery time. If a consumer falls 10 million messages behind, even processing at 10,000 msg/s takes over 16 minutes to catch up — assuming no new messages arrive during recovery. With continued inflow, catch-up takes even longer or may never complete if the processing rate is only marginally above the publish rate.
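
The arithmetic generalizes: with a sustained processing rate p and a publish rate a (both msg/s), a backlog of B messages drains in B / (p - a) seconds, and never drains when p <= a. A minimal sketch with illustrative figures:

// Illustrative numbers; measure your own rates before relying on the estimate.
backlog := 10_000_000.0 // messages behind
processRate := 10_000.0 // msg/s the consumer sustains
publishRate := 0.0      // msg/s still arriving during recovery
if processRate <= publishRate {
    fmt.Println("backlog will never drain at current rates")
} else {
    secs := backlog / (processRate - publishRate)
    fmt.Printf("estimated catch-up: %v\n", time.Duration(secs*float64(time.Second)))
}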

Common causes

  • Consumer processing too slow. The most common cause. The consumer’s message processing throughput is lower than the stream’s inbound message rate. Each second, the gap between published and processed messages widens. This can be due to slow downstream dependencies, CPU-bound processing, or blocking I/O in the message handler. The triage sketch after this list shows how to separate this case from a stalled or back-pressured consumer.

  • Consumer instances disconnected or crashed. One or more consumer instances stopped processing — a crash, a deployment rollback, an OOM kill, or a network partition. The remaining instances (if any) can’t absorb the full load, so the backlog grows. If all instances are down, num_pending grows at the full publish rate.

  • Traffic spike without consumer scaling. The publish rate increased significantly — a batch import, a marketing event, a partner integration going live — but the consumer pool wasn’t scaled up to match. The consumer was sized for steady-state traffic and can’t handle the burst.

  • Consumer back-pressured by max_ack_pending. The consumer hit its max_ack_pending limit, so the server stopped delivering new messages. Messages continue flowing into the stream but can’t be delivered, causing num_pending to grow. The root cause is downstream (see CONSUMER_006), but the visible symptom is growing num_pending.

  • Consumer paused or deliberately throttled. A pull consumer that isn’t actively pulling, or any consumer paused with a pause_until timestamp in the future, won’t deliver messages. The backlog accumulates until the consumer resumes.

  • Subject filter too broad. The consumer’s subject filter matches more subjects than intended, pulling in messages from high-volume subjects that weren’t designed for this consumer’s processing pipeline. The aggregate rate exceeds what the consumer was provisioned to handle.
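
A rough triage sketch, assuming an existing nats.JetStreamContext named js and placeholder names: it samples consumer info twice and distinguishes a stalled consumer from one throttled by max_ack_pending or simply too slow.

a, err := js.ConsumerInfo("ORDERS", "my-consumer")
if err != nil {
    log.Fatal(err)
}
time.Sleep(10 * time.Second)
b, err := js.ConsumerInfo("ORDERS", "my-consumer")
if err != nil {
    log.Fatal(err)
}

switch {
case b.Config.MaxAckPending > 0 && b.NumAckPending >= b.Config.MaxAckPending:
    fmt.Println("back-pressured: at max_ack_pending (see CONSUMER_006)")
case b.Delivered.Stream == a.Delivered.Stream:
    fmt.Println("stalled: nothing delivered in 10s; check instances, pulls, pause state")
case b.NumPending > a.NumPending:
    fmt.Println("slow: delivering, but the backlog is still growing")
default:
    fmt.Println("keeping up or catching up")
}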

How to diagnose

Check the consumer’s pending count

Terminal window
# Read the consumer's delivered position; the stream's last sequence lives on
# stream info, not consumer info — fetch it separately.
nats consumer info ORDERS my-consumer --json | jq '{
  num_pending: .num_pending,
  num_ack_pending: .num_ack_pending,
  num_waiting: .num_waiting,
  num_redelivered: .num_redelivered,
  last_delivered: .delivered.stream_seq
}'
nats stream info ORDERS --json | jq '{stream_last_seq: .state.last_seq}'

Determine if the backlog is growing or stable

Terminal window
# Sample num_pending twice, 30 seconds apart
P1=$(nats consumer info ORDERS my-consumer --json | jq '.num_pending')
sleep 30
P2=$(nats consumer info ORDERS my-consumer --json | jq '.num_pending')
RATE=$(( (P2 - P1) * 2 )) # messages per minute growth rate
echo "Pending: $P1$P2 (growth rate: $RATE msg/min)"

If the growth rate is positive, the consumer is falling further behind. If it’s zero, the consumer is keeping up but has a historical backlog. If it’s negative, the consumer is catching up.

Check whether the consumer is processing at all

Terminal window
# Compare delivered sequence over time
D1=$(nats consumer info ORDERS my-consumer --json | jq '.delivered.stream_seq')
sleep 10
D2=$(nats consumer info ORDERS my-consumer --json | jq '.delivered.stream_seq')
echo "Delivered: $D1$D2 (throughput: $(( (D2 - D1) * 6 )) msg/min)"

If the delivered sequence isn’t advancing, the consumer is stalled — not just slow. Check for max_ack_pending saturation, disconnected instances, or a paused consumer.

Check if retention is evicting unprocessed messages

Terminal window
# Compare the consumer's ack floor against the stream's first sequence
nats consumer info ORDERS my-consumer --json | jq '{
  ack_floor_stream_seq: .ack_floor.stream_seq
}'
nats stream info ORDERS --json | jq '{
  first_seq: .state.first_seq
}'

If first_seq is advancing toward (or past) the consumer’s ack_floor_stream_seq, retention is evicting messages the consumer hasn’t processed — data loss is occurring.
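
The same comparison can be automated. A sketch under the same assumptions as above; the count is an upper bound, since messages above the ack floor may have been acked out of order or may not match the consumer's filter:

ci, err := js.ConsumerInfo("ORDERS", "my-consumer")
if err != nil {
    log.Fatal(err)
}
si, err := js.StreamInfo("ORDERS")
if err != nil {
    log.Fatal(err)
}
// Sequences between the ack floor and the stream's first sequence no longer exist.
if si.State.FirstSeq > ci.AckFloor.Stream+1 {
    fmt.Printf("possible data loss: up to %d unacked messages already evicted\n",
        si.State.FirstSeq-ci.AckFloor.Stream-1)
}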

Programmatic detection

import (
    "fmt"

    "github.com/nats-io/nats.go"
)

func checkUnprocessedCritical(js nats.JetStreamContext, streamName string, threshold uint64) error {
    for consumer := range js.ConsumerNames(streamName) {
        info, err := js.ConsumerInfo(streamName, consumer)
        if err != nil {
            continue
        }
        if info.NumPending > threshold {
            fmt.Printf("CRITICAL: stream=%s consumer=%s pending=%d threshold=%d ack_pending=%d\n",
                streamName, consumer, info.NumPending, threshold, info.NumAckPending)
        }
    }
    return nil
}
import asyncio
import nats

async def check_unprocessed_critical(stream_name: str, threshold: int):
    nc = await nats.connect()
    js = nc.jetstream()

    for info in await js.consumers_info(stream_name):
        if info.num_pending > threshold:
            print(f"CRITICAL: stream={stream_name} consumer={info.name} "
                  f"pending={info.num_pending} threshold={threshold} "
                  f"ack_pending={info.num_ack_pending}")

    await nc.close()

asyncio.run(check_unprocessed_critical("ORDERS", 100000))

How to fix it

Immediate: scale the consumer pool

Add more consumer instances to increase aggregate processing throughput:

Terminal window
# Kubernetes
kubectl scale deployment order-consumer --replicas=10
# Or drain a batch manually with the CLI, which pulls and acks messages
nats consumer next ORDERS my-consumer --count 100

For pull consumers, multiple clients can pull from the same consumer simultaneously. Each pull request receives a distinct batch of messages, so adding instances raises aggregate throughput roughly linearly until a downstream bottleneck dominates.

Unblock a back-pressured consumer

If num_ack_pending is at max_ack_pending, the consumer is throttled. Fix the ack bottleneck first (see CONSUMER_006), then num_pending will start decreasing:

Terminal window
# Check if the consumer is at max_ack_pending
nats consumer info ORDERS my-consumer --json | jq '{
  num_ack_pending: .num_ack_pending,
  max_ack_pending: .config.max_ack_pending
}'
# Increase ack_wait if messages are timing out
nats consumer edit ORDERS my-consumer --wait 120s

Optimize processing throughput

Batch processing. Instead of processing one message at a time, fetch and process batches:

sub, _ := js.PullSubscribe("orders.>", "my-consumer")
for {
    msgs, err := sub.Fetch(500, nats.MaxWait(5*time.Second))
    if err != nil {
        continue // typically a timeout when no messages are waiting
    }
    // Process the batch as a unit, e.g. a bulk database insert
    batch := make([]Order, 0, len(msgs))
    for _, msg := range msgs {
        var order Order
        json.Unmarshal(msg.Data, &order) // error handling elided for brevity
        batch = append(batch, order)
    }
    db.BulkInsert(batch)
    for _, msg := range msgs {
        msg.Ack()
    }
}
async def batch_processor(js):
    sub = await js.pull_subscribe("orders.>", "my-consumer")
    while True:
        try:
            msgs = await sub.fetch(500, timeout=5)
        except nats.errors.TimeoutError:
            continue  # no messages waiting; poll again
        # Bulk process, e.g. a single multi-row insert
        records = [json.loads(msg.data) for msg in msgs]
        await db.bulk_insert(records)
        for msg in msgs:
            await msg.ack()

Parallelize within a single instance. Use concurrent workers to process messages from the same pull subscription:

sub, _ := js.PullSubscribe("orders.>", "my-consumer")
sem := make(chan struct{}, 20) // 20 concurrent workers

for {
    msgs, _ := sub.Fetch(20, nats.MaxWait(5*time.Second))
    for _, msg := range msgs {
        sem <- struct{}{} // block until a worker slot frees up
        go func(m *nats.Msg) {
            defer func() { <-sem }()
            processMessage(m)
            m.Ack()
        }(msg)
    }
}

Prevent data loss from retention eviction

If the stream has retention limits that are evicting unprocessed messages, temporarily relax the limits while the consumer catches up:

Terminal window
# Increase max_msgs temporarily
nats stream edit ORDERS --max-msgs 50000000
# Or increase max_age
nats stream edit ORDERS --max-age 72h

Restore the original limits after the consumer has caught up.

Monitor continuously

Synadia Insights evaluates CONSUMER_008 against your configured io.nats.monitor.unprocessed-critical threshold, alerting when the backlog crosses the critical level. This catches growing backlogs early — before retention eviction causes data loss and before downstream staleness becomes a user-visible problem.

Frequently asked questions

How do I set the right threshold for unprocessed-critical?

Base it on your acceptable processing latency. If messages must be processed within 5 minutes and your publish rate is 1,000 msg/s, the threshold should be no more than 300,000 (5 minutes × 1,000 msg/s). For less time-sensitive workloads, you can tolerate higher backlogs. Set the threshold via the io.nats.monitor.unprocessed-critical metadata key.
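
A sketch of that derivation, measuring the publish rate from two stream info samples (an existing js context and a placeholder stream name are assumed):

si1, err := js.StreamInfo("ORDERS")
if err != nil {
    log.Fatal(err)
}
time.Sleep(60 * time.Second)
si2, err := js.StreamInfo("ORDERS")
if err != nil {
    log.Fatal(err)
}
publishRate := float64(si2.State.LastSeq-si1.State.LastSeq) / 60.0 // msg/s
latencyBudget := 5 * time.Minute                                   // acceptable processing delay
threshold := uint64(publishRate * latencyBudget.Seconds())
fmt.Printf("suggested unprocessed-critical threshold: %d\n", threshold)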

What if the backlog is intentional?

Some workloads are designed for batch processing — accumulate messages during the day, process them overnight. In these cases, a high num_pending during accumulation is expected. Set the threshold to trigger only when the backlog exceeds what the batch window can process, or disable the check during accumulation hours.

Does num_pending include messages being redelivered?

No. num_pending counts messages that haven’t been delivered to the consumer at all. Messages currently in-flight (delivered but unacknowledged) are counted in num_ack_pending. A message being redelivered was already counted in num_ack_pending, not num_pending. The two metrics are complementary: total unprocessed work = num_pending + num_ack_pending.

Can I skip the backlog and only process new messages?

Yes, but deliver policy can’t be changed on an existing consumer, so delete it and recreate it with a deliver policy of new (re-specify the consumer’s other settings when recreating):

Terminal window
nats consumer rm ORDERS my-consumer
nats consumer add ORDERS my-consumer --deliver new

This abandons all unprocessed messages and starts fresh from the next published message. Only do this if the backlogged messages are no longer valuable. There’s no undo — skipped messages won’t be delivered to this consumer.

How is this different from CONSUMER_006 (Outstanding Ack Critical)?

CONSUMER_006 alerts on messages that were delivered but not acknowledged — they’re being processed (or failing). CONSUMER_008 alerts on messages that haven’t been delivered at all — they’re waiting in the queue. A healthy consumer has low values for both. High num_pending with low num_ack_pending means the consumer isn’t pulling fast enough. High num_ack_pending with high num_pending means the consumer is both slow to process and falling behind.

Proactive monitoring for NATS unprocessed critical with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial