
NATS Last Delivery Critical: Detecting Stalled Consumer Delivery

Severity: Critical
Category: Health
Applies to: Consumer
Check ID: CONSUMER_009
Detection threshold: Time since last delivery exceeds the operator-defined io.nats.monitor.last-delivery-critical threshold

A last delivery critical alert fires when a JetStream consumer has not delivered a message to any of its subscribers within the operator-defined time window. This threshold is set via the io.nats.monitor.last-delivery-critical metadata key on the consumer configuration. When the elapsed time since the last delivery exceeds this value, the consumer is considered stalled from a delivery perspective — it may be paused, disconnected, or blocked, even though the stream it reads from may still be actively receiving messages.

Why this matters

A consumer that stops delivering messages is a silent failure. Unlike a crash that generates log entries or a connection drop that triggers reconnect logic, a delivery stall can persist indefinitely without any obvious signal. Messages accumulate in the stream, the consumer’s pending count grows, and downstream systems operate on increasingly stale data — or no data at all.

The danger scales with how critical the consumer’s workload is. An order processing consumer that stops delivering means orders pile up unprocessed. A metrics aggregation consumer that stalls means dashboards go dark and alerting gaps appear. In event-driven architectures where consumers feed other consumers, a single delivery stall can cascade through the entire pipeline.

The last-delivery-critical threshold exists precisely for this scenario. By defining an acceptable delivery window — say, 5 minutes for a consumer that normally delivers every second — operators get an early warning that something has gone wrong before the business impact becomes severe. Without this check, delivery stalls are often discovered only when an end user reports missing data, which can be hours or days later.

In multi-tenant or large-scale deployments, manual monitoring of every consumer is impractical. This check automates the detection of delivery gaps across hundreds or thousands of consumers, ensuring that no stalled consumer goes unnoticed regardless of deployment size.

Common causes

  • Consumer is paused. JetStream consumers can be explicitly paused via the API or CLI. A paused consumer stops pulling messages from the stream entirely. This is sometimes intentional (maintenance windows, deployment coordination) but becomes a problem when the pause is forgotten or the unpause mechanism fails.

  • No active subscribers on a pull consumer. Pull consumers only deliver messages when a client issues a fetch or pull request. If all client instances have crashed, been scaled to zero, or disconnected, the consumer has no one to deliver to. The consumer itself is healthy — it simply has no subscribers asking for messages.

  • Stream has stopped receiving messages. If the upstream publishers have stopped publishing to the subjects this consumer is bound to, there are no new messages to deliver. This is a legitimate scenario but may indicate an upstream failure rather than a consumer issue.

  • Consumer quorum lost. In a replicated JetStream environment, the consumer’s Raft group may have lost quorum. Without a functioning leader, the consumer cannot process or deliver messages. This typically correlates with server outages or network partitions.

  • Backoff from repeated delivery failures. Some consumer configurations use backoff policies that progressively delay redelivery after repeated failures. If a consumer’s messages are consistently failing processing (being nacked or timing out), the backoff delay can grow large enough to exceed the critical threshold.

  • Filter subject no longer matches any messages. The consumer was created with a filter subject that previously matched incoming messages, but the publishing pattern has changed. The consumer is technically healthy and ready to deliver, but no messages match its filter.

  • Max ack pending limit reached. If the consumer has hit its max_ack_pending ceiling and all outstanding messages are awaiting acknowledgment, the server will not deliver additional messages until some are acked. If the processing pipeline is completely stalled, no new deliveries occur.
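Several of these causes can be distinguished directly from the consumer info JSON. A minimal triage sketch, assuming the field names returned by `nats consumer info -j` (the helper name and return strings are our own, not part of any NATS API):

```python
def classify_delivery_stall(info: dict) -> str:
    """Given consumer info (as from `nats consumer info -j`), guess the stall cause."""
    cfg = info.get("config", {})
    if info.get("paused"):
        return "consumer is paused"
    if cfg.get("max_ack_pending") and info.get("num_ack_pending", 0) >= cfg["max_ack_pending"]:
        return "max_ack_pending ceiling reached; deliveries blocked on outstanding acks"
    if info.get("num_waiting", 0) == 0 and info.get("num_pending", 0) > 0:
        return "messages pending but no pull requests waiting; check subscribers"
    if info.get("num_pending", 0) == 0:
        return "consumer is caught up; stream may have stopped receiving messages"
    return "cause unclear; check quorum, backoff, and filter subjects"

# Example: pull consumer with a backlog but no clients fetching
print(classify_delivery_stall(
    {"num_pending": 120, "num_waiting": 0, "num_ack_pending": 0,
     "config": {"max_ack_pending": 1000}}))
```

This is only a first-pass heuristic; quorum loss, backoff, and filter-subject mismatches are not visible in these counters and still require the diagnosis steps below.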

How to diagnose

Check consumer delivery status

Start by inspecting the consumer’s current state:

nats consumer info STREAM_NAME CONSUMER_NAME

Look at the Last Delivery timestamp in the output. Compare it to the current time. If the gap exceeds your configured threshold, the alert is confirmed. Also check Num Ack Pending, Num Pending, and whether the consumer shows as paused.
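To compare the gap mechanically rather than by eye, you can parse the `delivered.last_active` timestamp from the `-j` JSON output. A minimal sketch (the helper name is ours; the fractional-second truncation handles the server's nanosecond-precision RFC 3339 timestamps, which `datetime.fromisoformat` cannot parse directly):

```python
from datetime import datetime, timezone
from typing import Optional

def delivery_age_seconds(last_active: str, now: Optional[datetime] = None) -> float:
    """Age of the last delivery, given an RFC 3339 timestamp like
    '2024-05-01T12:00:00.123456789Z' (fraction truncated to microseconds)."""
    ts = last_active.rstrip("Z")
    if "." in ts:
        head, frac = ts.split(".")
        ts = f"{head}.{frac[:6]}"
    last = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - last).total_seconds()

now = datetime(2024, 5, 1, 12, 10, 0, tzinfo=timezone.utc)
print(delivery_age_seconds("2024-05-01T12:00:00.123456789Z", now))  # 599.876544
```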

Inspect the consumer metadata threshold

Verify what threshold is configured:

nats consumer info STREAM_NAME CONSUMER_NAME -j | jq '.config.metadata'

The io.nats.monitor.last-delivery-critical value should be a Go duration string like 5m or 1h. If the threshold is too tight for the consumer’s normal delivery cadence, you may be seeing false positives.
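If you script around this metadata, note that Go duration strings ("300ms", "5m", "1h30m") are not natively parsed by most non-Go tooling. A small parser covering the common units (a sketch, not a full implementation of Go's time.ParseDuration):

```python
import re

_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}

def parse_go_duration(d: str) -> float:
    """Convert a Go duration string such as '5m' or '1h30m' to seconds."""
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", d)
    # Reject strings with leftover characters the pattern did not consume
    if not parts or "".join(n + u for n, u in parts) != d:
        raise ValueError(f"unsupported duration: {d!r}")
    return sum(float(n) * _UNITS[u] for n, u in parts)

print(parse_go_duration("5m"))     # 300.0
print(parse_go_duration("1h30m"))  # 5400.0
```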

Check for active subscribers

For pull consumers, verify that clients are actively fetching:

nats consumer info STREAM_NAME CONSUMER_NAME -j | jq '.num_waiting'

A num_waiting of zero on a pull consumer means no clients are currently waiting for messages. This is the most common cause of delivery stalls on pull consumers.

Check if the consumer is paused

nats consumer info STREAM_NAME CONSUMER_NAME -j | jq '.paused'

If the consumer is paused, delivery is intentionally halted. Verify whether this is expected.

Verify stream is receiving messages

nats stream info STREAM_NAME

Check the Last Message timestamp. If the stream itself has not received new messages recently, the consumer has nothing to deliver. Investigate the publisher side.

Programmatic diagnosis

// Go: check last delivery age (assumes an established *nats.Conn "nc";
// needs "fmt", "log", "time", and "github.com/nats-io/nats.go")
js, err := nc.JetStream()
if err != nil {
	log.Fatal(err)
}
ci, err := js.ConsumerInfo("STREAM_NAME", "CONSUMER_NAME")
if err != nil {
	log.Fatal(err)
}

// Delivered.Last is a *time.Time and is nil before the first delivery.
if ci.Delivered.Last != nil {
	lastDelivery := time.Since(*ci.Delivered.Last)
	threshold := 5 * time.Minute

	if lastDelivery > threshold {
		fmt.Printf("CRITICAL: consumer %s last delivery was %s ago\n",
			ci.Name, lastDelivery.Round(time.Second))
		fmt.Printf("  Num Pending: %d\n", ci.NumPending)
		fmt.Printf("  Num Ack Pending: %d\n", ci.NumAckPending)
		fmt.Printf("  Num Waiting: %d\n", ci.NumWaiting)
	}
}

# Python: check last delivery age (inside an async function)
# Depending on the nats-py version, the ConsumerInfo dataclass may not expose
# delivered.last_active, so query the JetStream API subject directly and read
# the timestamp from the raw JSON response.
import json
from datetime import datetime, timezone

import nats

nc = await nats.connect()

resp = await nc.request("$JS.API.CONSUMER.INFO.STREAM_NAME.CONSUMER_NAME")
info = json.loads(resp.data)

last_active = info["delivered"].get("last_active")
if last_active is not None:
    # RFC 3339 with nanoseconds; trim to microseconds for fromisoformat
    ts = last_active.rstrip("Z")
    if "." in ts:
        head, frac = ts.split(".")
        ts = f"{head}.{frac[:6]}"
    last_delivery = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    age = (datetime.now(timezone.utc) - last_delivery).total_seconds()

    if age > 300:  # 5 minutes
        print(f"CRITICAL: consumer {info['name']} last delivery was {age:.0f}s ago")
        print(f"  Num Pending: {info['num_pending']}")
        print(f"  Num Ack Pending: {info['num_ack_pending']}")
        print(f"  Num Waiting: {info['num_waiting']}")

How to fix it

Immediate: restore delivery

If the consumer is paused, resume it:

nats consumer resume STREAM_NAME CONSUMER_NAME

If no subscribers are connected to a pull consumer, restart or scale up the consuming application. The consumer cannot deliver if nobody is fetching. Check your deployment orchestrator (Kubernetes, systemd, etc.) for crashed or scaled-down instances.

If max ack pending is blocking delivery, investigate why existing messages aren’t being acknowledged. The processing pipeline downstream may be stalled:

# Check outstanding acks
nats consumer info STREAM_NAME CONSUMER_NAME -j | jq '.num_ack_pending'

If messages are stuck, you may need to restart the consumer application or — as a last resort — nack the pending messages to trigger redelivery.

Short-term: prevent recurrence

Ensure consumer applications have health checks and auto-restart logic. For Kubernetes deployments, use liveness probes that verify the application is actively fetching from the consumer. A process that’s running but not pulling messages is effectively dead.
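One portable way to build such a probe is a heartbeat file: the consuming application touches a file after every successful fetch, and the liveness check fails once the file goes stale. A hedged sketch (the file path and staleness limit are illustrative, not a standard convention):

```python
import time
from pathlib import Path

HEARTBEAT = Path("/tmp/consumer-heartbeat")  # illustrative path

def record_fetch() -> None:
    """Call after every successful pull/fetch to refresh the heartbeat."""
    HEARTBEAT.touch()

def is_alive(max_stale_seconds: float = 60.0) -> bool:
    """Liveness probe: True if the last fetch happened recently enough."""
    try:
        return (time.time() - HEARTBEAT.stat().st_mtime) <= max_stale_seconds
    except FileNotFoundError:
        return False

record_fetch()
print(is_alive())  # True right after a fetch
```

In Kubernetes, the liveness probe would simply exec a script that calls `is_alive()` (or checks the file's mtime with `find -mmin`) and exits non-zero when stale.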

Set appropriate max_ack_pending limits. If the limit is too low relative to processing throughput, the consumer stalls under normal load. If it’s too high, a processing failure blocks a large number of messages. Tune based on your observed processing rate and acceptable latency.
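As a rough sizing rule, max_ack_pending should cover the number of messages in flight at steady state (Little's law: rate x processing latency), plus headroom. A sketch; the 2x headroom factor is an illustrative assumption, not a NATS recommendation:

```python
def suggest_max_ack_pending(msgs_per_second: float,
                            avg_processing_seconds: float,
                            headroom: float = 2.0) -> int:
    """In-flight messages at steady state (Little's law), times a safety factor."""
    return max(1, int(msgs_per_second * avg_processing_seconds * headroom))

# 500 msg/s with 50 ms average processing -> 25 in flight -> 50 with headroom
print(suggest_max_ack_pending(500, 0.05))  # 50
```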

Adjust the threshold if it’s too aggressive. If the consumer legitimately has quiet periods where no messages arrive, the last-delivery-critical threshold needs to account for that. Set it to a value that reflects genuine anomaly rather than normal traffic patterns.

Long-term: build observability

Use Synadia Insights for automated monitoring. Insights evaluates the last-delivery-critical threshold across all consumers in your deployment every collection epoch, removing the need to manually configure per-consumer alerting rules.

Implement consumer heartbeat patterns. For consumers with irregular message flow, publish periodic heartbeat messages to the stream’s subjects. This ensures the consumer always has something to deliver, making delivery stalls unambiguous indicators of a real problem.

Frequently asked questions

How do I set the last-delivery-critical threshold?

Set it as consumer metadata when creating or updating the consumer:

nats consumer add STREAM_NAME CONSUMER_NAME \
--metadata "io.nats.monitor.last-delivery-critical=5m"

The value is a Go duration string. Choose a threshold that’s comfortably above your normal maximum inter-message gap but short enough to catch real problems quickly.
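One way to pick such a value is from observed delivery timestamps: take the largest inter-message gap over a representative window and multiply by a safety factor. A sketch (the helper name and the 3x factor are illustrative choices):

```python
def suggest_threshold_seconds(delivery_times: list, safety_factor: float = 3.0) -> float:
    """Largest observed gap between consecutive deliveries, times a safety factor.

    delivery_times: delivery timestamps in seconds (any common epoch).
    """
    if len(delivery_times) < 2:
        raise ValueError("need at least two delivery timestamps")
    ts = sorted(delivery_times)
    max_gap = max(b - a for a, b in zip(ts, ts[1:]))
    return max_gap * safety_factor

# Deliveries at t=0, 10, 70, 75 seconds: max gap is 60s -> suggest 180s
print(suggest_threshold_seconds([0, 10, 70, 75]))  # 180.0
```

Round the result up to a tidy Go duration string (here, 3m) before setting the metadata.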

What if the stream simply has no new messages?

This is the most common false positive. If the stream’s publishing pattern includes long quiet periods, either increase the threshold to accommodate them or use heartbeat messages to maintain a baseline delivery cadence. Alternatively, pair this check with stream activity monitoring — if the stream itself is inactive, the consumer’s delivery stall is expected.
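Pairing the two signals is straightforward to script: treat the consumer's stall as actionable only when the stream itself has received messages more recently than the consumer last delivered them. A sketch over ages in seconds (the helper name is hypothetical):

```python
def is_actionable_stall(consumer_delivery_age: float,
                        stream_last_msg_age: float,
                        threshold: float) -> bool:
    """Fire only when the consumer is stalled AND the stream is still active."""
    if consumer_delivery_age <= threshold:
        return False  # consumer delivered recently enough
    # If the stream is just as quiet, the stall is expected; look at publishers instead.
    return stream_last_msg_age < consumer_delivery_age

print(is_actionable_stall(600, 30, 300))   # True: stream active, consumer stalled
print(is_actionable_stall(600, 600, 300))  # False: stream is quiet too
```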

Does this check apply to push consumers?

Yes. Push consumers deliver messages to subscribers as they arrive. If the push consumer’s subscriber disconnects or the consumer loses its leader, delivery stops. The threshold applies equally to push and pull consumers.

What’s the difference between last delivery critical and inactive consumer?

The inactive consumer check (CONSUMER_003) flags consumers with no activity over a broader time horizon, often indicating the consumer should be cleaned up. Last delivery critical is a tighter operational check — the consumer is expected to be active, and the delivery gap indicates a problem that needs immediate attention.

Can a consumer show last delivery critical if it’s caught up?

Yes. If the consumer has processed all available messages and is waiting for new ones, the last delivery timestamp reflects when the last message was delivered. If no new messages arrive for longer than the threshold, the check fires. This is why the threshold must account for the stream’s normal publishing cadence.

Proactive monitoring for NATS last delivery critical with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial