NATS Redelivery Critical: Diagnosing Excessive Message Redelivery

Severity: Critical
Category: Errors
Applies to: Consumer
Check ID: CONSUMER_011
Detection threshold: num_redelivered exceeds the operator-defined io.nats.monitor.redelivery-critical threshold

A redelivery critical alert fires when a JetStream consumer’s num_redelivered counter exceeds the operator-defined threshold set via the io.nats.monitor.redelivery-critical metadata key. This counter tracks how many messages have been delivered more than once — meaning the original delivery was not acknowledged within the ack_wait window, or the consumer explicitly nacked the message. A high redelivery count signals that the consumer is failing to process messages reliably, wasting server resources on repeated delivery attempts and potentially causing duplicate processing in downstream systems.

Why this matters

Every redelivered message represents wasted work. The server must re-read the message from storage, route it through the consumer’s delivery path, and track it in the ack pending set — all for a message that was already delivered at least once. At scale, excessive redelivery can consume a significant fraction of server I/O and network bandwidth, degrading performance for all consumers on the same stream.

But the resource waste is secondary to the correctness problem. Unless your processing pipeline is fully idempotent, redelivered messages risk being processed multiple times. An order might be charged twice. A notification might be sent repeatedly. A state machine might process the same transition multiple times, leaving the system in an inconsistent state. Even systems designed for at-least-once delivery assume that redelivery is an exceptional event, not the common path.

Redelivery also creates a feedback loop. Each redelivered message occupies a slot in the max_ack_pending window. If redeliveries accumulate faster than they’re resolved, the effective throughput of the consumer drops — fewer slots are available for new messages because redelivered messages keep cycling through the pending set. In severe cases, the consumer spends all its capacity redelivering the same batch of failing messages while new messages queue up in the stream untouched.

The redelivery-critical threshold lets operators define what constitutes an unacceptable redelivery level for each consumer. A consumer processing financial transactions might set a threshold of 10. A best-effort metrics consumer might tolerate 1,000. The threshold is operator-defined because only the operator knows the impact of redelivery for their specific workload.

Common causes

  • Persistent processing failures. The most common cause. The consumer receives a message, attempts to process it, encounters an error (invalid data, downstream service error, resource exhaustion), and either nacks the message or lets the ack_wait expire. The server redelivers, the same error occurs, and the cycle repeats.

  • ack_wait too short for processing time. The consumer successfully processes messages, but processing takes longer than the configured ack_wait. The server assumes the message was lost and redelivers it before the consumer finishes. The consumer then acks the original delivery (which the server may ignore as stale) and receives a duplicate.

  • Consumer application restarts during processing. When a consumer disconnects mid-processing — due to a crash, deployment, or scaling event — all unacknowledged messages in its pending set are redelivered to other subscribers (or to the same subscriber on reconnect). Frequent restarts during active processing generate a steady stream of redeliveries.

  • Poison messages. A message that consistently crashes or errors the consumer on every processing attempt. Each delivery fails, triggers a redelivery, and fails again. Without a max_deliver limit, the poison message cycles indefinitely, consuming consumer capacity and inflating the redelivery counter.

  • Slow consumer causing ack_wait timeouts. The consumer is processing messages but too slowly. By the time it finishes one batch, the next batch has already exceeded ack_wait and been redelivered. The consumer processes both the original and the redelivery, doubling its workload and falling further behind.

  • Client library reconnection behavior. Some client library versions or configurations may not properly track in-flight messages across reconnections, leading to messages being redelivered unnecessarily after a brief network disruption.

How to diagnose

Check the redelivery count

Terminal window
nats consumer info STREAM_NAME CONSUMER_NAME

Look at Redelivered Messages in the output. Compare this to the total delivered count to understand the redelivery ratio. A ratio above 10% usually indicates a systemic issue.

Monitor redelivery rate over time

Terminal window
# Watch the consumer stats, refreshing every 5 seconds
watch -n 5 'nats consumer info STREAM_NAME CONSUMER_NAME -j | \
jq "{delivered: .delivered.consumer_seq, redelivered: .num_redelivered, ack_pending: .num_ack_pending}"'

If num_redelivered is climbing while delivered is relatively flat, the consumer is mostly processing redeliveries rather than new messages.

Identify poison messages

Check if specific messages are being redelivered repeatedly. If your consumer application logs include the stream sequence number on processing failures, look for sequences that appear multiple times:

Terminal window
# Check the consumer's pending messages
nats consumer next STREAM_NAME CONSUMER_NAME --count 1 --no-ack

Inspect msg.Metadata().NumDelivered on each fetched JetStream message (the count is parsed from the ack reply subject; it is not exposed as a header). A value significantly above 1 indicates the message has been redelivered multiple times.

Compare ack_wait with processing time

Terminal window
nats consumer info STREAM_NAME CONSUMER_NAME -j | jq '.config.ack_wait'

If your application’s p99 processing time approaches or exceeds this value, ack_wait timeouts are likely driving redeliveries.

Programmatic diagnosis

// Go: check redelivery rate and identify trends
js, _ := nc.JetStream()
ci, err := js.ConsumerInfo("STREAM_NAME", "CONSUMER_NAME")
if err != nil {
	log.Fatal(err)
}

// Guard against division by zero for a consumer that has delivered nothing yet
redeliveryRatio := 0.0
if ci.Delivered.Consumer > 0 {
	redeliveryRatio = float64(ci.NumRedelivered) / float64(ci.Delivered.Consumer)
}
fmt.Printf("Consumer: %s\n", ci.Name)
fmt.Printf("  Total Delivered: %d\n", ci.Delivered.Consumer)
fmt.Printf("  Redelivered: %d (%.1f%%)\n", ci.NumRedelivered, redeliveryRatio*100)
fmt.Printf("  Ack Pending: %d\n", ci.NumAckPending)
fmt.Printf("  Ack Wait: %s\n", ci.Config.AckWait)

if ci.NumRedelivered > 1000 {
	fmt.Println("  ⚠ CRITICAL: redelivery count exceeds threshold")
}
# Python: check redelivery stats
import nats

nc = await nats.connect()
js = nc.jetstream()

ci = await js.consumer_info("STREAM_NAME", "CONSUMER_NAME")
total = ci.delivered.consumer_seq
redelivered = ci.num_redelivered
ratio = (redelivered / total * 100) if total > 0 else 0

print(f"Consumer: {ci.name}")
print(f"  Total Delivered: {total}")
print(f"  Redelivered: {redelivered} ({ratio:.1f}%)")
print(f"  Ack Pending: {ci.num_ack_pending}")

if redelivered > 1000:
    print("  ⚠ CRITICAL: redelivery count exceeds threshold")

How to fix it

Immediate: stop the redelivery spiral

Set max_deliver to limit redelivery attempts. Without a limit, poison messages cycle forever. Setting max_deliver caps the number of delivery attempts per message:

Terminal window
nats consumer edit STREAM_NAME CONSUMER_NAME --max-deliver 5

Messages that exceed max_deliver are dropped from the consumer’s perspective. Pair this with a dead-letter strategy (see below) to avoid losing data.

Increase ack_wait if processing time is the bottleneck:

Terminal window
nats consumer edit STREAM_NAME CONSUMER_NAME --wait 5m

Use in-progress signals for long-running processing to prevent ack_wait timeouts without increasing the global timeout:

// Reset the ack_wait timer periodically while processing runs
done := make(chan struct{})
go func() {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			msg.InProgress() // tell the server we are still working
		case <-done:
			return
		}
	}
}()
// ... long-running processing ...
close(done) // stop the in-progress signals before acking

Short-term: fix processing failures

Fix the root cause of nacks and processing errors. Examine your application logs for the errors that trigger nacks. Common fixes include:

  • Handling malformed messages gracefully (parse, log, msg.Term() instead of nack)
  • Adding retry logic with backoff for transient downstream failures
  • Increasing resource limits (connection pool size, memory, file descriptors)

Implement a dead-letter pattern for poison messages. When a message exceeds a delivery threshold, route it to a separate stream for manual inspection rather than continuing to redeliver:

// nats.ManualAck() is required: push subscriptions with a callback
// auto-ack by default, which would defeat the Term/Nak handling below
sub, _ := js.Subscribe("orders.>", func(msg *nats.Msg) {
	md, _ := msg.Metadata()
	deliveryCount := int(md.NumDelivered)

	if deliveryCount > 3 {
		// Dead-letter: publish to DLQ stream and terminate
		nc.Publish("dlq.orders", msg.Data)
		msg.Term()
		return
	}

	if err := processOrder(msg.Data); err != nil {
		// Back off progressively: 30s per prior delivery attempt
		msg.NakWithDelay(time.Duration(deliveryCount) * 30 * time.Second)
		return
	}
	msg.Ack()
}, nats.ManualAck())

Use nack with delay (backoff) to space out retries rather than immediate redelivery, which tends to fail again instantly for the same reason:

Terminal window
# Consumer-level backoff configuration
nats consumer add STREAM_NAME CONSUMER_NAME \
--backoff linear \
--backoff-steps 3 \
--backoff-min 10s \
--backoff-max 5m

Long-term: design for idempotency

Make all message processing idempotent. Since at-least-once delivery means redelivery can always happen (even at low rates), your processing pipeline must handle duplicates correctly. Use message deduplication keys, database upserts, or idempotency tokens.

Monitor redelivery rates as a first-class SLI. Track the redelivery ratio over time and alert when it exceeds acceptable levels.

Use Synadia Insights for fleet-wide redelivery monitoring. Insights evaluates the redelivery-critical threshold across all consumers, providing a single view of redelivery health across your entire deployment without manual per-consumer configuration.

Frequently asked questions

How do I set the redelivery-critical threshold?

Set it as consumer metadata:

Terminal window
nats consumer add STREAM_NAME CONSUMER_NAME \
--metadata "io.nats.monitor.redelivery-critical=500"

The value is an integer representing the maximum acceptable num_redelivered count. Choose based on your workload’s tolerance for duplicate processing and the consumer’s normal redelivery baseline.

What’s the difference between redelivery critical and high consumer redelivery?

High consumer redelivery (CONSUMER_002) is a warning-level check that uses heuristics or relative thresholds to flag elevated redelivery rates. Redelivery critical (CONSUMER_011) uses an explicit operator-defined absolute threshold, making it a hard limit that the operator has determined represents a critical problem for that specific consumer.

Does num_redelivered reset when the consumer is recreated?

Yes. The num_redelivered counter is part of the consumer’s state and resets to zero when the consumer is deleted and recreated. It does not reset on consumer restart or server restart — only on consumer deletion.

Can I reduce redeliveries without changing ack_wait?

Yes. Use msg.InProgress() to signal that the consumer is actively processing the message. This resets the ack_wait timer for that specific message without changing the consumer-wide setting. This is ideal for messages that occasionally take longer than average to process.

What happens to messages that exceed max_deliver?

Messages that have been delivered max_deliver times without acknowledgment are dropped from the consumer — the consumer’s ack floor advances past them. The messages remain in the stream and are accessible to other consumers, but this consumer will not attempt to deliver them again. To preserve these messages, implement a dead-letter pattern that captures them before they’re terminated.

Proactive monitoring for NATS redelivery critical with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial