A redelivery critical alert fires when a JetStream consumer’s num_redelivered counter exceeds the operator-defined threshold set via the io.nats.monitor.redelivery-critical metadata key. This counter tracks how many messages have been delivered more than once — meaning the original delivery was not acknowledged within the ack_wait window, or the consumer explicitly nacked the message. A high redelivery count signals that the consumer is failing to process messages reliably, wasting server resources on repeated delivery attempts and potentially causing duplicate processing in downstream systems.
Every redelivered message represents wasted work. The server must re-read the message from storage, route it through the consumer’s delivery path, and track it in the ack pending set — all for a message that was already delivered at least once. At scale, excessive redelivery can consume a significant fraction of server I/O and network bandwidth, degrading performance for all consumers on the same stream.
But the resource waste is secondary to the correctness problem. Unless your processing pipeline is fully idempotent, redelivered messages risk being processed multiple times. An order might be charged twice. A notification might be sent repeatedly. A state machine might process the same transition multiple times, leaving the system in an inconsistent state. Even systems designed for at-least-once delivery assume that redelivery is an exceptional event, not the common path.
Redelivery also creates a feedback loop. Each redelivered message occupies a slot in the max_ack_pending window. If redeliveries accumulate faster than they’re resolved, the effective throughput of the consumer drops — fewer slots are available for new messages because redelivered messages keep cycling through the pending set. In severe cases, the consumer spends all its capacity redelivering the same batch of failing messages while new messages queue up in the stream untouched.
The redelivery-critical threshold lets operators define what constitutes an unacceptable redelivery level for each consumer. A consumer processing financial transactions might set a threshold of 10. A best-effort metrics consumer might tolerate 1,000. The threshold is operator-defined because only the operator knows the impact of redelivery for their specific workload.
Persistent processing failures. The most common cause. The consumer receives a message, attempts to process it, encounters an error (invalid data, downstream service error, resource exhaustion), and either nacks the message or lets the ack_wait expire. The server redelivers, the same error occurs, and the cycle repeats.
ack_wait too short for processing time. The consumer successfully processes messages, but processing takes longer than the configured ack_wait. The server assumes the message was lost and redelivers it before the consumer finishes. The consumer then acks the original delivery (which the server may ignore as stale) and receives a duplicate.
Consumer application restarts during processing. When a consumer disconnects mid-processing — due to a crash, deployment, or scaling event — all unacknowledged messages in its pending set are redelivered to other subscribers (or to the same subscriber on reconnect). Frequent restarts during active processing generate a steady stream of redeliveries.
Poison messages. A message that consistently crashes or errors the consumer on every processing attempt. Each delivery fails, triggers a redelivery, and fails again. Without a max_deliver limit, the poison message cycles indefinitely, consuming consumer capacity and inflating the redelivery counter.
Slow consumer causing ack_wait timeouts. The consumer is processing messages but too slowly. By the time it finishes one batch, the next batch has already exceeded ack_wait and been redelivered. The consumer processes both the original and the redelivery, doubling its workload and falling further behind.
Client library reconnection behavior. Some client library versions or configurations may not properly track in-flight messages across reconnections, leading to messages being redelivered unnecessarily after a brief network disruption.
```
nats consumer info STREAM_NAME CONSUMER_NAME
```

Look at Redelivered Messages in the output. Compare this to the total delivered count to understand the redelivery ratio. A ratio above 10% usually indicates a systemic issue.
```
# Watch the consumer stats, refreshing every 5 seconds
watch -n 5 'nats consumer info STREAM_NAME CONSUMER_NAME -j | \
  jq "{delivered: .delivered.consumer_seq, redelivered: .num_redelivered, ack_pending: .num_ack_pending}"'
```

If num_redelivered is climbing while delivered is relatively flat, the consumer is mostly processing redeliveries rather than new messages.
Check if specific messages are being redelivered repeatedly. If your consumer application logs include the stream sequence number on processing failures, look for sequences that appear multiple times:
```
# Check the consumer's pending messages
nats consumer next STREAM_NAME CONSUMER_NAME --count 1 --no-ack
```

Inspect msg.Metadata().NumDelivered on each fetched JetStream message (the count is parsed from the ack reply subject; it is not exposed as a header). A value significantly above 1 indicates the message has been redelivered multiple times.
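If you prefer to do this inspection programmatically, here is a minimal sketch using the nats.go client, assuming a durable pull consumer (the STREAM_NAME and CONSUMER_NAME placeholders match the commands above). It fetches a few messages without acknowledging them and prints each message's delivery count; note that, like the --no-ack command, the fetch itself counts as a delivery and the messages will be redelivered after ack_wait.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, _ := nc.JetStream()

	// Bind to the existing durable pull consumer (placeholder names).
	sub, err := js.PullSubscribe("", "CONSUMER_NAME", nats.BindStream("STREAM_NAME"))
	if err != nil {
		log.Fatal(err)
	}

	msgs, err := sub.Fetch(5, nats.MaxWait(2*time.Second))
	if err != nil {
		log.Fatal(err)
	}
	for _, msg := range msgs {
		md, _ := msg.Metadata()
		// NumDelivered > 1 means this message has already been redelivered.
		fmt.Printf("stream seq %d delivered %d time(s)\n", md.Sequence.Stream, md.NumDelivered)
		// Deliberately not acking: these messages will be redelivered after ack_wait.
	}
}
```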
```
nats consumer info STREAM_NAME CONSUMER_NAME -j | jq '.config.ack_wait'
```

If your application’s p99 processing time approaches or exceeds this value, ack_wait timeouts are likely driving redeliveries.
```go
// Go: check redelivery rate and identify trends
js, _ := nc.JetStream()
ci, _ := js.ConsumerInfo("STREAM_NAME", "CONSUMER_NAME")

redeliveryRatio := float64(ci.NumRedelivered) / float64(ci.Delivered.Consumer)
fmt.Printf("Consumer: %s\n", ci.Name)
fmt.Printf("  Total Delivered: %d\n", ci.Delivered.Consumer)
fmt.Printf("  Redelivered: %d (%.1f%%)\n", ci.NumRedelivered, redeliveryRatio*100)
fmt.Printf("  Ack Pending: %d\n", ci.NumAckPending)
fmt.Printf("  Ack Wait: %s\n", ci.Config.AckWait)

if ci.NumRedelivered > 1000 {
	fmt.Println("  ⚠ CRITICAL: redelivery count exceeds threshold")
}
```

```python
# Python: check redelivery stats
import nats

nc = await nats.connect()
js = nc.jetstream()

ci = await js.consumer_info("STREAM_NAME", "CONSUMER_NAME")
total = ci.delivered.consumer_seq
redelivered = ci.num_redelivered
ratio = (redelivered / total * 100) if total > 0 else 0

print(f"Consumer: {ci.name}")
print(f"  Total Delivered: {total}")
print(f"  Redelivered: {redelivered} ({ratio:.1f}%)")
print(f"  Ack Pending: {ci.num_ack_pending}")

if redelivered > 1000:
    print("  ⚠ CRITICAL: redelivery count exceeds threshold")
```

Set max_deliver to limit redelivery attempts. Without a limit, poison messages cycle forever. Setting max_deliver caps the number of delivery attempts per message:
```
nats consumer edit STREAM_NAME CONSUMER_NAME --max-deliver 5
```

Messages that exceed max_deliver are dropped from the consumer’s perspective. Pair this with a dead-letter strategy (see below) to avoid losing data.
Increase ack_wait if processing time is the bottleneck:
```
nats consumer edit STREAM_NAME CONSUMER_NAME --wait 5m
```

Use in-progress signals for long-running processing to prevent ack_wait timeouts without increasing the global timeout:
```go
// Reset the ack_wait timer periodically while the message is processed;
// close(done) when processing finishes so the goroutine stops extending it.
done := make(chan struct{})
go func() {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			msg.InProgress()
		case <-done:
			return
		}
	}
}()
```

Fix the root cause of nacks and processing errors. Examine your application logs for the errors that trigger nacks.
For messages that can never be processed successfully, terminate them (msg.Term() instead of nack) so the server stops redelivering them.

Implement a dead-letter pattern for poison messages. When a message exceeds a delivery threshold, route it to a separate stream for manual inspection rather than continuing to redeliver:
```go
sub, _ := js.Subscribe("orders.>", func(msg *nats.Msg) {
	md, _ := msg.Metadata()
	deliveryCount := int(md.NumDelivered)

	if deliveryCount > 3 {
		// Dead-letter: publish to DLQ stream and terminate
		nc.Publish("dlq.orders", msg.Data)
		msg.Term()
		return
	}

	if err := processOrder(msg.Data); err != nil {
		msg.NakWithDelay(time.Duration(deliveryCount) * 30 * time.Second)
		return
	}
	msg.Ack()
})
```

Use nack with delay (backoff) to space out retries rather than immediate redelivery, which tends to fail again instantly for the same reason:
```
# Consumer-level backoff configuration
nats consumer add STREAM_NAME CONSUMER_NAME \
  --backoff linear \
  --backoff-steps 3 \
  --backoff-min 10s \
  --backoff-max 5m
```

Make all message processing idempotent. Since at-least-once delivery means redelivery can always happen (even at low rates), your processing pipeline must handle duplicates correctly. Use message deduplication keys, database upserts, or idempotency tokens.
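As one illustration, a handler can skip work it has already done by recording an idempotency key before acting on the message. The sketch below is hypothetical: alreadyProcessed and markProcessed stand in for an atomic check in whatever store you use (a database upsert, a unique constraint, a Redis SETNX), and the stream sequence is just one workable key; processOrder is the same placeholder used in the dead-letter example above.

```go
// Hypothetical idempotent handler: skip messages whose key was already processed.
// alreadyProcessed / markProcessed are placeholders for an atomic check against
// your own store (e.g. database upsert or Redis SETNX).
func handle(msg *nats.Msg) {
	md, _ := msg.Metadata()
	// Use the stream sequence as a stable idempotency key for this message.
	key := fmt.Sprintf("orders:%d", md.Sequence.Stream)

	if alreadyProcessed(key) {
		// Duplicate delivery: acknowledge without repeating side effects.
		msg.Ack()
		return
	}
	if err := processOrder(msg.Data); err != nil {
		msg.Nak()
		return
	}
	markProcessed(key)
	msg.Ack()
}
```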
Monitor redelivery rates as a first-class SLI. Track the redelivery ratio over time and alert when it exceeds acceptable levels.
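A minimal sketch of that kind of tracking, building on the ConsumerInfo call from the Go example above: poll the consumer on a fixed interval, compute the ratio, and hand it to whatever metrics or alerting system you already run. The exportGauge call and the 10% threshold are placeholders to adapt to your own SLI.

```go
// Poll redelivery stats every minute and emit them as a gauge.
// exportGauge is a placeholder for your metrics client (Prometheus, statsd, ...).
for range time.Tick(time.Minute) {
	ci, err := js.ConsumerInfo("STREAM_NAME", "CONSUMER_NAME")
	if err != nil {
		log.Printf("consumer info: %v", err)
		continue
	}
	ratio := 0.0
	if ci.Delivered.Consumer > 0 {
		ratio = float64(ci.NumRedelivered) / float64(ci.Delivered.Consumer)
	}
	exportGauge("jetstream_consumer_redelivery_ratio", ratio)
	if ratio > 0.10 {
		log.Printf("redelivery ratio %.1f%% exceeds SLI target", ratio*100)
	}
}
```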
Use Synadia Insights for fleet-wide redelivery monitoring. Insights evaluates the redelivery-critical threshold across all consumers, providing a single view of redelivery health across your entire deployment without manual per-consumer configuration.
Set it as consumer metadata:
```
nats consumer add STREAM_NAME CONSUMER_NAME \
  --metadata "io.nats.monitor.redelivery-critical=500"
```

The value is an integer representing the maximum acceptable num_redelivered count. Choose based on your workload’s tolerance for duplicate processing and the consumer’s normal redelivery baseline.
High consumer redelivery (CONSUMER_002) is a warning-level check that uses heuristics or relative thresholds to flag elevated redelivery rates. Redelivery critical (CONSUMER_011) uses an explicit operator-defined absolute threshold, making it a hard limit that the operator has determined represents a critical problem for that specific consumer.
The num_redelivered counter is part of the consumer’s state and resets to zero only when the consumer is deleted and recreated. It does not reset on consumer restart or server restart.
To give an individual message more time, use msg.InProgress() to signal that the consumer is still actively processing it. This resets the ack_wait timer for that specific message without changing the consumer-wide setting, which makes it ideal for messages that occasionally take longer than average to process.
Messages that have been delivered max_deliver times without acknowledgment are dropped from the consumer — the consumer’s ack floor advances past them. The messages remain in the stream and are accessible to other consumers, but this consumer will not attempt to deliver them again. To preserve these messages, implement a dead-letter pattern that captures them before they’re terminated.