
NATS High Consumer Redelivery: What It Means and How to Fix It

Severity: Warning
Category: Errors
Applies to: System Improvement
Check ID: OPT_SYS_002
Detection threshold: Redelivery rate exceeds 10% of delivered messages

A high consumer redelivery rate means a JetStream consumer is receiving the same messages repeatedly because they are not being acknowledged within the configured ack_wait window. When redeliveries exceed a significant percentage of total deliveries, it signals that the consuming application is consistently failing to process messages — either crashing, timing out, or encountering errors that prevent acknowledgment.

Why this matters

Every redelivered message represents wasted work. The server sends the message, the consumer (possibly) processes it partially, and then the cycle repeats. At a 10% redelivery rate on a consumer handling 100,000 messages per hour, that’s 10,000 extra deliveries competing for the same processing resources as new messages. The consumer falls further behind, redeliveries compound, and throughput degrades.

The deeper problem is what redeliveries indicate: something in the processing pipeline is broken. Messages might be hitting an unhandled error path that causes the consumer to crash before acknowledging. A downstream dependency — a database, an API, a queue — might be intermittently unavailable, causing timeouts that exceed ack_wait. Or a subset of messages might be malformed or trigger edge cases that the consumer cannot handle, creating “poison messages” that cycle through redelivery indefinitely until they hit max_deliver.

Left unchecked, high redelivery creates a feedback loop. Redelivered messages consume processing capacity that could handle new messages, pushing the consumer further behind. Ack pending counts climb toward the limit (see OPT_SYS_003). If max_deliver is not configured, poison messages cycle forever. If it is configured without an advisory handler, those messages silently disappear — acknowledged as processed when they never were.

Common causes

  • ack_wait shorter than actual processing time. The default ack_wait is 30 seconds. If message processing involves database writes, HTTP calls, or computation that occasionally exceeds this window, the server redelivers the message while the consumer is still working on it. The consumer then processes the same message twice — once from the original delivery and once from the redelivery.

  • Consumer crashes during processing. If the consumer process crashes or restarts after receiving a message but before acknowledging it, every in-flight message at crash time will be redelivered after ack_wait expires. Frequent restarts (OOM kills, unhandled exceptions, deployment churn) cause sustained high redelivery rates.

  • Downstream dependency failures. The consumer successfully receives and parses the message but cannot complete processing because a database is unreachable, an API returns errors, or a downstream queue is full. Without explicit error handling that uses Nak or Term, the message sits unacknowledged until ack_wait expires.

  • Poison messages. A subset of messages triggers a bug — a nil pointer on an unexpected field, a schema violation, a payload too large for a downstream system. These messages fail on every delivery attempt, consuming redelivery budget until max_deliver is reached. Without a dead letter strategy, they either cycle indefinitely or are silently dropped.

  • Duplicate processing from slow Ack. The consumer processes the message successfully and sends an Ack, but the ack arrives at the server after ack_wait has already triggered redelivery. This is common when the ack is sent after a long processing chain rather than using InProgress to extend the deadline.

How to diagnose

Check redelivery rates per consumer

Use the consumer report to see redelivery counts across all consumers on a stream:

nats consumer report <stream_name>

Look for the Redelivered column. Compare it to the Delivered column to calculate the redelivery percentage.

For detailed stats on a specific consumer:

nats consumer info <stream_name> <consumer_name>

Key fields:

  • Redelivered — Total number of redelivered messages
  • Ack Pending — Messages delivered but not yet acknowledged
  • Ack Floor — Last contiguous acknowledged sequence
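To turn these fields into the percentage the check measures, divide the redelivered count by the delivered consumer sequence. A small illustrative helper — the struct mirrors a trimmed slice of the `nats consumer info --json` output (field names `num_redelivered` and `delivered.consumer_seq` follow the JetStream consumer info API; verify against your CLI version):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// consumerInfo is a trimmed view of `nats consumer info --json` output.
type consumerInfo struct {
	NumRedelivered uint64 `json:"num_redelivered"`
	Delivered      struct {
		ConsumerSeq uint64 `json:"consumer_seq"`
	} `json:"delivered"`
}

// redeliveryRate returns redelivered messages as a percentage of delivered.
func redeliveryRate(raw []byte) (float64, error) {
	var ci consumerInfo
	if err := json.Unmarshal(raw, &ci); err != nil {
		return 0, err
	}
	if ci.Delivered.ConsumerSeq == 0 {
		return 0, nil
	}
	return float64(ci.NumRedelivered) / float64(ci.Delivered.ConsumerSeq) * 100, nil
}

func main() {
	raw := []byte(`{"num_redelivered": 12500, "delivered": {"consumer_seq": 100000}}`)
	rate, err := redeliveryRate(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("redelivery rate: %.1f%%\n", rate) // redelivery rate: 12.5%
}
```

Anything above 10% trips this check; sustained values above 1% are worth investigating.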

Monitor redelivery advisories in real time

NATS emits an advisory when a message hits max_deliver:

nats events --js-advisory

Watch for io.nats.jetstream.advisory.v1.max_deliver events. These identify the specific stream, consumer, and stream sequence of messages that exhausted their delivery attempts.

Identify whether the problem is time-based or error-based

If redeliveries correlate with processing latency spikes, the issue is likely ack_wait timeout. Check your consumer’s processing time distribution — if P99 latency exceeds ack_wait, you’ll see redeliveries on the slowest messages.

If redeliveries cluster around specific message sequences that repeat at regular ack_wait intervals, suspect poison messages. Cross-reference the redelivered sequence numbers with your application logs to find the failing messages.

Check consumer configuration

nats consumer info <stream_name> <consumer_name> --json | jq '.config | {ack_wait, max_deliver, max_ack_pending}'

Verify that ack_wait is appropriate for your processing time and that max_deliver is set to prevent infinite redelivery loops.

How to fix it

Immediate: stop the redelivery loop

Set max_deliver to cap retry attempts and prevent infinite redelivery loops. Without max_deliver, poison messages cycle through redelivery forever. Configure it alongside a dead letter advisory handler:

nats consumer edit <stream_name> <consumer_name> --max-deliver=5

Configure backoff for exponential retry spacing. Instead of redelivering at a fixed ack_wait interval, use backoff to space retries exponentially — this gives downstream dependencies time to recover and reduces the processing load from redeliveries.

Use InProgress to extend the ack deadline for long-running work. If processing legitimately takes longer than ack_wait (default 30s), signal the server that work is ongoing:

// Go client (nats.go)
sub, _ := js.PullSubscribe("ORDERS.>", "order-processor")
msgs, _ := sub.Fetch(10)
for _, msg := range msgs {
	done := make(chan struct{})

	// Signal work in progress every 10 seconds until processing completes,
	// so ack_wait never expires mid-flight
	go func(m *nats.Msg) {
		ticker := time.NewTicker(10 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				_ = m.InProgress()
			case <-done:
				return
			}
		}
	}(msg)

	processOrder(msg)
	close(done)
	_ = msg.Ack()
}
# Python (nats.py)
import asyncio
import nats

async def keep_alive(msg):
    # Extend the ack deadline every 10 seconds while processing runs
    while True:
        await asyncio.sleep(10)
        await msg.in_progress()

nc = await nats.connect()
js = nc.jetstream()
sub = await js.pull_subscribe("ORDERS.>", "order-processor")
msgs = await sub.fetch(10)
for msg in msgs:
    # Extend deadline during long processing
    task = asyncio.create_task(keep_alive(msg))
    await process_order(msg)
    task.cancel()
    await msg.ack()

Terminate poison messages explicitly. If a message cannot be processed, use Term to tell the server to stop redelivering it:

if err := processMessage(msg); err != nil {
	if isPermanentError(err) {
		_ = msg.Term() // Stop redelivering this message
	} else {
		_ = msg.NakWithDelay(5 * time.Second) // Retry after backoff
	}
}

Short-term: fix the processing pipeline

Set max_deliver and handle the advisory. Configure a maximum delivery count so poison messages don’t cycle forever. Then consume the max delivery advisory to route failed messages to a dead letter stream:

// Create consumer with max_deliver
_, err := js.AddConsumer("ORDERS", &nats.ConsumerConfig{
	Durable:       "order-processor",
	AckPolicy:     nats.AckExplicitPolicy,
	AckWait:       60 * time.Second,
	MaxDeliver:    5,
	MaxAckPending: 1000,
	FilterSubject: "ORDERS.>",
})

// Handle dead letters via advisory
nc.Subscribe("$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.ORDERS.order-processor",
	func(msg *nats.Msg) {
		// Parse advisory, publish message details to dead letter stream
		js.Publish("DEAD_LETTERS.orders", msg.Data)
	},
)

Increase ack_wait if processing legitimately takes longer. The default ack_wait is 30 seconds. If your P99 processing time is 45 seconds, increase it:

nats consumer edit <stream_name> <consumer_name> --wait=90s

Set ack_wait to at least 2x your P99 processing time to account for variance. Common causes of redelivery are processing time exceeding ack_wait, application panics before acknowledging, or incorrect ack logic (acking the wrong message).

Long-term: design for reliable processing

Separate message receipt from processing. Acknowledge the message once it’s durably enqueued in your internal processing pipeline, not after the entire processing chain completes. This decouples NATS delivery semantics from downstream processing reliability.

Implement idempotent processing. Since redeliveries mean a message may be processed more than once, ensure your processing logic handles duplicates safely. Use the message’s stream sequence or a domain-specific deduplication key.
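A minimal in-memory sketch of sequence-based deduplication — in a real consumer the key would come from msg.Metadata().Sequence.Stream, and the seen-set would live in durable storage rather than a map:

```go
package main

import (
	"fmt"
	"sync"
)

// seqDeduper remembers which stream sequences have already been processed,
// so a redelivered message becomes a no-op. Illustrative only: an in-memory
// map resets on restart and grows unbounded.
type seqDeduper struct {
	mu   sync.Mutex
	seen map[uint64]bool
}

func newSeqDeduper() *seqDeduper {
	return &seqDeduper{seen: make(map[uint64]bool)}
}

// FirstDelivery returns true only the first time a sequence is observed.
func (d *seqDeduper) FirstDelivery(streamSeq uint64) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[streamSeq] {
		return false
	}
	d.seen[streamSeq] = true
	return true
}

func main() {
	d := newSeqDeduper()
	fmt.Println(d.FirstDelivery(42)) // true: first delivery, run side effects
	fmt.Println(d.FirstDelivery(42)) // false: redelivery, skip work and just Ack
}
```

On a redelivery the handler still acknowledges the message; it simply skips the side effects it has already performed.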

Add structured error handling with Nak backoff. Instead of letting messages time out silently, use NakWithDelay with exponential backoff for transient errors:

func handleMessage(msg *nats.Msg) {
	md, _ := msg.Metadata()
	attempt := md.NumDelivered

	err := processMessage(msg)
	if err == nil {
		msg.Ack()
		return
	}
	if isPermanent(err) {
		msg.Term()
		return
	}

	delay := time.Duration(math.Pow(2, float64(attempt))) * time.Second
	msg.NakWithDelay(delay)
}

Frequently asked questions

What is a normal redelivery rate for NATS JetStream consumers?

A healthy consumer should have a redelivery rate well below 1%. Occasional redeliveries during deployments or transient failures are expected, but sustained rates above 5-10% indicate a systemic problem. Synadia Insights flags consumers whose redelivered messages exceed 10% of delivered messages.

How do I find which specific messages are being redelivered?

Check msg.Metadata().NumDelivered on JetStream messages — any value greater than 1 indicates a redelivery (the count is parsed from the JetStream ack reply subject; there is no Nats-Num-Delivered header). For messages that exhaust max_deliver, subscribe to $JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.<stream>.<consumer> to get the stream sequence number and metadata for each failed message. Cross-reference these sequences with your application logs.

What is the difference between Nak and Term in NATS JetStream?

Nak tells the server to redeliver the message (optionally after a delay with NakWithDelay). Use it for transient errors where retry might succeed. Term tells the server to permanently stop delivering the message — it counts as acknowledged and won’t be redelivered. Use Term for poison messages that will never process successfully. Without either, the message sits unacknowledged until ack_wait expires, then gets redelivered automatically.

Can high redelivery cause a consumer to fall behind?

Yes. Every redelivered message competes with new messages for processing capacity. If a consumer processes 1,000 msg/s and 20% are redeliveries, effective throughput for new messages drops to 800 msg/s. Meanwhile, the redelivered messages that fail again create more redeliveries in the next cycle. This feedback loop can push ack pending to the limit (OPT_SYS_003), stalling delivery entirely.

Should I set max_deliver on every consumer?

Yes. Without max_deliver, a poison message will be redelivered indefinitely, consuming resources forever. Set max_deliver to a reasonable value (typically 3-10) and always handle the max delivery advisory to route failed messages somewhere observable — a dead letter stream, an alert, a log. Silent message loss from hitting max_deliver without monitoring is worse than the redelivery loop.

Proactive monitoring for NATS high consumer redelivery with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.
