Outstanding ack critical means a JetStream consumer has more in-flight, unacknowledged messages than the operator-defined threshold. These are messages the server has delivered to a client but hasn’t received an acknowledgment for — they’re in limbo between “sent” and “confirmed processed.” When this count climbs past the threshold, it signals that consumer processing capacity cannot keep up with the delivery rate, or that something is preventing acknowledgments from being sent.
Every unacknowledged message represents work that might not complete. The server has delivered the message and is waiting for confirmation. If the ack doesn’t arrive before the ack_wait timeout, the server redelivers the message to the same or a different consumer instance. This creates a compounding problem: redelivered messages consume processing capacity that could handle new messages, which pushes more messages into the ack-pending state, which triggers more redeliveries.
At critical levels, this cycle becomes a redelivery storm. The consumer spends most of its time re-processing messages it already attempted, while new messages pile up in the num_pending backlog. The effective throughput drops to a fraction of what the consumer can handle under normal conditions. In the worst case, messages hit their max_deliver limit and stop being delivered: the server publishes a max-deliveries advisory, and unless something consumes that advisory and reprocesses the message, it is effectively lost to this consumer.
The ack-pending count also has a hard ceiling: max_ack_pending on the consumer configuration (default 1,000 for pull consumers). Once this limit is reached, the server stops delivering new messages entirely — the consumer is back-pressured. If operators aren’t monitoring this, the stream’s num_pending grows unbounded while the consumer appears to be “working” but is actually throttled by its own unacknowledged backlog.
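These limits are all plain fields on the consumer configuration. A minimal nats.go sketch, with the stream, subject, durable name, and values purely illustrative:

```go
import (
    "time"

    "github.com/nats-io/nats.go"
)

// createOrdersConsumer is illustrative: explicit acks, a 30s ack window,
// at most 5 delivery attempts, and at most 1,000 unacknowledged messages
// in flight before the server stops delivering to this consumer.
func createOrdersConsumer(js nats.JetStreamContext) error {
    _, err := js.AddConsumer("ORDERS", &nats.ConsumerConfig{
        Durable:       "my-consumer",
        FilterSubject: "orders.>",
        AckPolicy:     nats.AckExplicitPolicy,
        AckWait:       30 * time.Second,
        MaxDeliver:    5,
        MaxAckPending: 1000,
    })
    return err
}
```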
Server-side, a high ack-pending count increases memory pressure. The server tracks every pending ack with metadata (sequence number, delivery count, timestamp, reply subject). At scale — thousands of consumers each with thousands of pending acks — this metadata overhead becomes significant.
Slow message processing. The consumer receives messages faster than it can process and acknowledge them. Database writes, HTTP calls, or complex computations in the message handler take longer than the inter-message delivery interval. The ack-pending count grows with each message that takes longer to process than the delivery rate allows.
Ack wait timeout too short. The ack_wait is configured shorter than the processing time for complex messages. Messages time out and are redelivered while the consumer is still processing the original delivery. Both the original and redelivered copies now occupy ack-pending slots.
Consumer crash or restart without ack. A consumer instance processes messages but crashes before sending acknowledgments. On restart (or when another instance picks up the messages), all previously delivered messages are still in ack-pending state. If the consumer frequently restarts, ack-pending accumulates with each cycle.
Network issues preventing ack delivery. The consumer processes messages successfully and sends acks, but network problems between the client and server cause acks to be lost or delayed. The server never receives the acks and counts the messages as outstanding.
max_ack_pending set too high relative to processing capacity. A high max_ack_pending allows the server to deliver a large batch of messages that the consumer can’t process within the ack_wait window. The consumer is overwhelmed, acks slow down, and the pending count stays elevated.
Poison messages causing processing failures. Messages that consistently fail processing (malformed data, schema mismatches, missing dependencies) are never acknowledged. They consume ack-pending slots, are redelivered, fail again, and consume more slots — crowding out healthy messages.
```bash
nats consumer info ORDERS my-consumer --json | jq '{
  num_ack_pending: .num_ack_pending,
  num_pending: .num_pending,
  num_redelivered: .num_redelivered,
  num_waiting: .num_waiting,
  config_max_ack_pending: .config.max_ack_pending,
  config_ack_wait: .config.ack_wait
}'
```

Key indicators:
```bash
# Watch the ack-pending count in real time
watch -n 5 'nats consumer info ORDERS my-consumer --json | jq .num_ack_pending'
```

If the count is stable and near the threshold, the consumer is consistently at capacity. If it's spiking and recovering, the problem is intermittent (likely correlated with traffic spikes or periodic processing slowdowns).
```bash
# High redelivery count indicates acks aren't arriving in time
nats consumer info ORDERS my-consumer --json | jq '{
  num_redelivered: .num_redelivered,
  delivered_consumer_seq: .delivered.consumer_seq,
  redelivery_ratio: (.num_redelivered / (.delivered.consumer_seq + 1) * 100 | tostring + "%")
}'
```

A redelivery ratio above 10% suggests the ack_wait is too aggressive or processing is too slow.
```go
import (
    "fmt"

    "github.com/nats-io/nats.go"
)

func checkAckPending(js nats.JetStreamContext, streamName string, threshold int) error {
    for consumer := range js.ConsumerNames(streamName) {
        info, err := js.ConsumerInfo(streamName, consumer)
        if err != nil {
            continue
        }
        if info.NumAckPending > threshold {
            pct := float64(info.NumAckPending) / float64(info.Config.MaxAckPending) * 100
            fmt.Printf("CRITICAL: stream=%s consumer=%s ack_pending=%d max=%d (%.1f%%) redelivered=%d\n",
                streamName, consumer, info.NumAckPending,
                info.Config.MaxAckPending, pct, info.NumRedelivered)
        }
    }
    return nil
}
```

```python
import asyncio
import nats

async def check_ack_pending(stream_name: str, threshold: int):
    nc = await nats.connect()
    js = nc.jetstream()

    async for consumer_name in js.consumer_names(stream_name):
        info = await js.consumer_info(stream_name, consumer_name)
        if info.num_ack_pending > threshold:
            pct = (info.num_ack_pending / info.config.max_ack_pending) * 100
            print(f"CRITICAL: stream={stream_name} consumer={consumer_name} "
                  f"ack_pending={info.num_ack_pending} max={info.config.max_ack_pending} "
                  f"({pct:.1f}%) redelivered={info.num_redelivered}")

    await nc.close()

asyncio.run(check_ack_pending("ORDERS", 5000))
```

Extend the ack wait timeout. If messages are being processed but acks arrive after the timeout, increase ack_wait to give the consumer more time:
```bash
nats consumer edit ORDERS my-consumer --wait 60s
```

Set ack_wait to at least 2-3x your p99 processing time to account for variance.
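The same change can also be made from code. A minimal nats.go sketch, assuming the ORDERS stream and my-consumer durable used above:

```go
// extendAckWait fetches the current consumer config, raises the ack
// window, and updates the consumer in place.
func extendAckWait(js nats.JetStreamContext) error {
    info, err := js.ConsumerInfo("ORDERS", "my-consumer")
    if err != nil {
        return err
    }
    cfg := info.Config
    cfg.AckWait = 60 * time.Second // roughly 2-3x the p99 processing time
    _, err = js.UpdateConsumer("ORDERS", &cfg)
    return err
}
```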
Use in-progress acknowledgments. For long-running processing, send +WPI (work in progress) acks to reset the ack timer without completing the message:
```go
sub, _ := js.PullSubscribe("orders.>", "my-consumer")
msgs, _ := sub.Fetch(10)
for _, msg := range msgs {
    msg.InProgress()    // reset ack timer
    processMessage(msg) // long-running work
    msg.Ack()
}
```

```python
async def process_messages(js):
    sub = await js.pull_subscribe("orders.>", "my-consumer")
    msgs = await sub.fetch(10)
    for msg in msgs:
        await msg.in_progress()  # reset ack timer
        await process_message(msg)
        await msg.ack()
```

Reduce max_ack_pending to limit concurrency. If the consumer is overwhelmed by too many concurrent messages, lower max_ack_pending so fewer messages are in flight at once:
```bash
nats consumer edit ORDERS my-consumer --max-pending 100
```

Scale horizontally. Add more consumer instances in the same consumer group. Each instance handles a portion of the messages, reducing per-instance ack-pending:
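For pull consumers, a rough sketch of several workers bound to the same durable (the worker count and the processMessage helper are illustrative):

```go
// Each worker binds to the same durable pull consumer; the server
// spreads messages across whichever worker fetches next.
for i := 0; i < 3; i++ {
    go func() {
        sub, _ := js.PullSubscribe("orders.>", "my-consumer")
        for {
            msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
            if err != nil {
                continue // timeout with no messages, or a transient error
            }
            for _, msg := range msgs {
                processMessage(msg)
                msg.Ack()
            }
        }
    }()
}
```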
```bash
# Each instance subscribes to the same durable consumer
# For pull consumers, multiple clients can pull from the same consumer
```

Move blocking work out of the message handler. Process messages asynchronously to free the message handler for the next delivery:
```go
work := make(chan *nats.Msg, 1000)

// Fast message handler — just enqueue
sub, _ := js.Subscribe("orders.>", func(msg *nats.Msg) {
    work <- msg
}, nats.Durable("my-consumer"), nats.ManualAck())

// Worker pool — does the heavy lifting
for i := 0; i < 10; i++ {
    go func() {
        for msg := range work {
            processMessage(msg)
            msg.Ack()
        }
    }()
}
```

Implement dead-letter handling. Messages that consistently fail processing should be moved to a dead-letter stream instead of endlessly retried:
```go
sub, _ := js.Subscribe("orders.>", func(msg *nats.Msg) {
    meta, _ := msg.Metadata()
    if meta.NumDelivered > 3 {
        // Move to dead-letter stream
        js.Publish("dead-letter.orders", msg.Data)
        msg.Term() // terminate redelivery
        return
    }
    if err := processMessage(msg); err != nil {
        msg.Nak() // request redelivery
        return
    }
    msg.Ack()
}, nats.Durable("my-consumer"), nats.ManualAck())
```

Set max_deliver to cap redelivery attempts. Prevent infinite redelivery loops by limiting how many times a message can be redelivered:
```bash
nats consumer edit ORDERS my-consumer --max-deliver 5
```

Monitor with Insights. Synadia Insights evaluates CONSUMER_006 against your configured threshold, alerting before the ack-pending count reaches max_ack_pending and causes the server to stop delivering messages.
num_ack_pending is messages delivered but not yet acknowledged — they’re actively being processed (or waiting to be redelivered). num_pending is messages in the stream that haven’t been delivered to this consumer yet — the backlog. A healthy consumer has low num_ack_pending and decreasing num_pending. High num_ack_pending with growing num_pending means the consumer is stuck.
The threshold is set via the io.nats.monitor.outstanding-ack-critical metadata key on the stream or consumer configuration. Set it based on your expected processing capacity: if your consumer can handle 1,000 in-flight messages, set the threshold to 800 (80%) to alert before saturation. Synadia Insights reads this metadata automatically.
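If you manage consumers from code, the metadata key can be set on the consumer configuration. A sketch using nats.go (assumes a server and client recent enough to support consumer metadata, NATS 2.10+; the threshold value is illustrative):

```go
// setOutstandingAckThreshold tags the consumer with the monitoring
// threshold that Insights reads for CONSUMER_006.
func setOutstandingAckThreshold(js nats.JetStreamContext) error {
    info, err := js.ConsumerInfo("ORDERS", "my-consumer")
    if err != nil {
        return err
    }
    cfg := info.Config
    if cfg.Metadata == nil {
        cfg.Metadata = map[string]string{}
    }
    // Alert at 800 in-flight messages (80% of a 1,000 max_ack_pending).
    cfg.Metadata["io.nats.monitor.outstanding-ack-critical"] = "800"
    _, err = js.UpdateConsumer("ORDERS", &cfg)
    return err
}
```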
Lowering max_ack_pending controls the symptom, not the cause. It prevents the server from overwhelming the consumer, which stabilizes the system. But the root cause — slow processing, short ack_wait, or insufficient consumer instances — still needs to be addressed. Think of max_ack_pending as a safety valve, not a fix.
Nak'ing pending messages to clear the count doesn't help — Nak causes immediate redelivery, which puts the messages right back into ack-pending. If you want to skip messages, use msg.Term() to permanently terminate their delivery (they won't be redelivered). Use this only for messages you're willing to lose.