
NATS Ack Pending Buildup: What It Means and How to Fix It

Severity: Warning
Category: Errors
Applies to: System Improvement
Check ID: OPT_SYS_003
Detection threshold: num_ack_pending reaches configured percentage of max_ack_pending (default: 80%)

Ack pending buildup occurs when a JetStream consumer has a growing number of delivered-but-unacknowledged messages approaching the max_ack_pending limit. When that limit is reached, the server stops delivering new messages to the consumer entirely, creating a hard stall that persists until in-flight messages are acknowledged, nak’d, or expire past their ack_wait window.

Why this matters

The max_ack_pending limit is JetStream’s built-in backpressure mechanism. It prevents a consumer from accepting more messages than it can handle by capping how many can be in-flight simultaneously. When ack pending climbs to 80% or more of this limit, the consumer is approaching the point where delivery will pause — and a minor slowdown or latency spike is enough to push it over the edge.

Once max_ack_pending is reached, the server holds all new messages in the stream. They don’t disappear — they accumulate as num_pending. But no new messages are delivered until the consumer acknowledges or otherwise resolves some of its in-flight messages. For real-time workloads, this means the consumer falls behind. For request-reply patterns built on JetStream, it means timeouts. For event-driven architectures, it means cascading delays across every downstream service waiting on those messages.

The buildup is often gradual. A consumer that processes messages in 50ms at average load may creep to 200ms under peak traffic, slowly filling the ack pending pool. Everything looks fine until the limit is hit and delivery stops abruptly. There’s no graceful degradation — it’s a cliff. By the time you notice, the consumer may have thousands of undelivered messages queued in the stream, and recovery requires both clearing the backlog and addressing whatever caused the processing slowdown.

Common causes

  • Processing throughput below publish rate. The consumer processes messages slower than they arrive. If the stream receives 5,000 msg/s and the consumer processes 3,000 msg/s, the ack pending pool fills at 2,000 msg/s. At the default max_ack_pending of 1,000, the limit is hit in under a second.

  • Downstream dependency latency. The consumer calls a database, API, or external service during processing. When that dependency slows down — connection pool exhaustion, query timeouts, rate limiting — each message takes longer to process, and ack pending climbs.

  • Single-threaded or under-parallelized processing. The consumer processes messages sequentially when the workload could be parallelized. A single-threaded consumer with 100ms processing time per message can handle at most 10 msg/s, regardless of how high max_ack_pending is set.

  • Batch processing holding messages. The consumer collects messages into batches (for bulk database inserts, for example) and only acknowledges them when the batch completes. A batch size of 500 with 30-second batch windows means 500 messages sit in ack pending for the entire window.

  • max_ack_pending set too low. The default max_ack_pending is 1,000. A limit of 256 or 1,000 may be adequate for low-throughput streams but completely insufficient for high-rate workloads. The limit should reflect how many messages your consumer can realistically have in-flight at any moment. Note that ack pending can also be limited at the stream level via consumer limits and at the account level.

  • Consumer not acknowledging on error paths. The happy path sends Ack, but error handlers that log and continue without calling Nak or Term leave messages in limbo until ack_wait expires. Each unresolved message occupies an ack pending slot.
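The arithmetic in the first cause above generalizes to a simple time-to-stall estimate. The sketch below is illustrative (the function name and signature are my own, not part of NATS or this check):

```go
package main

import "fmt"

// timeToStall estimates seconds until a consumer hits max_ack_pending,
// given steady publish and processing rates in msg/s. Real fill rates
// fluctuate with load, so treat this as a rough lower bound.
func timeToStall(maxAckPending int, publishRate, processRate float64) float64 {
	fillRate := publishRate - processRate
	if fillRate <= 0 {
		return -1 // processing keeps up; the pool never fills at these rates
	}
	return float64(maxAckPending) / fillRate
}

func main() {
	// The numbers from the first bullet: 5,000 msg/s in, 3,000 msg/s
	// processed, default max_ack_pending of 1,000.
	fmt.Println(timeToStall(1000, 5000, 3000)) // 0.5
}
```

At a 2,000 msg/s fill rate, the default limit of 1,000 is exhausted in half a second, which is why the buildup can look instantaneous in dashboards with coarser resolution.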

How to diagnose

Check ack pending levels

Get the current ack pending count and limit for a specific consumer:

Terminal window
nats consumer info <stream_name> <consumer_name>

Look for:

  • Ack Pending — Current count of delivered, unacknowledged messages
  • Max Ack Pending — The configured limit

Compare these to calculate how close you are to stalling. At 80%+, the consumer is in the danger zone.

For a quick overview across all consumers on a stream:

Terminal window
nats consumer report <stream_name>

Determine if the consumer is stalled

If num_pending is growing while num_ack_pending equals max_ack_pending, the consumer is stalled. Messages are accumulating in the stream with no delivery:

Terminal window
nats consumer info <stream_name> <consumer_name> --json | jq '{
  ack_pending: .num_ack_pending,
  max_ack_pending: .config.max_ack_pending,
  pending: .num_pending,
  stalled: (.num_ack_pending >= .config.max_ack_pending)
}'
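If you are consuming this JSON programmatically, the same check's thresholds can be expressed as a small helper. The function name and the three state labels below are illustrative choices, not NATS terminology:

```go
package main

import "fmt"

// classify applies this check's thresholds to the num_ack_pending and
// max_ack_pending values shown above.
func classify(ackPending, maxAckPending int) string {
	switch {
	case ackPending >= maxAckPending:
		return "stalled" // delivery paused until in-flight messages resolve
	case float64(ackPending) >= 0.8*float64(maxAckPending):
		return "danger" // at or past the default 80% detection threshold
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(classify(1000, 1000)) // stalled
	fmt.Println(classify(850, 1000))  // danger
	fmt.Println(classify(200, 1000))  // healthy
}
```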

Identify the processing bottleneck

Check whether the issue is throughput or latency:

  • If ack pending is near the limit but redeliveries are low, processing is slow but succeeding — the bottleneck is throughput.
  • If ack pending is near the limit and redeliveries are high (see OPT_SYS_002), messages are timing out without being acknowledged — the bottleneck is error handling or ack_wait misconfiguration.

Check for downstream dependency issues by correlating ack pending spikes with latency metrics on databases, APIs, or other services your consumer depends on.
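The two diagnostic cases above can be captured as a decision helper. The 10% redelivery cutoff here is an illustrative threshold of my choosing, not part of this check:

```go
package main

import "fmt"

// bottleneck distinguishes the two cases above: slow-but-succeeding
// processing versus messages timing out past ack_wait. redeliveryRatio
// is redelivered messages relative to total deliveries.
func bottleneck(ackPending, maxAckPending int, redeliveryRatio float64) string {
	if float64(ackPending) < 0.8*float64(maxAckPending) {
		return "none"
	}
	if redeliveryRatio > 0.1 {
		return "error handling or ack_wait misconfiguration"
	}
	return "throughput"
}

func main() {
	fmt.Println(bottleneck(900, 1000, 0.02)) // throughput
	fmt.Println(bottleneck(900, 1000, 0.35)) // error handling or ack_wait misconfiguration
}
```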

How to fix it

Immediate: relieve the pressure

Increase max_ack_pending if the consumer can handle more in-flight messages. The default is 1,000. This is only appropriate if the consumer has processing headroom but the limit is artificially low:

Terminal window
nats consumer edit <stream_name> <consumer_name> --max-pending=5000

Don’t set this arbitrarily high. max_ack_pending should reflect the actual number of messages your consumer can have in-flight without degrading. Setting it to 1,000,000 when your consumer can only process 100/s just delays the stall and increases memory usage.

Check for stream-level and account-level limits. Ack pending can also be constrained at the stream level via consumer limits and at the account level via account JetStream limits. If the consumer’s max_ack_pending is set appropriately but ack pending is still capped, check these higher-level limits.

Acknowledge messages on all code paths. Audit your message handler to ensure every exit path — success, transient error, permanent error — sends an appropriate response:

// Go client (nats.go)
func handleMessage(msg *nats.Msg) {
    data, err := unmarshal(msg.Data)
    if err != nil {
        _ = msg.Term() // Permanent failure — stop redelivering
        return
    }
    if err := process(data); err != nil {
        _ = msg.NakWithDelay(5 * time.Second) // Transient — retry later
        return
    }
    _ = msg.Ack()
}

Short-term: increase processing throughput

Parallelize message processing within a single consumer instance. Fetch messages in batches and process them concurrently:

// Go client — parallel processing with pull subscribe
sub, _ := js.PullSubscribe("EVENTS.>", "event-processor")
for {
    msgs, _ := sub.Fetch(100, nats.MaxWait(5*time.Second))
    var wg sync.WaitGroup
    for _, msg := range msgs {
        wg.Add(1)
        go func(m *nats.Msg) {
            defer wg.Done()
            if err := processEvent(m); err != nil {
                _ = m.Nak()
                return
            }
            _ = m.Ack()
        }(msg)
    }
    wg.Wait()
}
// TypeScript (nats.js)
import { connect } from "nats";

const nc = await connect();
const js = nc.jetstream();
const consumer = await js.consumers.get("EVENTS", "event-processor");

while (true) {
  const batch = await consumer.fetch({ max_messages: 100, expires: 5000 });
  const promises: Promise<void>[] = [];
  for await (const msg of batch) {
    promises.push(
      processEvent(msg.data)
        .then(() => msg.ack())
        .catch(() => msg.nak())
    );
  }
  await Promise.all(promises);
}

Scale out with multiple consumer instances. For pull consumers, multiple instances can pull from the same durable consumer. For push consumers, use a queue group. Each additional instance linearly increases aggregate throughput:

Terminal window
# A pull consumer (one with no deliver_subject in its config) can be shared by multiple instances
nats consumer info <stream_name> <consumer_name> --json | jq '.config.deliver_subject'

A null result means the consumer is pull-based and safe to scale horizontally; a subject value means it is push-based and needs a queue group instead.

Long-term: design for sustained throughput

Right-size max_ack_pending based on measured capacity. Profile your consumer under load: measure P99 processing time, maximum concurrent processing slots, and sustained throughput. Set max_ack_pending to: (concurrent_workers) × (processing_time_p99 / fetch_interval) × safety_margin. This ensures the limit reflects actual capacity.
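The sizing rule above can be sketched as a small helper. The function name, parameter names, and the example numbers are illustrative:

```go
package main

import (
	"fmt"
	"math"
)

// maxAckPending applies the sizing rule described above:
// workers × (P99 processing time / fetch interval) × safety margin,
// rounded up to a whole message count.
func maxAckPending(workers int, p99Seconds, fetchIntervalSeconds, safetyMargin float64) int {
	raw := float64(workers) * (p99Seconds / fetchIntervalSeconds) * safetyMargin
	return int(math.Ceil(raw))
}

func main() {
	// 10 workers, 200ms P99 processing time, fetching every 100ms,
	// with a 2x safety margin.
	fmt.Println(maxAckPending(10, 0.2, 0.1, 2.0)) // 40
}
```

Re-run the measurement and the calculation whenever the workload or worker count changes; a limit sized for last quarter's traffic is a latent stall.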

Decouple acknowledgment from downstream completion. If the consumer writes to a database, consider acknowledging the NATS message once the data is durably queued in a local write-ahead buffer, rather than waiting for the full downstream write to complete. This trades exactly-once downstream delivery (which NATS doesn’t guarantee anyway) for significantly better ack throughput.
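A minimal sketch of this pattern, using a buffered channel as a stand-in for a durable write-ahead buffer and a hypothetical Msg type in place of the client's message type:

```go
package main

import "fmt"

// Msg is a hypothetical stand-in for the client message type; a real
// handler would call Ack() on the NATS message itself.
type Msg struct {
	Data  []byte
	acked bool
}

func (m *Msg) Ack() { m.acked = true }

// enqueueAndAck acks as soon as the payload is queued locally, freeing
// the ack pending slot without waiting for the downstream write. A real
// implementation would persist the buffer before acking; if the buffer
// is full, the message is left unacked so it redelivers after ack_wait.
func enqueueAndAck(buf chan<- []byte, m *Msg) bool {
	select {
	case buf <- m.Data:
		m.Ack()
		return true
	default:
		return false
	}
}

func main() {
	buf := make(chan []byte, 2)
	m := &Msg{Data: []byte("event")}
	fmt.Println(enqueueAndAck(buf, m), m.acked) // true true
}
```

A background worker would drain the buffer into the database; the tradeoff is that a crash between ack and drain loses anything not yet persisted, which is why the buffer itself must be durable.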

Implement adaptive concurrency. Monitor your consumer’s ack pending ratio and automatically adjust worker pool size. When ack pending exceeds 50% of the limit, scale up workers. When it drops below 20%, scale down to conserve resources.
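The scaling policy described above reduces to a simple decision function. This sketch adjusts one worker at a time; the step size, bounds handling, and names are illustrative:

```go
package main

import "fmt"

// adjustWorkers implements the policy above: scale up when ack pending
// exceeds 50% of the limit, scale down below 20%, clamped to [min, max].
func adjustWorkers(current, min, max, ackPending, maxAckPending int) int {
	ratio := float64(ackPending) / float64(maxAckPending)
	switch {
	case ratio > 0.5 && current < max:
		return current + 1
	case ratio < 0.2 && current > min:
		return current - 1
	}
	return current
}

func main() {
	fmt.Println(adjustWorkers(4, 1, 16, 600, 1000)) // 5 (scale up)
	fmt.Println(adjustWorkers(4, 1, 16, 100, 1000)) // 3 (scale down)
	fmt.Println(adjustWorkers(4, 1, 16, 300, 1000)) // 4 (hold)
}
```

In practice this runs on a timer against the consumer info polled from the server, with hysteresis between the two thresholds preventing the pool from oscillating.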

Frequently asked questions

What happens when max_ack_pending is reached?

The server stops delivering new messages to the consumer. Messages continue arriving in the stream and accumulate as num_pending, but no new deliveries occur. Delivery resumes only when in-flight messages are acknowledged, nak’d, or expire past ack_wait. This is an abrupt stop, not a gradual slowdown — the consumer goes from receiving messages to receiving nothing.

What is a good value for max_ack_pending?

It depends on your consumer’s processing parallelism and latency. For a consumer with 10 worker threads and 100ms average processing time, 1,000 is a reasonable starting point — it provides 10 seconds of buffer. For high-throughput consumers with fast processing, 5,000-10,000 may be appropriate. The key principle: set it high enough to absorb latency variance but low enough that hitting the limit is a meaningful signal, not a catastrophic surprise.

How is ack pending different from consumer lag?

Ack pending counts messages that have been delivered to the consumer but not yet acknowledged — they’re in-flight. Consumer lag (often shown as num_pending) counts messages in the stream that haven’t been delivered yet. A consumer can have zero lag but high ack pending (all messages delivered, none acknowledged) or high lag but zero ack pending (delivery stalled at the limit). Both metrics together tell the full story.

Can I have multiple consumers pull from the same durable to reduce ack pending pressure?

Yes. Multiple instances can pull from the same pull-based durable consumer. The server distributes messages across instances, effectively multiplying your processing throughput. Each instance contributes to the shared ack pending pool, so the same max_ack_pending limit applies across all instances. This is the simplest way to scale consumer throughput horizontally.

Why do my ack pending counts spike during deployments?

During a rolling deployment, consumer instances restart sequentially. When an instance shuts down, its in-flight messages remain in ack pending until ack_wait expires — they can’t be redelivered to another instance until then. With a 30-second ack_wait and 500 in-flight messages per instance, a restart temporarily locks 500 ack pending slots for 30 seconds. Use graceful shutdown that acknowledges or nak’s in-flight messages before stopping to avoid this.

Proactive monitoring for NATS ack pending buildup with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial