
NATS Consumer ACK Floor Divergence: Causes, Diagnosis, and Fixes

Severity: Warning
Category: Errors
Applies to: JetStream
Check ID: OPT_SYS_015
Detection threshold: the gap between the delivered position and the ACK floor is disproportionately large relative to max_ack_pending, or exceeds absolute thresholds

Consumer ACK floor divergence occurs when the gap between a JetStream consumer’s delivered sequence and its acknowledged (ACK) floor grows disproportionately large. The ACK floor is the highest contiguous sequence number where every message at or below it has been acknowledged. When messages are acknowledged out of order, the ACK floor stalls even though later messages have been processed. The server must then track each individually acknowledged message between the floor and the delivered position in memory, creating pressure that grows with the size of the gap.

Why this matters

JetStream tracks consumer progress using two key markers: the delivered sequence (the highest message sent to the consumer) and the ACK floor (the highest contiguous acknowledged sequence). In an ideal processing pipeline, these two values advance roughly in lockstep: the consumer receives messages, processes them, and acknowledges them in order. The gap between them then stays at or below the max_ack_pending limit, because every message in the gap is still pending.

When acknowledgments arrive out of order, the ACK floor cannot advance past the lowest unacknowledged message. If message 100 is delivered but messages 101–10,000 are all acknowledged while message 100 remains pending, the ACK floor stays at 99. The server must maintain an in-memory bitmap of those 9,900 individual acknowledgments. This bitmap is also replicated to consumer replicas and persisted during snapshots.
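The arithmetic in this example can be checked with a small model. The sketch below is an illustration only, not the server's implementation: ack_floor is a hypothetical helper that walks upward from the current floor for as long as the next sequence is in the acknowledged set.

```python
def ack_floor(start_floor: int, delivered: int, acked: set) -> int:
    """Return the highest sequence such that every sequence in
    (start_floor, result] has been acknowledged."""
    floor = start_floor
    while floor + 1 <= delivered and (floor + 1) in acked:
        floor += 1
    return floor


# Scenario from the text: message 100 is pending, 101..10,000 are acknowledged.
acked = set(range(101, 10_001))
floor = ack_floor(start_floor=99, delivered=10_000, acked=acked)
tracked = len([seq for seq in acked if seq > floor])
print(floor, tracked)  # 99 9900

# Once message 100 is acknowledged (or terminated), the floor jumps to 10,000.
acked.add(100)
print(ack_floor(99, 10_000, acked))  # 10000
```

A single missing acknowledgment holds the floor at 99 no matter how many later messages succeed, which is why the tracked set grows with the gap rather than with the number of failures.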

The operational consequences compound quickly. Memory consumption on the consumer’s leader server grows linearly with the number of tracked individual ACKs. During server restart or leader failover, consumer state recovery takes longer because the entire ACK bitmap must be reconstructed. Replica catch-up after a transient partition carries more data. And if the divergence is caused by messages that will never be acknowledged (poison messages, expired processing), the gap grows indefinitely until an operator intervenes.

For operators, ACK floor divergence is an early warning of a processing pipeline that’s partially broken — the consumer appears healthy because messages are being delivered and most are being acknowledged, but a subset is silently failing. Without this check, the degradation goes unnoticed until memory pressure causes server-level symptoms.

Common causes

  • Out-of-order processing with selective acknowledgment. Multiple worker goroutines or threads process messages concurrently. Fast messages are acknowledged immediately while slow messages (database writes, external API calls) remain pending. This is the most common cause in parallelized consumer architectures.

  • Poison messages that repeatedly fail processing. A message that causes a processing error is redelivered, fails again, and is never acknowledged. Meanwhile, all subsequent messages are processed normally. The ACK floor is permanently stuck at the poison message’s sequence number until it’s explicitly terminated via nats consumer next --term or the max delivery count is reached.

  • Slow processing of specific message types. In a heterogeneous message stream, some message types take significantly longer to process (e.g., large payloads, messages requiring external enrichment). If these are interspersed with fast messages, the slow ones create ACK gaps while fast messages advance the delivered position.

  • Consumer client crashes during processing. A consumer receives a batch of messages, partially processes them, and crashes. On reconnect, the consumer resumes from the ACK floor (not from the crash point), but the server still has the unacknowledged messages between the floor and the last delivered sequence tracked as pending.

  • AckExplicit policy with an oversized max_ack_pending. When max_ack_pending is set high (or unlimited), the consumer can have a very large number of in-flight messages. If even a small percentage are never acknowledged, the absolute gap between the ACK floor and the delivered position can be enormous.

  • Long ACK wait timeouts. If the consumer’s ack_wait is set very long (e.g., 30 minutes or more), messages linger in pending state for extended periods, widening the gap even during normal processing.
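To see why a single stuck message is enough, the causes above can be reduced to a toy simulation. This is a deliberately simplified model, not server behavior: the simulate function is hypothetical, acknowledgments are treated as instantaneous, and redelivery is ignored. It shows that the gap can run far past max_ack_pending while the pending count stays tiny, because the limit caps unacknowledged messages, not the distance to the floor.

```python
def simulate(total: int, max_ack_pending: int, poison: set) -> dict:
    """Deliver up to `total` messages and acknowledge all except `poison`.

    Delivery stops once the number of unacknowledged messages reaches
    max_ack_pending, mirroring the consumer limit.
    """
    pending = set()
    floor = 0
    delivered = 0
    for seq in range(1, total + 1):
        if len(pending) >= max_ack_pending:
            break  # limit reached: nothing more is delivered
        delivered = seq
        if seq in poison:
            pending.add(seq)  # never acknowledged
        elif not pending:
            floor = seq  # contiguous ACK advances the floor
        # otherwise the ACK lands above a hole and the floor stays put
    return {
        "delivered": delivered,
        "ack_floor": floor,
        "gap": delivered - floor,
        "num_ack_pending": len(pending),
    }


print(simulate(total=10_000, max_ack_pending=10, poison={100}))
# {'delivered': 10000, 'ack_floor': 99, 'gap': 9901, 'num_ack_pending': 1}
```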

How to diagnose

Check consumer state

Inspect the consumer’s delivered and ACK floor positions:

nats consumer info STREAM CONSUMER

Look at the Ack Floor and Last Delivered fields. The gap between their stream sequence numbers indicates the divergence. A healthy consumer has a gap close to max_ack_pending. A diverged consumer has a gap many multiples of max_ack_pending, or exceeding tens of thousands.

Compare floor to pending count

nats consumer info STREAM CONSUMER --json | jq '{
  ack_floor_stream_seq: .ack_floor.stream_seq,
  delivered_stream_seq: .delivered.stream_seq,
  gap: (.delivered.stream_seq - .ack_floor.stream_seq),
  num_ack_pending: .num_ack_pending,
  num_redelivered: .num_redelivered,
  max_ack_pending: .config.max_ack_pending
}'

If gap is significantly larger than num_ack_pending, it means many messages in the gap have been individually acknowledged but the floor hasn’t advanced because of holes. If gap roughly equals num_ack_pending, the consumer is simply behind on all messages in the range.
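The same comparison can be scripted against the JSON output. The following is a minimal sketch: the SAMPLE payload is made up for illustration (only the field names follow the jq query above), and both the interpret helper and its hole_factor cut-off are assumptions rather than thresholds used by the check.

```python
import json

# Illustrative payload shaped like `nats consumer info --json`; values are invented.
SAMPLE = """
{
  "delivered": {"consumer_seq": 10000, "stream_seq": 10000},
  "ack_floor": {"consumer_seq": 99, "stream_seq": 99},
  "num_ack_pending": 1,
  "num_redelivered": 1,
  "config": {"max_ack_pending": 1000}
}
"""


def interpret(info: dict, hole_factor: float = 2.0) -> str:
    """Classify the gap between the delivered position and the ACK floor."""
    gap = info["delivered"]["stream_seq"] - info["ack_floor"]["stream_seq"]
    pending = info["num_ack_pending"]
    if gap <= 0:
        return "in sync"
    if gap > hole_factor * max(pending, 1):
        # Most of the gap is already acknowledged: the floor is stuck behind holes.
        return f"holes: gap={gap} but only {pending} pending"
    # The gap is roughly the pending count: the consumer is simply behind.
    return f"behind: gap={gap}, pending={pending}"


print(interpret(json.loads(SAMPLE)))  # holes: gap=9901 but only 1 pending
```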

Identify stuck messages

Find the specific messages that are blocking the ACK floor from advancing:

# Check pending messages (unacknowledged)
nats consumer info STREAM CONSUMER --json | jq '.cluster, .num_ack_pending, .num_redelivered, .num_pending'

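The oldest pending messages are the ones holding back the ACK floor. Consumer info reports only aggregate counts, so to inspect the blocking message itself, look at the first stream sequence above the floor: for an unfiltered consumer that is ack_floor.stream_seq + 1, which nats stream get STREAM <seq> will display. A message there that is much older than the most recently delivered ones points to a selective processing failure.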

Monitor programmatically

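Go:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

// checkACKFloorDivergence prints a warning when the gap between the delivered
// position and the ACK floor exceeds maxGapRatio times max_ack_pending.
func checkACKFloorDivergence(js jetstream.JetStream, stream, consumer string, maxGapRatio float64) error {
	ctx := context.Background()
	cons, err := js.Consumer(ctx, stream, consumer)
	if err != nil {
		return err
	}
	info, err := cons.Info(ctx)
	if err != nil {
		return err
	}

	gap := info.Delivered.Stream - info.AckFloor.Stream
	maxPending := info.Config.MaxAckPending
	if maxPending <= 0 {
		maxPending = 65536 // fallback when the limit is unset (0) or unlimited (-1)
	}

	ratio := float64(gap) / float64(maxPending)
	if ratio > maxGapRatio {
		fmt.Printf("WARN: consumer %s/%s ack floor divergence: gap=%d max_ack_pending=%d ratio=%.1f\n",
			stream, consumer, gap, maxPending, ratio)
	}
	return nil
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}
	// STREAM and CONSUMER are placeholders, as in the CLI examples.
	if err := checkACKFloorDivergence(js, "STREAM", "CONSUMER", 5.0); err != nil {
		log.Fatal(err)
	}
}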

Python:

from typing import Optional

from nats.js import JetStreamContext


async def check_ack_floor_divergence(
    js: JetStreamContext,
    stream: str,
    consumer: str,
    max_gap_ratio: float = 5.0,
) -> Optional[dict]:
    """Return divergence details, or None when the consumer is within bounds."""
    info = await js.consumer_info(stream, consumer)
    gap = info.delivered.stream_seq - info.ack_floor.stream_seq
    max_pending = info.config.max_ack_pending
    if not max_pending or max_pending <= 0:
        max_pending = 65536  # fallback when the limit is unset or unlimited (-1)

    ratio = gap / max_pending
    if ratio > max_gap_ratio:
        return {
            "consumer": f"{stream}/{consumer}",
            "gap": gap,
            "max_ack_pending": max_pending,
            "ratio": round(ratio, 1),
            "num_redelivered": info.num_redelivered,
        }
    return None

How to fix it

Immediate: clear the divergence

Terminate poison messages. If specific messages are blocking the floor because they consistently fail processing, terminate them to remove them from the pending set:

# Inspect the consumer's pending state to find the stuck floor
nats consumer info STREAM CONSUMER --json | jq '{ack_floor: .ack_floor, num_ack_pending: .num_ack_pending, num_redelivered: .num_redelivered}'
# Pull the next message and terminate it (counts as acknowledged, never redelivered).
# This acts on whichever message the server hands out next, so confirm it is the
# stuck one before relying on it.
nats consumer next STREAM CONSUMER --term

Alternatively, configure max_deliver on the consumer to automatically give up on messages after a set number of redelivery attempts:

nats consumer edit STREAM CONSUMER --max-deliver 5

Pause and resume the consumer. If the divergence is extreme and the backlog of tracked ACKs is causing memory pressure, consider pausing the consumer, allowing existing pending messages to expire via ack_wait, and then resuming. This resets the tracking state at the cost of reprocessing some messages.

Short-term: fix processing patterns

Process messages in order within partitions. If your consumer uses multiple workers, partition work by a key (e.g., subject or message header) so that messages within a partition are always processed in order. This ensures the ACK floor advances steadily for each partition:

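// Partition workers by subject hash so each partition is processed serially.
// Assumes imports of hash/fnv and log, and workers of type []chan *nats.Msg.
sub, err := js.Subscribe("orders.*", func(msg *nats.Msg) {
	h := fnv.New32a()
	h.Write([]byte(msg.Subject))
	workers[h.Sum32()%uint32(len(workers))] <- msg // the worker calls msg.Ack() when done
}, nats.ManualAck()) // the callback only routes; an ordered consumer would use AckNone and never acknowledge
if err != nil {
	log.Fatal(err)
}
defer sub.Unsubscribe()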

Reduce ack_wait to match actual processing time. If ack_wait is much longer than typical processing time, failed messages take too long to be redelivered and re-enter the pending pool. Set ack_wait to a reasonable multiple (2–3x) of the 99th percentile processing time:

nats consumer edit STREAM CONSUMER --wait 30s
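One way to derive that value from measurements is sketched below. The recommended_ack_wait helper, the nearest-rank percentile method, the 2.5x default multiplier, and the sample durations are all illustrative assumptions; only the "2–3x of the 99th percentile" guideline comes from the text above.

```python
def recommended_ack_wait(durations_s: list, multiplier: float = 2.5) -> float:
    """Return multiplier times the 99th-percentile duration (nearest-rank method)."""
    if not durations_s:
        raise ValueError("need at least one sample")
    ordered = sorted(durations_s)
    # Nearest-rank percentile: rank = ceil(0.99 * n), computed with integers
    # to avoid floating-point rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    p99 = ordered[rank - 1]
    return multiplier * p99


# Hypothetical processing times in seconds: mostly fast, with two slow outliers.
samples = [0.2] * 98 + [4.0, 12.0]
print(recommended_ack_wait(samples))  # 10.0
```

The result would then be rounded to a convenient duration and passed to --wait. A single extreme outlier (the 12-second sample here) does not drive the setting, which is the point of using a percentile rather than the maximum.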

Implement dead-letter handling. Instead of relying on max_deliver alone, set up an advisory-based dead-letter pattern. When messages exceed max delivery, capture them to a dead-letter stream for manual investigation rather than letting them silently poison the consumer state.
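As a starting point, the advisory side of this pattern can be sketched as a pure parsing step. JetStream publishes a max-deliveries advisory under $JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.<stream>.<consumer>; the sample body below is hand-written for illustration, and the DeadLetter type and parse_max_deliver_advisory helper are hypothetical names. Subscribing to the advisory subject, fetching the message by sequence, and publishing it to a dead-letter stream are left out.

```python
import json
from dataclasses import dataclass
from typing import Optional

ADVISORY_TYPE = "io.nats.jetstream.advisory.v1.max_deliver"


@dataclass(frozen=True)
class DeadLetter:
    stream: str
    consumer: str
    stream_seq: int
    deliveries: int


def parse_max_deliver_advisory(payload: bytes) -> Optional[DeadLetter]:
    """Return a reference to the exhausted message, or None for other payloads."""
    data = json.loads(payload)
    if data.get("type") != ADVISORY_TYPE:
        return None
    return DeadLetter(
        stream=data["stream"],
        consumer=data["consumer"],
        stream_seq=int(data["stream_seq"]),
        deliveries=int(data["deliveries"]),
    )


# Hand-written advisory body for illustration; names and values are invented.
sample = json.dumps({
    "type": ADVISORY_TYPE,
    "stream": "ORDERS",
    "consumer": "processor",
    "stream_seq": 100,
    "deliveries": 5,
}).encode()

print(parse_max_deliver_advisory(sample))
```

The stream_seq in the result identifies which message to copy into the dead-letter stream before it ages out of the source stream.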

Long-term: use AckAll policy where appropriate

Switch to AckAll policy. If your processing pipeline can tolerate batch acknowledgment — meaning if you acknowledge message N, all messages before N are also considered acknowledged — use AckAll instead of AckExplicit. This eliminates the ACK floor divergence problem entirely because every acknowledgment advances the floor:

nats consumer add STREAM CONSUMER \
  --ack all \
  --max-pending 1000 \
  --deliver all

The tradeoff: if the consumer crashes after processing message N but before acknowledging it, messages between the old floor and N will be redelivered. Your processing logic must be idempotent.
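A minimal sketch of what idempotent handling can look like is shown below. It keys on the stream sequence and keeps the applied set in memory purely for illustration; the IdempotentProcessor class is hypothetical, and a real pipeline would need to persist that record together with the side effect so it survives the crash.

```python
class IdempotentProcessor:
    """Applies each stream sequence at most once, even if it is redelivered."""

    def __init__(self) -> None:
        self.applied = set()
        self.total = 0

    def handle(self, stream_seq: int, amount: int) -> bool:
        if stream_seq in self.applied:
            return False  # redelivery: the side effect was already applied
        self.total += amount
        self.applied.add(stream_seq)
        return True


p = IdempotentProcessor()
for seq, amount in [(1, 10), (2, 5), (3, 7)]:
    p.handle(seq, amount)

# Simulated crash before the batch ACK: messages 1..3 arrive again, then 4.
for seq, amount in [(1, 10), (2, 5), (3, 7), (4, 1)]:
    p.handle(seq, amount)

print(p.total)  # 23
```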

Right-size max_ack_pending. A lower max_ack_pending limits how far the delivered position can get ahead of acknowledgments, bounding the maximum possible divergence. Set it to match your actual processing concurrency — there’s no benefit to having 65,536 messages in flight if you only have 10 workers.

Frequently asked questions

What’s the performance impact of a large ACK gap?

The server maintains an in-memory set of individually acknowledged sequences within the gap. Each entry is small (a sequence number), but at scale — gaps of hundreds of thousands — the memory and CPU cost for set operations becomes measurable. More importantly, consumer state snapshots grow larger, making leader elections and replica catch-up slower. The impact is proportional to the absolute gap size, not the ratio.

Does ACK floor divergence cause message loss?

No. Messages are not lost due to ACK floor divergence. Unacknowledged messages will be redelivered after ack_wait expires, up to max_deliver times. The risk is operational: memory pressure, slow recovery, and potential masking of processing failures where messages are silently failing and being redelivered in a loop.

How does this differ from ACK pending buildup?

ACK pending buildup (CONSUMER_003) measures the count of currently unacknowledged messages. ACK floor divergence measures the gap between the contiguous acknowledgment point and the delivery point. You can have low ACK pending (most messages acknowledged) but high ACK floor divergence (acknowledgments are scattered with holes). The divergence check specifically catches the interleaved-ACK pattern that pending count alone doesn’t reveal.

Can I reset the ACK floor without deleting the consumer?

Not directly. The ACK floor advances only through contiguous acknowledgments. The practical options are: terminate the blocking messages (so the floor advances past them), let them expire via ack_wait plus max_deliver, or delete and recreate the consumer with a deliver_policy that starts from the desired position. Recreating the consumer is the cleanest approach when the divergence is severe.
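For the recreate option, the replacement consumer's configuration might be assembled along these lines. This is a sketch under assumptions: recreate_config is a hypothetical helper, the default limits are placeholders, and the choice of start_seq is left to the operator. The keys follow the JetStream consumer configuration JSON, where ack_wait is expressed in nanoseconds.

```python
def recreate_config(durable: str, start_seq: int,
                    max_ack_pending: int = 1000, ack_wait_s: int = 30) -> dict:
    """Build a consumer config that starts delivery at a chosen stream sequence."""
    if start_seq < 1:
        raise ValueError("start_seq must be a positive stream sequence")
    return {
        "durable_name": durable,
        "deliver_policy": "by_start_sequence",
        "opt_start_seq": start_seq,
        "ack_policy": "explicit",
        "max_ack_pending": max_ack_pending,
        "ack_wait": ack_wait_s * 1_000_000_000,  # nanoseconds
    }


# Hypothetical example: resume just above an old floor of 99.
print(recreate_config("CONSUMER", start_seq=100))
```

Starting just above the old floor means messages that were already acknowledged individually will be delivered again, so this choice assumes idempotent processing; starting later skips whatever was still unacknowledged.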

