Consumer ACK floor divergence occurs when the gap between a JetStream consumer’s delivered sequence and its acknowledged (ACK) floor grows disproportionately large. The ACK floor is the highest contiguous sequence number where every message at or below it has been acknowledged. When messages are acknowledged out of order, the ACK floor stalls even though later messages have been processed. The server must then track each individually acknowledged message between the floor and the delivered position in memory, creating pressure that grows with the size of the gap.
JetStream tracks consumer progress using two key markers: the delivered sequence (the highest message sent to the consumer) and the ACK floor (the highest contiguous acknowledged sequence). In an ideal processing pipeline, these two values advance roughly in lockstep — the consumer receives messages, processes them, and acknowledges them in order. The gap between them stays close to the max_ack_pending limit.
When acknowledgments arrive out of order, the ACK floor cannot advance past the lowest unacknowledged message. If message 100 is delivered but messages 101–10,000 are all acknowledged while message 100 remains pending, the ACK floor stays at 99. The server must maintain an in-memory bitmap of those 9,900 individual acknowledgments. This bitmap is also replicated to consumer replicas and persisted during snapshots.
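The mechanics above can be sketched in a few lines of Go. This is a simplified model of the floor computation for illustration, not the server's actual data structure; the function name and scenario (message 100 pending, 101–10,000 acked) are assumptions drawn from the example in the text:

```go
package main

import "fmt"

// ackFloor models the bookkeeping: the floor is the highest sequence where
// every message at or below it is acknowledged; anything acked above the
// floor must be tracked individually until the hole below it is filled.
func ackFloor(firstPending uint64, acked map[uint64]bool) (floor uint64, tracked int) {
	floor = firstPending - 1
	for acked[floor+1] {
		floor++
	}
	for seq := range acked {
		if seq > floor {
			tracked++ // acked out of order: held individually in memory
		}
	}
	return floor, tracked
}

func main() {
	// Message 100 is delivered but never acked; 101..10000 are all acked.
	acked := make(map[uint64]bool)
	for seq := uint64(101); seq <= 10000; seq++ {
		acked[seq] = true
	}
	floor, tracked := ackFloor(100, acked)
	fmt.Println(floor, tracked) // prints "99 9900": floor stuck, 9900 acks tracked
}
```

Acknowledging message 100 in this model would let the floor sweep forward through the entire contiguous run, dropping all 9,900 tracked entries at once.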
The operational consequences compound quickly. Memory consumption on the consumer’s leader server grows linearly with the number of tracked individual ACKs. During server restart or leader failover, consumer state recovery takes longer because the entire ACK bitmap must be reconstructed. Replica catch-up after a transient partition carries more data. And if the divergence is caused by messages that will never be acknowledged (poison messages, expired processing), the gap grows indefinitely until an operator intervenes.
For operators, ACK floor divergence is an early warning of a processing pipeline that’s partially broken — the consumer appears healthy because messages are being delivered and most are being acknowledged, but a subset is silently failing. Without this check, the degradation goes unnoticed until memory pressure causes server-level symptoms.
Out-of-order processing with selective acknowledgment. Multiple worker goroutines or threads process messages concurrently. Fast messages are acknowledged immediately while slow messages (database writes, external API calls) remain pending. This is the most common cause in parallelized consumer architectures.
Poison messages that repeatedly fail processing. A message that causes a processing error is redelivered, fails again, and is never acknowledged. Meanwhile, all subsequent messages are processed normally. The ACK floor is permanently stuck at the poison message’s sequence number until it’s explicitly terminated via nats consumer next --term or the max delivery count is reached.
Slow processing of specific message types. In a heterogeneous message stream, some message types take significantly longer to process (e.g., large payloads, messages requiring external enrichment). If these are interspersed with fast messages, the slow ones create ACK gaps while fast messages advance the delivered position.
Consumer client crashes during processing. A consumer receives a batch of messages, partially processes them, and crashes. On reconnect, the consumer resumes from the ACK floor (not from the crash point), but the server still has the unacknowledged messages between the floor and the last delivered sequence tracked as pending.
AckExplicit policy with insufficient max_ack_pending. When max_ack_pending is set high (or unlimited), the consumer can have a very large number of in-flight messages. If even a small percentage fail to acknowledge, the absolute gap between the ACK floor and the delivered position can be enormous.
Long ACK wait timeouts. If the consumer’s ack_wait is set very long (e.g., 30 minutes or more), messages linger in pending state for extended periods, widening the gap even during normal processing.
Inspect the consumer’s delivered and ACK floor positions:
```shell
nats consumer info STREAM CONSUMER
```

Look at the Ack Floor and Last Delivered fields. The gap between their stream sequence numbers indicates the divergence. A healthy consumer has a gap close to max_ack_pending. A diverged consumer has a gap many multiples of max_ack_pending, or exceeding tens of thousands.
```shell
nats consumer info STREAM CONSUMER --json | jq '{
  ack_floor_stream_seq: .ack_floor.stream_seq,
  delivered_stream_seq: .delivered.stream_seq,
  gap: (.delivered.stream_seq - .ack_floor.stream_seq),
  num_ack_pending: .num_ack_pending,
  num_redelivered: .num_redelivered,
  max_ack_pending: .config.max_ack_pending
}'
```

If gap is significantly larger than num_ack_pending, many messages in the gap have been individually acknowledged but the floor hasn’t advanced because of holes. If gap roughly equals num_ack_pending, the consumer is simply behind on all messages in the range.
Find the specific messages that are blocking the ACK floor from advancing:
```shell
# Check pending messages (unacknowledged)
nats consumer info STREAM CONSUMER --json | jq '.cluster, .num_ack_pending, .num_redelivered, .num_pending'
```

The oldest pending messages are the ones holding back the ACK floor. Check their sequence numbers and ages — messages with very old delivery timestamps relative to newer pending messages indicate selective processing failures.
```go
package main

import (
	"context"
	"fmt"

	"github.com/nats-io/nats.go/jetstream"
)

func checkACKFloorDivergence(js jetstream.JetStream, stream, consumer string, maxGapRatio float64) error {
	ci, err := js.Consumer(context.Background(), stream, consumer)
	if err != nil {
		return err
	}
	info, err := ci.Info(context.Background())
	if err != nil {
		return err
	}

	gap := info.Delivered.Stream - info.AckFloor.Stream
	maxPending := info.Config.MaxAckPending
	if maxPending == 0 {
		maxPending = 65536 // server default
	}

	ratio := float64(gap) / float64(maxPending)
	if ratio > maxGapRatio {
		fmt.Printf("WARN: consumer %s/%s ack floor divergence: gap=%d max_ack_pending=%d ratio=%.1f\n",
			stream, consumer, gap, maxPending, ratio)
	}
	return nil
}
```

```python
import nats
from nats.js import JetStreamContext


async def check_ack_floor_divergence(
    js: JetStreamContext,
    stream: str,
    consumer: str,
    max_gap_ratio: float = 5.0,
):
    info = await js.consumer_info(stream, consumer)
    gap = info.delivered.stream_seq - info.ack_floor.stream_seq
    max_pending = info.config.max_ack_pending or 65536

    ratio = gap / max_pending
    if ratio > max_gap_ratio:
        return {
            "consumer": f"{stream}/{consumer}",
            "gap": gap,
            "max_ack_pending": max_pending,
            "ratio": round(ratio, 1),
            "num_redelivered": info.num_redelivered,
        }
    return None
```

Terminate poison messages. If specific messages are blocking the floor because they consistently fail processing, terminate them to remove them from the pending set:
```shell
# Inspect the consumer's pending state to find the stuck floor
nats consumer info STREAM CONSUMER --json | jq '{ack_floor: .ack_floor, num_ack_pending: .num_ack_pending, num_redelivered: .num_redelivered}'
```
```shell
# Terminate the next pending message (counts as acknowledged, never redelivered)
nats consumer next STREAM CONSUMER --term
```

Alternatively, configure max_deliver on the consumer to automatically give up on messages after a set number of redelivery attempts:
```shell
nats consumer edit STREAM CONSUMER --max-deliver 5
```

Pause and resume the consumer. If the divergence is extreme and the backlog of tracked ACKs is causing memory pressure, consider pausing the consumer, allowing existing pending messages to expire via ack_wait, and then resuming. This resets the tracking state at the cost of reprocessing some messages.
Process messages in order within partitions. If your consumer uses multiple workers, partition work by a key (e.g., subject or message header) so that messages within a partition are always processed in order. This ensures the ACK floor advances steadily for each partition:
```go
// Partition workers by subject hash (hash, workers, and numWorkers
// are defined elsewhere in the application)
sub, _ := js.Subscribe("orders.*", func(msg *nats.Msg) {
	// Route to a worker based on subject, ensuring serial processing per partition
	workerIdx := hash(msg.Subject) % numWorkers
	workers[workerIdx] <- msg
}, nats.ManualAck())
```

Reduce ack_wait to match actual processing time. If ack_wait is much longer than typical processing time, failed messages take too long to be redelivered and re-enter the pending pool. Set ack_wait to a reasonable multiple (2–3x) of the 99th percentile processing time:
```shell
nats consumer edit STREAM CONSUMER --wait 30s
```

Implement dead-letter handling. Instead of relying on max_deliver alone, set up an advisory-based dead-letter pattern. When messages exceed max delivery, capture them to a dead-letter stream for manual investigation rather than letting them silently poison the consumer state.
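The server publishes a max-deliveries advisory whenever a message exhausts its delivery attempts, and a dead-letter handler can key off it. The sketch below shows the subject construction and payload parsing only; the stream and consumer names are placeholders, and the subscription/republish wiring (fetch the message by its stream sequence, publish it to a dead-letter stream) is described in comments rather than implemented:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// advisorySubject builds the per-consumer max-deliveries advisory subject.
func advisorySubject(stream, consumer string) string {
	return fmt.Sprintf("$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.%s.%s", stream, consumer)
}

// maxDeliveriesAdvisory holds the fields a dead-letter handler needs
// from the advisory payload.
type maxDeliveriesAdvisory struct {
	Stream     string `json:"stream"`
	Consumer   string `json:"consumer"`
	StreamSeq  uint64 `json:"stream_seq"`
	Deliveries int    `json:"deliveries"`
}

func parseAdvisory(data []byte) (maxDeliveriesAdvisory, error) {
	var adv maxDeliveriesAdvisory
	err := json.Unmarshal(data, &adv)
	return adv, err
}

func main() {
	// A handler subscribes to this subject, then fetches the poison
	// message by adv.StreamSeq and republishes it to a dead-letter
	// stream for manual investigation.
	fmt.Println(advisorySubject("ORDERS", "workers"))

	sample := []byte(`{"stream":"ORDERS","consumer":"workers","stream_seq":100,"deliveries":5}`)
	adv, _ := parseAdvisory(sample)
	fmt.Println(adv.StreamSeq, adv.Deliveries)
}
```

Because advisories are plain NATS messages, the handler itself can be a small always-on service, and the dead-letter stream gives you a durable record of every message that hit max_deliver.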
Switch to AckAll policy. If your processing pipeline can tolerate batch acknowledgment — meaning if you acknowledge message N, all messages before N are also considered acknowledged — use AckAll instead of AckExplicit. This eliminates the ACK floor divergence problem entirely because every acknowledgment advances the floor:
```shell
nats consumer add STREAM CONSUMER \
  --ack all \
  --max-pending 1000 \
  --deliver all
```

The tradeoff: if the consumer crashes after processing message N but before acknowledging it, messages between the old floor and N will be redelivered. Your processing logic must be idempotent.
Right-size max_ack_pending. A lower max_ack_pending limits how far the delivered position can get ahead of acknowledgments, bounding the maximum possible divergence. Set it to match your actual processing concurrency — there’s no benefit to having 65,536 messages in flight if you only have 10 workers.
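As a concrete starting point, one rule of thumb (an assumption for illustration, not an official formula) is to allow roughly two pull batches in flight per worker:

```go
package main

import "fmt"

// suggestedMaxAckPending is a sizing heuristic: enough in-flight messages
// to keep every worker busy across two fetch round-trips, and never fewer
// than one message per worker.
func suggestedMaxAckPending(workers, batchSize int) int {
	v := workers * batchSize * 2
	if v < workers {
		v = workers
	}
	return v
}

func main() {
	// 10 workers, each pulling batches of 25 messages.
	fmt.Println(suggestedMaxAckPending(10, 25)) // prints 500
}
```

A value in this range bounds the worst-case divergence at a few hundred sequences rather than the 65,536 server default.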
The server maintains an in-memory set of individually acknowledged sequences within the gap. Each entry is small (a sequence number), but at scale — gaps of hundreds of thousands — the memory and CPU cost for set operations becomes measurable. More importantly, consumer state snapshots grow larger, making leader elections and replica catch-up slower. The impact is proportional to the absolute gap size, not the ratio.
No. Messages are not lost due to ACK floor divergence. Unacknowledged messages will be redelivered after ack_wait expires, up to max_deliver times. The risk is operational: memory pressure, slow recovery, and potential masking of processing failures where messages are silently failing and being redelivered in a loop.
ACK pending buildup (CONSUMER_003) measures the count of currently unacknowledged messages. ACK floor divergence measures the gap between the contiguous acknowledgment point and the delivery point. You can have low ACK pending (most messages acknowledged) but high ACK floor divergence (acknowledgments are scattered with holes). The divergence check specifically catches the interleaved-ACK pattern that pending count alone doesn’t reveal.
Not directly. The ACK floor advances only through contiguous acknowledgments. The practical options are: terminate the blocking messages (so the floor advances past them), let them expire via ack_wait plus max_deliver, or delete and recreate the consumer with a deliver_policy that starts from the desired position. Recreating the consumer is the cleanest approach when the divergence is severe.