A last ack critical alert fires when a JetStream consumer has not acknowledged any message within the operator-defined time window. This threshold is set via the io.nats.monitor.last-ack-critical metadata key on the consumer configuration. When the elapsed time since the last acknowledgment exceeds this value, the consumer’s downstream processing pipeline is considered stalled — messages may still be arriving, but nothing is being successfully processed and confirmed.
Acknowledgments are the fundamental signal that work is actually getting done. A consumer can receive messages, a subscriber can be connected, delivery timestamps can look healthy — but if no messages are being acknowledged, no work is completing. The ack is the only proof that a message was successfully processed and can be removed from the consumer’s pending set.
When acks stop flowing, the consequences compound rapidly. The consumer’s ack pending count climbs toward its max_ack_pending limit. Once that limit is reached, the server stops delivering new messages entirely, creating a complete processing halt. Meanwhile, the ack_wait timer ticks down on each unacknowledged message, eventually triggering redeliveries. Those redeliveries consume server resources, increase network traffic, and — if the processing issue isn’t resolved — fail again, creating a redelivery spiral.
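To make these interactions concrete, here is a minimal sketch using the Go JetStream client (the stream name, consumer name, and specific limits are placeholder assumptions, not values from this runbook) showing the consumer settings that drive this cycle:

```go
// Sketch: the consumer knobs behind the stall-and-redelivery cycle.
// nc is an existing *nats.Conn; names and limits are placeholders.
js, err := nc.JetStream()
if err != nil {
	log.Fatal(err)
}

_, err = js.AddConsumer("ORDERS", &nats.ConsumerConfig{
	Durable:       "order-processor",
	AckPolicy:     nats.AckExplicitPolicy,
	AckWait:       30 * time.Second, // unacked messages become eligible for redelivery after this
	MaxAckPending: 1000,             // delivery halts once this many messages await acks
	MaxDeliver:    5,                // stop redelivering a message after this many attempts
})
if err != nil {
	log.Fatal(err)
}
```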
The business impact depends on the workload. For an order processing pipeline, ack stalls mean orders are received but never fulfilled. For event-driven analytics, it means data gaps in dashboards and reports. For workflow orchestration, it means tasks stuck in an intermediate state with no forward progress.
What makes ack stalls particularly dangerous is that they often indicate a downstream dependency failure rather than a NATS issue. The consumer application may be running, connected, and receiving messages, but unable to complete processing because a database is down, an API is unreachable, or a disk is full. The ack gap is a symptom of a problem that lives outside the messaging layer — making it easy to miss if you’re only monitoring NATS infrastructure health.
The last-ack-critical threshold provides a direct signal that processing has stalled, regardless of the root cause, enabling operators to investigate before the ack pending limit triggers a complete delivery halt.
Downstream dependency failure. The consumer application processes messages by writing to a database, calling an API, or interacting with another service. If that dependency is unavailable, processing fails and no acks are sent. The consumer may continue receiving messages (filling its ack pending buffer) or may have stopped fetching to apply backpressure.
Application crash or deadlock. The consumer process has crashed, frozen, or entered a deadlocked state. The NATS connection may still appear active briefly (TCP keepalive hasn’t fired yet), but the application is no longer processing messages or sending acks.
Ack logic bug. The application processes messages successfully but fails to send the acknowledgment due to a code error — an early return before msg.Ack(), an exception in a finally block, or a conditional path that skips the ack. This is surprisingly common in complex processing pipelines with error handling branches.
Max ack pending reached with all messages stuck. Every message in the ack pending window has been delivered but not acknowledged. The consumer application may be processing them very slowly, or it may have given up on them without nacking. The server waits for acks or ack_wait expiry before redelivering.
Consumer paused or subscribers disconnected. If no subscriber is actively pulling or receiving messages, no deliveries occur and therefore no acks occur. This overlaps with last delivery critical (CONSUMER_009) but the ack perspective catches cases where delivery happened just before the stall.
Network partition between consumer and ack target. In rare cases, the consumer sends acks but they don’t reach the server due to a network issue. The server sees no acks; the consumer thinks processing is succeeding. This typically resolves when the connection is detected as broken and reconnect occurs.
Processing time exceeds ack_wait. Messages are being processed, but each one takes longer than the ack_wait timeout. The server considers them unacknowledged and redelivers them before the consumer finishes processing. The consumer may be acking the original delivery, but the ack arrives too late. This manifests as both ack gaps and high redelivery counts.
Start by inspecting the consumer’s ack state:

```bash
nats consumer info STREAM_NAME CONSUMER_NAME
```

Look at the Last Ack timestamp and Num Ack Pending. If the last ack is older than your threshold and ack pending is at or near max_ack_pending, the consumer is stalled.
Then compare delivery activity against ack activity:

```bash
nats consumer info STREAM_NAME CONSUMER_NAME -j | \
  jq '{last_delivery: .delivered.last_active, last_ack: .ack_floor.last_active, num_ack_pending: .num_ack_pending, num_redelivered: .num_redelivered}'
```

If last_delivery is recent but last_ack is stale, messages are being delivered but not processed. If both are stale, the consumer isn’t receiving messages either (see CONSUMER_009).
Verify that the consumer application process is alive and connected:
```bash
nats server report connections --name "consumer-app-name"
```

If the consumer application doesn’t appear in the connection list, it has disconnected. Check your process manager or orchestrator.
```bash
# Check how many messages are awaiting ack
nats consumer info STREAM_NAME CONSUMER_NAME -j | \
  jq '{num_ack_pending: .num_ack_pending, max_ack_pending: .config.max_ack_pending}'
```

If num_ack_pending equals max_ack_pending, delivery is blocked. No new messages will be delivered until acks are received.
The same checks can be run programmatically from your own tooling:

```go
// Go: check last ack age and pending state
js, _ := nc.JetStream()
ci, _ := js.ConsumerInfo("STREAM_NAME", "CONSUMER_NAME")

threshold := 5 * time.Minute

// AckFloor.Last is a *time.Time and stays nil until the first ack arrives
if ci.AckFloor.Last != nil {
	lastAck := time.Since(*ci.AckFloor.Last)

	if lastAck > threshold {
		fmt.Printf("CRITICAL: consumer %s last ack was %s ago\n",
			ci.Name, lastAck.Round(time.Second))
		fmt.Printf("  Num Ack Pending: %d / %d (max)\n",
			ci.NumAckPending, ci.Config.MaxAckPending)
		fmt.Printf("  Num Redelivered: %d\n", ci.NumRedelivered)

		if ci.NumAckPending >= ci.Config.MaxAckPending {
			fmt.Println("  ⚠ Max ack pending reached — delivery is BLOCKED")
		}
	}
}
```

```python
# Python: check last ack age
import nats
from datetime import datetime, timezone

nc = await nats.connect()
js = nc.jetstream()

ci = await js.consumer_info("STREAM_NAME", "CONSUMER_NAME")
last_ack = ci.ack_floor.last
now = datetime.now(timezone.utc)
age = (now - last_ack).total_seconds()

if age > 300:  # 5 minutes
    print(f"CRITICAL: consumer {ci.name} last ack was {age:.0f}s ago")
    print(f"  Num Ack Pending: {ci.num_ack_pending}")
    print(f"  Num Redelivered: {ci.num_redelivered}")
```

Check and restore downstream dependencies. If the consumer’s processing depends on a database, API, or external service, verify that dependency is healthy. Restoring the dependency usually results in acks resuming immediately as the backlog of delivered messages gets processed.
Restart the consumer application if it’s crashed or deadlocked:
```bash
# Check if the consumer application process is still running
# If not, restart it through your process manager
systemctl restart consumer-app
# or
kubectl rollout restart deployment/consumer-app
```

If ack pending is at max and messages are truly stuck, nack them to trigger redelivery.
This is a last resort — it reprocesses messages that may have partially completed:
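One way to do this, sketched here with the Go client, is to have the application nak any delivery it cannot finish instead of holding it; the stream name, consumer name, and batch size below are placeholder assumptions, and processOrder stands in for your handler logic.

```go
// Sketch: bind to the stalled durable consumer and nak deliveries that cannot
// be completed, so the server redelivers immediately instead of waiting out
// ack_wait. Names and the batch size of 50 are placeholders.
sub, err := js.PullSubscribe("", "CONSUMER_NAME", nats.BindStream("STREAM_NAME"))
if err != nil {
	log.Fatal(err)
}

msgs, err := sub.Fetch(50, nats.MaxWait(5*time.Second))
if err != nil {
	log.Fatal(err)
}
for _, msg := range msgs {
	if err := processOrder(msg.Data); err != nil {
		msg.Nak() // redeliver now; counts toward max_deliver if configured
		continue
	}
	msg.Ack()
}
```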
```bash
# Reducing max_ack_pending temporarily can also help clear the backlog
# by forcing the consumer to process fewer messages concurrently
```

Fix ack logic bugs. Audit every code path in the message handler to ensure msg.Ack() (or msg.Nak() / msg.Term()) is called. Use defer patterns to guarantee ack execution:
```go
sub, _ := js.Subscribe("orders.>", func(msg *nats.Msg) {
	// Ensure we always respond — ack on success, nak on failure
	var err error
	defer func() {
		if err != nil {
			msg.Nak()
		} else {
			msg.Ack()
		}
	}()

	err = processOrder(msg.Data)
})
```

Increase ack_wait if processing legitimately takes a long time. If your message processing takes 2 minutes but ack_wait is 30 seconds, the server redelivers before processing completes. Set ack_wait to comfortably exceed your worst-case processing time:
```bash
nats consumer edit STREAM_NAME CONSUMER_NAME --wait 5m
```

Use in-progress acknowledgments for long-running processing. Send msg.InProgress() periodically to reset the ack_wait timer without completing the ack:
```go
// Keep extending ack_wait while the long-running job is in flight,
// then stop the ticker once the work is done and acked.
done := make(chan struct{})
go func() {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			msg.InProgress()
		case <-done:
			return
		}
	}
}()

result := longRunningProcess(msg.Data)
close(done)
msg.Ack()
```

Implement health checks that verify end-to-end processing, not just connectivity. A consumer that’s connected to NATS but can’t reach its database is functionally dead. Health checks should probe the entire processing path.
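As a minimal sketch of such a check (the /healthz path, the HTTP handler, and the db handle are assumptions, not part of this runbook), a readiness endpoint can probe both the NATS connection and the downstream dependency:

```go
// Sketch: readiness endpoint that fails when the processing path fails.
// nc is an existing *nats.Conn; db is a hypothetical *sql.DB.
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()

	// NATS connectivity alone is not enough; probe the dependency too.
	if !nc.IsConnected() {
		http.Error(w, "nats disconnected", http.StatusServiceUnavailable)
		return
	}
	if err := db.PingContext(ctx); err != nil {
		http.Error(w, "database unreachable", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
})
```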
Set up alerting on ack age across all consumers:
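One hedged sketch of such alerting, using the Go client (the stream name and the fixed 5-minute threshold are assumptions), polls every consumer on a stream and emits a log line your alerting pipeline can pick up:

```go
// Sketch: run this on a schedule and route the log output to your alerting.
// "ORDERS" and the 5-minute threshold are placeholders.
threshold := 5 * time.Minute
for ci := range js.Consumers("ORDERS") {
	if ci.AckFloor.Last == nil {
		continue // consumer has never acked anything; handle per your policy
	}
	if age := time.Since(*ci.AckFloor.Last); age > threshold {
		log.Printf("ALERT: consumer %s last ack %s ago (pending %d/%d)",
			ci.Name, age.Round(time.Second),
			ci.NumAckPending, ci.Config.MaxAckPending)
	}
}
```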
Use Synadia Insights for automated threshold monitoring. Insights evaluates the last-ack-critical metadata across your entire fleet every collection epoch, correlating ack gaps with other consumer health signals like redelivery rates and pending buildup.
Add circuit breakers for downstream dependencies. When a dependency fails, stop fetching new messages rather than accumulating them in ack pending. This prevents hitting max_ack_pending and keeps redelivery counts low.
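A minimal sketch of that pattern (assuming a pull subscription sub bound to the consumer and a hypothetical dependencyHealthy helper) gates each fetch on dependency health:

```go
// Sketch: stop fetching while the dependency is down so unprocessable
// messages never enter the ack pending window. dependencyHealthy and the
// timings are placeholder assumptions.
for {
	if !dependencyHealthy() {
		time.Sleep(5 * time.Second) // circuit open: back off instead of fetching
		continue
	}
	msgs, err := sub.Fetch(10, nats.MaxWait(2*time.Second))
	if err != nil {
		continue // includes fetch timeouts when the stream is idle
	}
	for _, msg := range msgs {
		if err := processOrder(msg.Data); err != nil {
			msg.Nak()
			continue
		}
		msg.Ack()
	}
}
```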
Set the io.nats.monitor.last-ack-critical threshold as consumer metadata:
```bash
nats consumer add STREAM_NAME CONSUMER_NAME \
  --metadata "io.nats.monitor.last-ack-critical=5m"
```

Choose a threshold based on your processing SLA. If messages should be processed within 30 seconds under normal conditions, a 5-minute threshold gives enough buffer for transient delays while catching real stalls.
Ack pending buildup (CONSUMER_001) measures the count of messages awaiting acknowledgment relative to a threshold. Last ack critical measures the time since any ack was received. A consumer could have a moderate ack pending count (not triggering buildup alerts) but still have an ack gap if all pending messages are stuck. The two checks complement each other: buildup measures volume, last ack measures recency.
Delivery can look healthy while acks are stale, and that combination is a key diagnostic signal. It means the server is delivering messages but the consumer application is not processing them successfully. This points to an application-level problem — a downstream dependency failure, a processing bug, or resource exhaustion in the consumer process.
This check does not apply to consumers configured with AckNone, which do not send acknowledgments by design; the last-ack-critical threshold is only meaningful for consumers with AckExplicit or AckAll policies. Setting this metadata on an AckNone consumer would produce a permanent alert.
When a delivered message is never acknowledged, the server marks it for redelivery and delivers it again to the next available subscriber. That repeat delivery is counted as a redelivery in the consumer’s stats. If the redelivery also fails to produce an ack, the cycle repeats until max_deliver is reached (if configured), at which point the message is dropped from the consumer’s perspective.