A WorkQueue stream with discard: new rejects incoming messages when the stream is full. This is by design — it provides backpressure to publishers. But when consumers on that stream have aggressive settings — a short ack_wait or a low max_deliver — messages can time out, be redelivered, and eventually be permanently discarded before they're ever successfully processed. The stream stays full, publishers are rejected, and messages that were in the stream are lost. Neither the publisher nor the consumer sees an explicit error. This is silent data loss.
The interaction between three settings creates a trap:
retention: workqueue — Messages are deleted after acknowledgment. Each message is processed exactly once (or discarded after exceeding delivery limits).
discard: new — When the stream hits its max_msgs or max_bytes limit, new publishes are rejected rather than old messages being evicted. This protects existing messages at the cost of publisher throughput.
Low max_deliver or short ack_wait on consumers — If a consumer fails to acknowledge a message within ack_wait, the message is redelivered. After max_deliver attempts, the message is terminally discarded from the WorkQueue.
The dangerous scenario: a consumer is temporarily slow or experiencing errors. Messages time out (ack_wait exceeded), get redelivered, time out again. After max_deliver attempts (say, 3), the message is permanently discarded. Meanwhile, the stream is at capacity and rejecting new publishes. The result: old messages are silently lost through the delivery limit, and new messages can’t enter the stream. The system appears stuck, but data is actually being destroyed.
With discard: old (the default), this interaction is less dangerous because the stream evicts the oldest unprocessed messages to make room for new ones. You still lose messages, but the loss is visible — old messages disappear and new ones flow in. With discard: new, the loss is hidden behind redelivery exhaustion.
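To make the trap concrete, here is a minimal nats.go sketch that reproduces the dangerous combination. The stream name, subjects, limits, and consumer name are all hypothetical, and the snippet assumes an established connection nc plus the usual time and nats.go imports:

```go
js, _ := nc.JetStream()

// Stream: WorkQueue retention plus discard: new (the defensive half of the trap).
js.AddStream(&nats.StreamConfig{
	Name:      "ORDERS",             // hypothetical
	Subjects:  []string{"orders.>"}, // hypothetical
	Retention: nats.WorkQueuePolicy, // delete messages on ack
	Discard:   nats.DiscardNew,      // reject publishes when full
	MaxMsgs:   10_000,
})

// Consumer: aggressive delivery settings (the other half of the trap).
js.AddConsumer("ORDERS", &nats.ConsumerConfig{
	Durable:    "worker", // hypothetical
	AckPolicy:  nats.AckExplicitPolicy,
	AckWait:    5 * time.Second, // shorter than worst-case processing time
	MaxDeliver: 3,               // exhausted within ~15s of a dependency outage
})
```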
Default consumer settings with defensive stream config. An operator configures the stream with discard: new to prevent data loss from eviction but doesn’t consider how consumer delivery limits interact with that policy.
Copy-pasted consumer configurations. max_deliver: 3 and ack_wait: 5s are common in example configurations and tutorials. They’re reasonable for idempotent operations where retries are cheap, but dangerous on a WorkQueue with discard: new.
Transient consumer failures. A downstream dependency (database, API) goes down. The consumer can’t process messages. With ack_wait: 5s and max_deliver: 3, every in-flight message is permanently discarded within 15 seconds of the dependency failure.
Consumer processing time exceeds ack_wait. The consumer takes 10 seconds to process a message but ack_wait is set to 5 seconds. Every message times out mid-processing, gets redelivered (potentially causing duplicate processing), and eventually exhausts max_deliver.
Scaling down consumers without adjusting settings. Reducing the number of consumer instances increases per-instance message volume. Processing time per message increases, exceeding ack_wait and triggering the redelivery-to-discard cascade.
First, identify WorkQueue streams configured with discard: new:

```bash
nats stream list --json | jq '.[] | select(.config.retention == "workqueue" and .config.discard == "new") | {name: .config.name, max_msgs: .config.max_msgs, max_bytes: .config.max_bytes}'
```

For each WorkQueue stream identified above, examine consumer configurations:
```bash
nats consumer list MY_WORKQUEUE --json | jq '.[] | {name: .config.name, ack_wait: .config.ack_wait, max_deliver: .config.max_deliver}'
```

Flag consumers where ack_wait is under 30 seconds or max_deliver is under 10.
Check the stream’s num_deleted counter — on a WorkQueue, this includes messages discarded after exceeding max_deliver:
```bash
nats stream info MY_WORKQUEUE --json | jq '{messages: .state.messages, deleted: .state.num_deleted, consumers: .state.consumer_count}'
```

A high num_deleted count on a WorkQueue combined with discard: new is a strong indicator that messages are being lost through delivery exhaustion.
High redelivery rates indicate consumers are failing to ack within ack_wait:
```bash
nats consumer info MY_WORKQUEUE MY_CONSUMER --json | jq '{ack_pending: .num_ack_pending, redelivered: .num_redelivered, waiting: .num_waiting}'
```

If num_redelivered is high relative to total processed messages, the ack_wait is likely too aggressive.
Publishers will see "maximum messages exceeded" or "maximum bytes exceeded" errors when the stream is full:
```bash
# Check for JS API errors
nats server report jetstream --json | jq '.[] | {server: .name, api_errors: .stats.api.errors}'
```

Simultaneous publisher rejections and consumer redeliveries confirm the trap is active.
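The rejection can also be detected from the publisher's side in code. A minimal nats.go sketch, assuming an established JetStream context js; matching on the error description is an assumption here, since the exact error-code constant varies across client versions:

```go
// Publish and detect a discard: new rejection when the stream is full.
_, err := js.Publish("orders.new", []byte("...")) // subject is hypothetical
if err != nil {
	var apiErr *nats.APIError
	if errors.As(err, &apiErr) && strings.Contains(apiErr.Description, "maximum messages exceeded") {
		// The stream is full: back off, retry later, or alert.
		log.Printf("publish rejected, stream full: %v", apiErr)
	}
}
```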
The most direct fix is to give consumers more time and more attempts:
```bash
# Update consumer settings
nats consumer edit MY_WORKQUEUE MY_CONSUMER --wait 60s --max-deliver 20
```

Recommended minimums for WorkQueue streams with discard: new:
ack_wait: At least 30 seconds, preferably 60 seconds or more. It should be at least 2x the maximum expected processing time.

max_deliver: At least 10, preferably 20. This gives the system time to recover from transient failures.

In Go:
```go
js, _ := nc.JetStream()
_, err := js.AddConsumer("MY_WORKQUEUE", &nats.ConsumerConfig{
	Durable:    "MY_CONSUMER",
	AckPolicy:  nats.AckExplicitPolicy,
	AckWait:    60 * time.Second,
	MaxDeliver: 20,
})
```

In Python:
```python
import nats
from nats.js.api import ConsumerConfig

nc = await nats.connect()
js = nc.jetstream()

await js.add_consumer(
    "MY_WORKQUEUE",
    config=ConsumerConfig(
        durable_name="MY_CONSUMER",
        ack_wait=60,  # seconds
        max_deliver=20,
    ),
)
```

For long-running processing, send periodic InProgress (work-in-progress) acknowledgments to prevent ack_wait timeouts. In Go:
```go
sub, _ := js.PullSubscribe("", "MY_CONSUMER", nats.BindStream("MY_WORKQUEUE"))
msgs, _ := sub.Fetch(1)
for _, msg := range msgs {
	// Process each message in its own goroutine, heartbeating while it runs.
	go func(m *nats.Msg) {
		ticker := time.NewTicker(10 * time.Second)
		defer ticker.Stop()
		done := make(chan struct{})

		go func() {
			processMessage(m) // long-running work
			close(done)
		}()

		for {
			select {
			case <-ticker.C:
				m.InProgress() // reset the ack_wait timer
			case <-done:
				m.Ack()
				return
			}
		}
	}(msg)
}
```

In Python:

```python
import asyncio

async def process_with_heartbeat(msg):
    """Process message with periodic InProgress acks."""
    async def heartbeat():
        while True:
            await asyncio.sleep(10)
            await msg.in_progress()

    task = asyncio.create_task(heartbeat())
    try:
        await process_message(msg)  # long-running work
        await msg.ack()
    finally:
        task.cancel()
```

If the primary concern is that new messages shouldn't be lost (rather than old messages), switch to discard: old:
```bash
nats stream edit MY_WORKQUEUE --discard old
```

With discard: old, when the stream is full, the oldest message is evicted to make room. This loses old unprocessed messages but ensures new messages are always accepted. The trade-off depends on whether old or new messages are more valuable to your use case.
Use max_deliver with a dead letter (advisory) handler so that messages exceeding delivery limits are captured rather than silently lost:
```go
// Subscribe to delivery-exceeded advisories
nc.Subscribe("$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.MY_WORKQUEUE.MY_CONSUMER", func(msg *nats.Msg) {
	log.Printf("Message exceeded max deliveries: %s", string(msg.Data))
	// Store in dead letter stream, alert ops team, etc.
})
```
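The advisory carries only metadata, not the message body. A hedged sketch of going one step further and copying the message into a separate dead-letter stream: the DLQ subject is hypothetical, and the GetMsg call can race with the WorkQueue removing the message, so the error must be handled:

```go
// Fields from the io.nats.jetstream.advisory.v1.max_deliver advisory payload.
type maxDeliverAdvisory struct {
	Stream    string `json:"stream"`
	Consumer  string `json:"consumer"`
	StreamSeq uint64 `json:"stream_seq"`
}

nc.Subscribe("$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.MY_WORKQUEUE.MY_CONSUMER", func(m *nats.Msg) {
	var adv maxDeliverAdvisory
	if err := json.Unmarshal(m.Data, &adv); err != nil {
		return
	}
	// Try to read the original message before it disappears from the WorkQueue.
	raw, err := js.GetMsg(adv.Stream, adv.StreamSeq)
	if err != nil {
		log.Printf("message %d already gone: %v", adv.StreamSeq, err)
		return
	}
	// Republish into a dead-letter stream for later inspection (subject hypothetical).
	js.Publish("dlq."+adv.Consumer, raw.Data)
})
```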
Ensure the stream's max_msgs or max_bytes is large enough that transient consumer slowdowns don't immediately fill the stream and trigger publisher rejections:

```bash
# Give more headroom
nats stream edit MY_WORKQUEUE --max-msgs 1000000 --max-bytes 10GB
```

The individual settings are each valid and useful. discard: new prevents silent eviction of old messages. Low max_deliver prevents infinite retry loops. Short ack_wait enables fast redelivery. The problem is the interaction between all three, which depends on the operational context (consumer reliability, processing time, failure patterns). Insights detects this specific combination and alerts you proactively.
The discard: new policy only activates when the stream reaches its limits. If the stream never fills up, publishers aren’t rejected and the consumer delivery settings are the only concern. But WorkQueue streams tend toward fullness under load because messages accumulate when consumers are slow — exactly the scenario where aggressive delivery settings cause the most damage.
The right max_deliver depends on your failure modes. If consumer failures are transient (dependency timeouts, temporary errors), a higher max_deliver (10–20) gives the system time to recover. If failures are permanent (bad message format, missing data), no amount of retries will help, and you need a dead letter mechanism instead. For most WorkQueue patterns, max_deliver: 20 with ack_wait: 60s is a safe starting point.
Explicit nak with a backoff delay is a good alternative: it gives you more control than ack_wait timeouts, because the consumer can decide how long to wait before redelivery based on the specific error. This is preferable to relying on ack_wait for redelivery timing, but you should still set a generous ack_wait as a safety net for cases where the consumer crashes without sending a nak. See the sketch below.
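A minimal nats.go sketch of the pattern using the client's NakWithDelay; it assumes the pull subscription sub from the earlier example, and the process function and the 30-second delay are illustrative:

```go
msgs, _ := sub.Fetch(1)
for _, m := range msgs {
	if err := process(m); err != nil {
		// Ask the server to redeliver after an explicit delay instead of
		// waiting for ack_wait to expire.
		m.NakWithDelay(30 * time.Second)
		continue
	}
	m.Ack()
}
```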
Push and pull consumers share the same underlying mechanics: both are subject to ack_wait timeouts and max_deliver limits. Pull consumers have an advantage, though. Since the client explicitly fetches messages, it has more control over how many messages are in flight. Push consumers can have messages delivered faster than they can process them, making ack_wait timeouts more likely under load.
With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.