A WorkQueue stream with discard: new rejects incoming messages when the stream is full. This is by design — it provides backpressure to publishers. But when consumers on that stream have aggressive settings — a short ack_wait or a low max_deliver — messages can time out, be redelivered, and eventually be permanently discarded before they're ever successfully processed. The stream stays full, publishers are rejected, and messages that were in the stream are lost. Neither the publisher nor the consumer sees an explicit error. This is silent data loss.
The interaction between three settings creates a trap:
retention: workqueue — Messages are deleted after acknowledgment. Each message is processed exactly once (or discarded after exceeding delivery limits).
discard: new — When the stream hits its max_msgs or max_bytes limit, new publishes are rejected rather than old messages being evicted. This protects existing messages at the cost of publisher throughput.
Low max_deliver or short ack_wait on consumers — If a consumer fails to acknowledge a message within ack_wait, the message is redelivered. After max_deliver attempts, the message is terminally discarded from the WorkQueue.
The dangerous scenario: a consumer is temporarily slow or experiencing errors. Messages time out (ack_wait exceeded), get redelivered, time out again. After max_deliver attempts (say, 3), the message is permanently discarded. Meanwhile, the stream is at capacity and rejecting new publishes. The result: old messages are silently lost through the delivery limit, and new messages can’t enter the stream. The system appears stuck, but data is actually being destroyed.
With discard: old (the default), this interaction is less dangerous because the stream evicts the oldest unprocessed messages to make room for new ones. You still lose messages, but the loss is visible — old messages disappear and new ones flow in. With discard: new, the loss is hidden behind redelivery exhaustion.
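To make the trap concrete, here is a minimal nats.go sketch that reproduces the dangerous combination. The stream name, subjects, limits, and consumer name are all hypothetical, and the snippet assumes an established connection nc plus the usual time and nats.go imports:

```go
js, _ := nc.JetStream()

// Stream: WorkQueue retention plus discard: new (the defensive half of the trap).
js.AddStream(&nats.StreamConfig{
	Name:      "ORDERS",             // hypothetical
	Subjects:  []string{"orders.>"}, // hypothetical
	Retention: nats.WorkQueuePolicy, // delete messages on ack
	Discard:   nats.DiscardNew,      // reject publishes when full
	MaxMsgs:   10_000,
})

// Consumer: aggressive delivery settings (the other half of the trap).
js.AddConsumer("ORDERS", &nats.ConsumerConfig{
	Durable:    "worker", // hypothetical
	AckPolicy:  nats.AckExplicitPolicy,
	AckWait:    5 * time.Second, // shorter than worst-case processing time
	MaxDeliver: 3,               // exhausted within ~15s of a dependency outage
})
```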
Default consumer settings with defensive stream config. An operator configures the stream with discard: new to prevent data loss from eviction but doesn’t consider how consumer delivery limits interact with that policy.
Copy-pasted consumer configurations. max_deliver: 3 and ack_wait: 5s are common in example configurations and tutorials. They’re reasonable for idempotent operations where retries are cheap, but dangerous on a WorkQueue with discard: new.
Transient consumer failures. A downstream dependency (database, API) goes down. The consumer can’t process messages. With ack_wait: 5s and max_deliver: 3, every in-flight message is permanently discarded within 15 seconds of the dependency failure.
Consumer processing time exceeds ack_wait. The consumer takes 10 seconds to process a message but ack_wait is set to 5 seconds. Every message times out mid-processing, gets redelivered (potentially causing duplicate processing), and eventually exhausts max_deliver.
Scaling down consumers without adjusting settings. Reducing the number of consumer instances increases per-instance message volume. Processing time per message increases, exceeding ack_wait and triggering the redelivery-to-discard cascade.
First, identify WorkQueue streams configured with discard: new:

```bash
nats stream list --json | jq '.[] | select(.config.retention == "workqueue" and .config.discard == "new") | {name: .config.name, max_msgs: .config.max_msgs, max_bytes: .config.max_bytes}'
```

For each WorkQueue stream identified above, examine consumer configurations:
```bash
nats consumer list MY_WORKQUEUE --json | jq '.[] | {name: .config.name, ack_wait: .config.ack_wait, max_deliver: .config.max_deliver}'
```

Flag consumers where ack_wait is under 30 seconds or max_deliver is under 10.
Check the stream’s num_deleted counter — on a WorkQueue, this includes messages discarded after exceeding max_deliver:
```bash
nats stream info MY_WORKQUEUE --json | jq '{messages: .state.messages, deleted: .state.num_deleted, consumers: .state.consumer_count}'
```

A high num_deleted count on a WorkQueue combined with discard: new is a strong indicator that messages are being lost through delivery exhaustion.
High redelivery rates indicate consumers are failing to ack within ack_wait:
```bash
nats consumer info MY_WORKQUEUE MY_CONSUMER --json | jq '{ack_pending: .num_ack_pending, redelivered: .num_redelivered, waiting: .num_waiting}'
```

If num_redelivered is high relative to total processed messages, the ack_wait is likely too aggressive.
Publishers will see "maximum messages exceeded" or "maximum bytes exceeded" errors when the stream is full:
```bash
# Check for JS API errors
nats server report jetstream --json | jq '.[] | {server: .name, api_errors: .stats.api.errors}'
```

Simultaneous publisher rejections and consumer redeliveries confirm the trap is active.
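The rejection can also be detected from the publisher's side in code. A minimal nats.go sketch, assuming an established JetStream context js; matching on the error description is an assumption here, since the exact error-code constant varies across client versions:

```go
// Publish and detect a discard: new rejection when the stream is full.
_, err := js.Publish("orders.new", []byte("...")) // subject is hypothetical
if err != nil {
	var apiErr *nats.APIError
	if errors.As(err, &apiErr) && strings.Contains(apiErr.Description, "maximum messages exceeded") {
		// The stream is full: back off, retry later, or alert.
		log.Printf("publish rejected, stream full: %v", apiErr)
	}
}
```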
The most direct fix is to give consumers more time and more attempts:
```bash
# Update consumer settings
nats consumer edit MY_WORKQUEUE MY_CONSUMER --wait 60s --max-deliver 20
```

Recommended minimums for WorkQueue streams with discard: new:
ack_wait: At least 30 seconds, preferably 60 seconds or more. It should be at least 2x the maximum expected processing time.

max_deliver: At least 10, preferably 20. This gives the system time to recover from transient failures.

In Go:
```go
js, _ := nc.JetStream()
_, err := js.AddConsumer("MY_WORKQUEUE", &nats.ConsumerConfig{
	Durable:    "MY_CONSUMER",
	AckPolicy:  nats.AckExplicitPolicy,
	AckWait:    60 * time.Second,
	MaxDeliver: 20,
})
```

In Python:
```python
import nats
from nats.js.api import ConsumerConfig

nc = await nats.connect()
js = nc.jetstream()

await js.add_consumer(
    "MY_WORKQUEUE",
    config=ConsumerConfig(
        durable_name="MY_CONSUMER",
        ack_wait=60,  # seconds
        max_deliver=20,
    ),
)
```

For long-running processing, send periodic InProgress (work-in-progress) acknowledgments to prevent ack_wait timeouts. In Go:
```go
sub, _ := js.PullSubscribe("", "MY_CONSUMER", nats.BindStream("MY_WORKQUEUE"))
msgs, _ := sub.Fetch(1)
for _, msg := range msgs {
	// Process each message in its own goroutine, heartbeating while it runs.
	go func(m *nats.Msg) {
		ticker := time.NewTicker(10 * time.Second)
		defer ticker.Stop()
		done := make(chan struct{})

		go func() {
			processMessage(m) // long-running work
			close(done)
		}()

		for {
			select {
			case <-ticker.C:
				m.InProgress() // reset the ack_wait timer
			case <-done:
				m.Ack()
				return
			}
		}
	}(msg)
}
```

In Python:

```python
import asyncio

async def process_with_heartbeat(msg):
    """Process message with periodic InProgress acks."""
    async def heartbeat():
        while True:
            await asyncio.sleep(10)
            await msg.in_progress()

    task = asyncio.create_task(heartbeat())
    try:
        await process_message(msg)  # long-running work
        await msg.ack()
    finally:
        task.cancel()
```

If the primary concern is that new messages shouldn't be lost (rather than old messages), switch to discard: old:
```bash
nats stream edit MY_WORKQUEUE --discard old
```

With discard: old, when the stream is full, the oldest message is evicted to make room. This loses old unprocessed messages but ensures new messages are always accepted. The trade-off depends on whether old or new messages are more valuable to your use case.
Use max_deliver with a dead letter (advisory) handler so that messages exceeding delivery limits are captured rather than silently lost:
```go
// Subscribe to delivery-exceeded advisories
nc.Subscribe("$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.MY_WORKQUEUE.MY_CONSUMER", func(msg *nats.Msg) {
	log.Printf("Message exceeded max deliveries: %s", string(msg.Data))
	// Store in dead letter stream, alert ops team, etc.
})
```
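The advisory carries only metadata, not the message body. A hedged sketch of going one step further and copying the message into a separate dead-letter stream: the DLQ subject is hypothetical, and the GetMsg call can race with the WorkQueue removing the message, so the error must be handled:

```go
// Fields from the io.nats.jetstream.advisory.v1.max_deliver advisory payload.
type maxDeliverAdvisory struct {
	Stream    string `json:"stream"`
	Consumer  string `json:"consumer"`
	StreamSeq uint64 `json:"stream_seq"`
}

nc.Subscribe("$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.MY_WORKQUEUE.MY_CONSUMER", func(m *nats.Msg) {
	var adv maxDeliverAdvisory
	if err := json.Unmarshal(m.Data, &adv); err != nil {
		return
	}
	// Try to read the original message before it disappears from the WorkQueue.
	raw, err := js.GetMsg(adv.Stream, adv.StreamSeq)
	if err != nil {
		log.Printf("message %d already gone: %v", adv.StreamSeq, err)
		return
	}
	// Republish into a dead-letter stream for later inspection (subject hypothetical).
	js.Publish("dlq."+adv.Consumer, raw.Data)
})
```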
Ensure the stream's max_msgs or max_bytes is large enough that transient consumer slowdowns don't immediately fill the stream and trigger publisher rejections:

```bash
# Give more headroom
nats stream edit MY_WORKQUEUE --max-msgs 1000000 --max-bytes 10GB
```

The individual settings are each valid and useful. discard: new prevents silent eviction of old messages. Low max_deliver prevents infinite retry loops. Short ack_wait enables fast redelivery. The problem is the interaction between all three, which depends on the operational context (consumer reliability, processing time, failure patterns). Insights detects this specific combination and alerts you proactively.
The discard: new policy only activates when the stream reaches its limits. If the stream never fills up, publishers aren’t rejected and the consumer delivery settings are the only concern. But WorkQueue streams tend toward fullness under load because messages accumulate when consumers are slow — exactly the scenario where aggressive delivery settings cause the most damage.
The right max_deliver depends on your failure modes. If consumer failures are transient (dependency timeouts, temporary errors), a higher max_deliver (10–20) gives the system time to recover. If failures are permanent (bad message format, missing data), no amount of retries will help, and you need a dead letter mechanism instead. For most WorkQueue patterns, max_deliver: 20 with ack_wait: 60s is a safe starting point.
Explicit nak with a backoff delay is a good alternative: it gives you more control than ack_wait timeouts, because the consumer can decide how long to wait before redelivery based on the specific error. This is preferable to relying on ack_wait for redelivery timing, but you should still set a generous ack_wait as a safety net for cases where the consumer crashes without sending a nak. See the sketch below.
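A minimal nats.go sketch of the pattern using the client's NakWithDelay; it assumes the pull subscription sub from the earlier example, and the process function and the 30-second delay are illustrative:

```go
msgs, _ := sub.Fetch(1)
for _, m := range msgs {
	if err := process(m); err != nil {
		// Ask the server to redeliver after an explicit delay instead of
		// waiting for ack_wait to expire.
		m.NakWithDelay(30 * time.Second)
		continue
	}
	m.Ack()
}
```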
Push and pull consumers share the same underlying mechanics: both are subject to ack_wait timeouts and max_deliver limits. Pull consumers have an advantage, though. Since the client explicitly fetches messages, it has more control over how many messages are in flight. Push consumers can have messages delivered faster than they can process them, making ack_wait timeouts more likely under load.
With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.