JetStream streams accumulate messages based on their publish rate and retention policy. When the message count in a stream crosses an operator-defined threshold — set via io.nats.monitor.msgs-warn or io.nats.monitor.msgs-critical metadata tags — this check fires. The threshold direction is inferred automatically: if msgs-warn is lower than msgs-critical, the check treats both as upper bounds (too many messages). If msgs-warn is higher than msgs-critical, the check treats them as lower bounds (too few messages). This flexibility supports both overflow detection and upstream dryness monitoring from a single check.
An unexpectedly high message count signals that messages are accumulating faster than they are being consumed or removed. This is a leading indicator of several operational problems. Storage will eventually be exhausted, either hitting the stream’s configured limit (causing oldest messages to be discarded under limits-based retention) or consuming available JetStream storage on the server (affecting all streams on that server). High message counts also slow down consumer recovery — a consumer that needs to replay from the beginning of a stream with millions of messages faces significant latency before reaching the current state.
An unexpectedly low message count is equally concerning. If a stream that normally holds hundreds of thousands of messages suddenly drops to near-zero, it typically means upstream publishers have stopped, the stream’s retention policy is too aggressive, or an operator purged the stream accidentally. For streams that serve as the source of truth for downstream consumers, a low message count means those consumers have less historical data available for replay.
The message count threshold check provides a bidirectional guardrail: it alerts when the stream’s volume deviates from the operator’s expectation in either direction. This makes it a versatile tool for monitoring stream health across a wide range of use cases — from high-volume event streams to low-volume command streams where every message matters.
Common causes of an unexpectedly high message count:

Consumers are not keeping up. Under work-queue retention, messages are removed only after acknowledgment. If consumers are slow, paused, or crashed, messages accumulate. This is the most common cause of message count growth in work-queue streams.
Retention policy is too permissive. A limits-based stream with no max-messages or max-bytes configuration will grow indefinitely. If the stream was created without explicit limits, messages accumulate until JetStream storage is exhausted.
Publish rate spike. A burst of upstream traffic — batch imports, backfills, or a runaway publisher — injects messages faster than the retention policy removes them. The message count spikes temporarily.
Interest-based retention with no consumers. Streams using interest-based retention only remove messages when all consumers have acknowledged them. If no consumers are defined, messages are never removed.
Consumer acknowledgment failures. A consumer is running but failing to acknowledge messages (application bugs, downstream errors). Under work-queue retention, unacknowledged messages remain in the stream indefinitely.
Common causes of an unexpectedly low message count:

Upstream publishers stopped. The services that publish to the stream are down, misconfigured, or have been redirected to a different subject. No new messages arrive, and the retention policy continues to remove old ones.
Accidental stream purge. An operator or automated process ran nats stream purge on the stream, removing all messages. In production, this can happen during incident response when the wrong stream is targeted.
Retention policy too aggressive. The stream’s max-age, max-messages, or max-bytes limits are configured too tightly, causing messages to be discarded before they are expected to expire.
Subject filter mismatch. The stream’s subject bindings were changed, and incoming messages no longer match the stream’s subject filter. The stream appears dry even though publishers are active on the old subjects.
Start by inspecting the stream's current state and its configured monitor thresholds:

```bash
# Current state: message count, bytes, and sequence range
nats stream info MY_STREAM --json | jq '{messages: .state.messages, bytes: .state.bytes, first_seq: .state.first_seq, last_seq: .state.last_seq}'

# Configured monitor thresholds
nats stream info MY_STREAM --json | jq '.config.metadata | with_entries(select(.key | startswith("io.nats.monitor.msgs")))'

# Sample message count over time
for i in $(seq 1 5); do
  echo "$(date '+%H:%M:%S') $(nats stream info MY_STREAM --json | jq '.state.messages')"
  sleep 10
done
```

A growing count with a flat or absent consumer delivery rate indicates consumer problems. A shrinking count with no publisher activity indicates upstream issues.
Check consumer health:

```bash
nats consumer ls MY_STREAM
nats consumer info MY_STREAM CONSUMER_NAME
```

Look at num_pending (messages waiting to be delivered) and num_ack_pending (delivered but not acknowledged). High values indicate the consumer is not processing messages effectively.
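The same lag figures are available programmatically. A minimal sketch using the nats.go client; the stream and consumer names are placeholders:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a local server; adjust the URL for your deployment.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// MY_STREAM and CONSUMER_NAME are placeholders.
	ci, err := js.ConsumerInfo("MY_STREAM", "CONSUMER_NAME")
	if err != nil {
		log.Fatal(err)
	}

	// NumPending: messages not yet delivered to this consumer.
	// NumAckPending: delivered but awaiting acknowledgment.
	log.Printf("pending=%d ack_pending=%d", ci.NumPending, ci.NumAckPending)
}
```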
```bash
# Check if messages are being published to the stream's subjects
nats sub "my.subject.>" --count 5 --timeout 10s
```

If no messages arrive within the timeout, publishers are not active on the stream's subjects.
To script the same threshold comparison, query the stream state via the JetStream API:

```go
// Assumes an established *nats.Conn named nc.
js, err := nc.JetStream()
if err != nil {
	log.Fatal(err)
}
info, err := js.StreamInfo("MY_STREAM")
if err != nil {
	log.Fatal(err)
}

msgsWarn := uint64(100000)
msgsCritical := uint64(500000)

if info.State.Msgs > msgsCritical {
	log.Printf("CRITICAL: stream has %d messages (critical threshold: %d)",
		info.State.Msgs, msgsCritical)
} else if info.State.Msgs > msgsWarn {
	log.Printf("WARNING: stream has %d messages (warn threshold: %d)",
		info.State.Msgs, msgsWarn)
}
```

Fix the consumers first. If consumers are paused, crashed, or slow, restoring consumer health is the primary fix. Messages will be processed and (under work-queue retention) removed:
```bash
# Check consumer state
nats consumer info MY_STREAM CONSUMER_NAME

# If the consumer is paused, resume it
nats consumer resume MY_STREAM CONSUMER_NAME
```

Add retention limits. If the stream has no message or byte limits, add them to prevent unbounded growth:
```bash
nats stream edit MY_STREAM --max-msgs 1000000 --max-bytes 10GB
```

Purge selectively. If immediate cleanup is needed, purge messages older than a certain sequence or by subject:
```bash
# Purge messages up to a specific sequence
nats stream purge MY_STREAM --seq 50000

# Purge by subject
nats stream purge MY_STREAM --subject "events.debug.>"
```

Scale consumers. If the consumer cannot keep up with the publish rate, add more consumer instances using pull consumers or queue groups to increase throughput, as sketched below.
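As an illustration of the pull-consumer approach: every additional copy of this worker fetches from the same durable, so pending messages are spread across instances. The stream, durable, and subject names here are placeholders for your own:

```go
package main

import (
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Bind to a shared durable pull consumer; running more copies of
	// this process spreads the pending messages across instances.
	sub, err := js.PullSubscribe("my.subject.>", "WORKERS", nats.BindStream("MY_STREAM"))
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Fetch a batch; MaxWait bounds how long we block when idle.
		msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
		if err != nil {
			if errors.Is(err, nats.ErrTimeout) {
				continue // no messages available right now
			}
			log.Fatal(err)
		}
		for _, msg := range msgs {
			// ... process msg ...
			msg.Ack() // under work-queue retention, ack removes the message
		}
	}
}
```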
Verify publisher health. For an unexpectedly low count, first confirm that upstream services are running and publishing to the correct subjects. A quick smoke test:
```bash
# Publish a test message
nats pub my.subject "test message"

# Verify it appears in the stream
nats stream info MY_STREAM --json | jq '.state.messages'
```

Check subject configuration. Ensure the stream's subjects match what publishers are sending:
```bash
nats stream info MY_STREAM --json | jq '.config.subjects'
```

Review retention policy. If messages are being removed too quickly, relax the retention limits:
```bash
nats stream edit MY_STREAM --max-age 72h --max-msgs 500000
```

Configure warn and critical thresholds:
```bash
# Upper bounds: warn at 100k, critical at 500k
nats stream edit MY_STREAM \
  --metadata "io.nats.monitor.msgs-warn=100000" \
  --metadata "io.nats.monitor.msgs-critical=500000"

# Lower bounds: warn below 1000, critical below 100
nats stream edit MY_STREAM \
  --metadata "io.nats.monitor.msgs-warn=1000" \
  --metadata "io.nats.monitor.msgs-critical=100"
```

The direction (upper vs. lower bound) is inferred from the relative values: if warn < critical, they are upper bounds; if warn > critical, they are lower bounds.
Combine with other stream checks. Use message count thresholds alongside Stream Byte Limit (JETSTREAM_013), Stream Message Limit (JETSTREAM_014), and Storage Pressure (JETSTREAM_002) for comprehensive stream capacity monitoring.
The check compares msgs-warn and msgs-critical values. If msgs-warn < msgs-critical, both are treated as upper bounds — the check fires when the message count exceeds either value. If msgs-warn > msgs-critical, both are treated as lower bounds — the check fires when the message count drops below either value. This allows the same metadata keys to serve both “too many” and “too few” use cases without additional configuration.
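A minimal sketch of that inference rule, reading the thresholds from a stream's metadata map. The evaluate helper and its states are illustrative, not the check's actual implementation:

```go
package main

import (
	"fmt"
	"strconv"
)

// evaluate applies the documented direction rule: warn < critical means
// upper bounds (too many messages), warn > critical means lower bounds
// (too few messages). This is a hypothetical helper for illustration.
func evaluate(metadata map[string]string, msgs uint64) string {
	warn, errW := strconv.ParseUint(metadata["io.nats.monitor.msgs-warn"], 10, 64)
	crit, errC := strconv.ParseUint(metadata["io.nats.monitor.msgs-critical"], 10, 64)
	if errW != nil || errC != nil {
		return "unconfigured"
	}
	if warn < crit { // upper bounds
		switch {
		case msgs > crit:
			return "critical"
		case msgs > warn:
			return "warning"
		}
	} else { // lower bounds
		switch {
		case msgs < crit:
			return "critical"
		case msgs < warn:
			return "warning"
		}
	}
	return "ok"
}

func main() {
	meta := map[string]string{
		"io.nats.monitor.msgs-warn":     "1000",
		"io.nats.monitor.msgs-critical": "100",
	}
	// Lower-bound configuration: 50 messages is below both thresholds.
	fmt.Println(evaluate(meta, 50)) // prints "critical"
}
```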
Use both. Message count thresholds monitor the number of messages, which is useful when each message represents a discrete work item (orders, events, commands). Byte-based limits monitor total storage consumption, which matters when message sizes vary significantly. A stream might have a normal message count but abnormal storage use if message sizes increased. The two approaches complement each other.
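For instance, tracking the average message size (bytes divided by message count, both from stream info) surfaces the size-drift case described above. A small sketch, with MY_STREAM as a placeholder:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	info, err := js.StreamInfo("MY_STREAM")
	if err != nil {
		log.Fatal(err)
	}

	// Guard against division by zero on an empty stream.
	if info.State.Msgs > 0 {
		avg := float64(info.State.Bytes) / float64(info.State.Msgs)
		log.Printf("avg message size: %.0f bytes across %d messages", avg, info.State.Msgs)
	}
}
```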
Under limits-based retention, the oldest messages are automatically discarded to make room for new ones. The stream’s message count stays at or near the limit. The message-count threshold check is separate from the stream’s built-in limits — it fires based on operator-defined thresholds that may be set below the stream’s hard limit to provide early warning before messages start being discarded.
Yes. If a runaway publisher is injecting messages at an abnormally high rate, the message count will grow faster than normal. An upper-bound threshold (msgs-warn/msgs-critical) catches this growth before it becomes a storage emergency. For more immediate detection of publish rate anomalies, combine this check with request-rate monitoring via the JetStream API metrics.
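As a lighter-weight approximation than full API metrics, the publish rate can be estimated by sampling the stream's last sequence twice; a sketch under that assumption:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	before, err := js.StreamInfo("MY_STREAM")
	if err != nil {
		log.Fatal(err)
	}
	time.Sleep(10 * time.Second)
	after, err := js.StreamInfo("MY_STREAM")
	if err != nil {
		log.Fatal(err)
	}

	// last_seq only increases, so its delta counts every message stored
	// during the interval, regardless of what retention removed.
	rate := float64(after.State.LastSeq-before.State.LastSeq) / 10.0
	log.Printf("approximate publish rate: %.1f msgs/sec", rate)
}
```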
Review thresholds quarterly or whenever the stream’s traffic pattern changes significantly — new publishers, changes in message size, retention policy adjustments, or consumer scaling events. Thresholds that were appropriate at one traffic level may be too tight or too loose after a system change. Synadia Insights evaluates these thresholds continuously, so changes take effect on the next collection epoch.