JetStream streams accumulate messages based on their publish rate and retention policy. When the message count in a stream crosses an operator-defined threshold — set via io.nats.monitor.msgs-warn or io.nats.monitor.msgs-critical metadata tags — this check fires. The threshold direction is inferred automatically: if msgs-warn is lower than msgs-critical, the check treats both as upper bounds (too many messages). If msgs-warn is higher than msgs-critical, the check treats them as lower bounds (too few messages). This flexibility supports both overflow detection and upstream dryness monitoring from a single check.
An unexpectedly high message count signals that messages are accumulating faster than they are being consumed or removed. This is a leading indicator of several operational problems. Storage will eventually be exhausted, either hitting the stream’s configured limit (causing oldest messages to be discarded under limits-based retention) or consuming available JetStream storage on the server (affecting all streams on that server). High message counts also slow down consumer recovery — a consumer that needs to replay from the beginning of a stream with millions of messages faces significant latency before reaching the current state.
An unexpectedly low message count is equally concerning. If a stream that normally holds hundreds of thousands of messages suddenly drops to near-zero, it typically means upstream publishers have stopped, the stream’s retention policy is too aggressive, or an operator purged the stream accidentally. For streams that serve as the source of truth for downstream consumers, a low message count means those consumers have less historical data available for replay.
The message count threshold check provides a bidirectional guardrail: it alerts when the stream’s volume deviates from the operator’s expectation in either direction. This makes it a versatile tool for monitoring stream health across a wide range of use cases — from high-volume event streams to low-volume command streams where every message matters.
Common causes of an unexpectedly high message count:

Consumers are not keeping up. Under work-queue retention, messages are removed only after acknowledgment. If consumers are slow, paused, or crashed, messages accumulate. This is the most common cause of message count growth in work-queue streams.
Retention policy is too permissive. A limits-based stream with no max-messages or max-bytes configuration will grow indefinitely. If the stream was created without explicit limits, messages accumulate until JetStream storage is exhausted.
Publish rate spike. A burst of upstream traffic — batch imports, backfills, or a runaway publisher — injects messages faster than the retention policy removes them. The message count spikes temporarily.
Interest-based retention with no consumers. Streams using interest-based retention only remove messages when all consumers have acknowledged them. If no consumers are defined, messages are never removed.
Consumer acknowledgment failures. A consumer is running but failing to acknowledge messages (application bugs, downstream errors). Under work-queue retention, unacknowledged messages remain in the stream indefinitely.
Common causes of an unexpectedly low message count:

Upstream publishers stopped. The services that publish to the stream are down, misconfigured, or have been redirected to a different subject. No new messages arrive, and the retention policy continues to remove old ones.
Accidental stream purge. An operator or automated process ran nats stream purge on the stream, removing all messages. In production, this can happen during incident response when the wrong stream is targeted.
Retention policy too aggressive. The stream’s max-age, max-messages, or max-bytes limits are configured too tightly, causing messages to be discarded before they are expected to expire.
Subject filter mismatch. The stream’s subject bindings were changed, and incoming messages no longer match the stream’s subject filter. The stream appears dry even though publishers are active on the old subjects.
Start by inspecting the stream's current state and its configured monitor thresholds:

```bash
# Current state: message count, bytes, and sequence range
nats stream info MY_STREAM --json | jq '{messages: .state.messages, bytes: .state.bytes, first_seq: .state.first_seq, last_seq: .state.last_seq}'

# Configured monitor thresholds
nats stream info MY_STREAM --json | jq '.config.metadata | with_entries(select(.key | startswith("io.nats.monitor.msgs")))'

# Sample message count over time
for i in $(seq 1 5); do
  echo "$(date '+%H:%M:%S') $(nats stream info MY_STREAM --json | jq '.state.messages')"
  sleep 10
done
```

A growing count with a flat or absent consumer delivery rate indicates consumer problems. A shrinking count with no publisher activity indicates upstream issues.
Check consumer health:

```bash
nats consumer ls MY_STREAM
nats consumer info MY_STREAM CONSUMER_NAME
```

Look at num_pending (messages waiting to be delivered) and num_ack_pending (delivered but not acknowledged). High values indicate the consumer is not processing messages effectively.
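The same lag figures are available programmatically. A minimal sketch using the nats.go client; the stream and consumer names are placeholders:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a local server; adjust the URL for your deployment.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// MY_STREAM and CONSUMER_NAME are placeholders.
	ci, err := js.ConsumerInfo("MY_STREAM", "CONSUMER_NAME")
	if err != nil {
		log.Fatal(err)
	}

	// NumPending: messages not yet delivered to this consumer.
	// NumAckPending: delivered but awaiting acknowledgment.
	log.Printf("pending=%d ack_pending=%d", ci.NumPending, ci.NumAckPending)
}
```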
```bash
# Check if messages are being published to the stream's subjects
nats sub "my.subject.>" --count 5 --timeout 10s
```

If no messages arrive within the timeout, publishers are not active on the stream's subjects.
To script the same threshold comparison, query the stream state via the JetStream API:

```go
// Assumes an established *nats.Conn named nc.
js, err := nc.JetStream()
if err != nil {
	log.Fatal(err)
}
info, err := js.StreamInfo("MY_STREAM")
if err != nil {
	log.Fatal(err)
}

msgsWarn := uint64(100000)
msgsCritical := uint64(500000)

if info.State.Msgs > msgsCritical {
	log.Printf("CRITICAL: stream has %d messages (critical threshold: %d)",
		info.State.Msgs, msgsCritical)
} else if info.State.Msgs > msgsWarn {
	log.Printf("WARNING: stream has %d messages (warn threshold: %d)",
		info.State.Msgs, msgsWarn)
}
```

Fix the consumers first. If consumers are paused, crashed, or slow, restoring consumer health is the primary fix. Messages will be processed and (under work-queue retention) removed:
```bash
# Check consumer state
nats consumer info MY_STREAM CONSUMER_NAME

# If the consumer is paused, resume it
nats consumer resume MY_STREAM CONSUMER_NAME
```

Add retention limits. If the stream has no message or byte limits, add them to prevent unbounded growth:
```bash
nats stream edit MY_STREAM --max-msgs 1000000 --max-bytes 10GB
```

Purge selectively. If immediate cleanup is needed, purge messages older than a certain sequence or by subject:
```bash
# Purge messages up to a specific sequence
nats stream purge MY_STREAM --seq 50000

# Purge by subject
nats stream purge MY_STREAM --subject "events.debug.>"
```

Scale consumers. If the consumer cannot keep up with the publish rate, add more consumer instances using pull consumers or queue groups to increase throughput, as sketched below.
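As an illustration of the pull-consumer approach: every additional copy of this worker fetches from the same durable, so pending messages are spread across instances. The stream, durable, and subject names here are placeholders for your own:

```go
package main

import (
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Bind to a shared durable pull consumer; running more copies of
	// this process spreads the pending messages across instances.
	sub, err := js.PullSubscribe("my.subject.>", "WORKERS", nats.BindStream("MY_STREAM"))
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Fetch a batch; MaxWait bounds how long we block when idle.
		msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
		if err != nil {
			if errors.Is(err, nats.ErrTimeout) {
				continue // no messages available right now
			}
			log.Fatal(err)
		}
		for _, msg := range msgs {
			// ... process msg ...
			msg.Ack() // under work-queue retention, ack removes the message
		}
	}
}
```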
Verify publisher health. For an unexpectedly low count, first confirm that upstream services are running and publishing to the correct subjects. A quick smoke test:
```bash
# Publish a test message
nats pub my.subject "test message"

# Verify it appears in the stream
nats stream info MY_STREAM --json | jq '.state.messages'
```

Check subject configuration. Ensure the stream's subjects match what publishers are sending:
```bash
nats stream info MY_STREAM --json | jq '.config.subjects'
```

Review retention policy. If messages are being removed too quickly, relax the retention limits:
```bash
nats stream edit MY_STREAM --max-age 72h --max-msgs 500000
```

Configure warn and critical thresholds:
```bash
# Upper bounds: warn at 100k, critical at 500k
nats stream edit MY_STREAM \
  --metadata "io.nats.monitor.msgs-warn=100000" \
  --metadata "io.nats.monitor.msgs-critical=500000"

# Lower bounds: warn below 1000, critical below 100
nats stream edit MY_STREAM \
  --metadata "io.nats.monitor.msgs-warn=1000" \
  --metadata "io.nats.monitor.msgs-critical=100"
```

The direction (upper vs. lower bound) is inferred from the relative values: if warn < critical, they are upper bounds; if warn > critical, they are lower bounds.
Combine with other stream checks. Use message count thresholds alongside Stream Byte Limit (JETSTREAM_013), Stream Message Limit (JETSTREAM_014), and Storage Pressure (JETSTREAM_002) for comprehensive stream capacity monitoring.
The check compares msgs-warn and msgs-critical values. If msgs-warn < msgs-critical, both are treated as upper bounds — the check fires when the message count exceeds either value. If msgs-warn > msgs-critical, both are treated as lower bounds — the check fires when the message count drops below either value. This allows the same metadata keys to serve both “too many” and “too few” use cases without additional configuration.
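A minimal sketch of that inference rule, reading the thresholds from a stream's metadata map. The evaluate helper and its states are illustrative, not the check's actual implementation:

```go
package main

import (
	"fmt"
	"strconv"
)

// evaluate applies the documented direction rule: warn < critical means
// upper bounds (too many messages), warn > critical means lower bounds
// (too few messages). This is a hypothetical helper for illustration.
func evaluate(metadata map[string]string, msgs uint64) string {
	warn, errW := strconv.ParseUint(metadata["io.nats.monitor.msgs-warn"], 10, 64)
	crit, errC := strconv.ParseUint(metadata["io.nats.monitor.msgs-critical"], 10, 64)
	if errW != nil || errC != nil {
		return "unconfigured"
	}
	if warn < crit { // upper bounds
		switch {
		case msgs > crit:
			return "critical"
		case msgs > warn:
			return "warning"
		}
	} else { // lower bounds
		switch {
		case msgs < crit:
			return "critical"
		case msgs < warn:
			return "warning"
		}
	}
	return "ok"
}

func main() {
	meta := map[string]string{
		"io.nats.monitor.msgs-warn":     "1000",
		"io.nats.monitor.msgs-critical": "100",
	}
	// Lower-bound configuration: 50 messages is below both thresholds.
	fmt.Println(evaluate(meta, 50)) // prints "critical"
}
```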
Use both. Message count thresholds monitor the number of messages, which is useful when each message represents a discrete work item (orders, events, commands). Byte-based limits monitor total storage consumption, which matters when message sizes vary significantly. A stream might have a normal message count but abnormal storage use if message sizes increased. The two approaches complement each other.
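For instance, tracking the average message size (bytes divided by message count, both from stream info) surfaces the size-drift case described above. A small sketch, with MY_STREAM as a placeholder:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	info, err := js.StreamInfo("MY_STREAM")
	if err != nil {
		log.Fatal(err)
	}

	// Guard against division by zero on an empty stream.
	if info.State.Msgs > 0 {
		avg := float64(info.State.Bytes) / float64(info.State.Msgs)
		log.Printf("avg message size: %.0f bytes across %d messages", avg, info.State.Msgs)
	}
}
```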
Under limits-based retention, the oldest messages are automatically discarded to make room for new ones. The stream’s message count stays at or near the limit. The message-count threshold check is separate from the stream’s built-in limits — it fires based on operator-defined thresholds that may be set below the stream’s hard limit to provide early warning before messages start being discarded.
Yes. If a runaway publisher is injecting messages at an abnormally high rate, the message count will grow faster than normal. An upper-bound threshold (msgs-warn/msgs-critical) catches this growth before it becomes a storage emergency. For more immediate detection of publish rate anomalies, combine this check with request-rate monitoring via the JetStream API metrics.
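As a lighter-weight approximation than full API metrics, the publish rate can be estimated by sampling the stream's last sequence twice; a sketch under that assumption:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	before, err := js.StreamInfo("MY_STREAM")
	if err != nil {
		log.Fatal(err)
	}
	time.Sleep(10 * time.Second)
	after, err := js.StreamInfo("MY_STREAM")
	if err != nil {
		log.Fatal(err)
	}

	// last_seq only increases, so its delta counts every message stored
	// during the interval, regardless of what retention removed.
	rate := float64(after.State.LastSeq-before.State.LastSeq) / 10.0
	log.Printf("approximate publish rate: %.1f msgs/sec", rate)
}
```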
Review thresholds quarterly or whenever the stream’s traffic pattern changes significantly — new publishers, changes in message size, retention policy adjustments, or consumer scaling events. Thresholds that were appropriate at one traffic level may be too tight or too loose after a system change. Synadia Insights evaluates these thresholds continuously, so changes take effect on the next collection epoch.