
NATS Sustained Consumer Growth on Stream: Detecting Consumer Leaks

Severity: Warning
Category: Errors
Applies to: JetStream
Check ID: OPT_SYS_025
Detection threshold: Consumer count on a stream growing steadily over the lookback window

A consumer leak happens when an application keeps creating new JetStream consumers without cleaning up old ones. The stream’s consumer count climbs steadily — 100, then 500, then 2,000 — even though the application’s logical consumer count hasn’t changed. Each leaked consumer consumes server memory, maintains its own Raft group (for R3 streams), and adds to the metadata overhead. Eventually, the leak degrades cluster performance, exhausts consumer limits, or destabilizes Raft.

Why this matters

Every JetStream consumer is a stateful object. The server tracks the consumer’s configuration, its ack floor (which messages have been acknowledged), its pending set (which messages are in flight), and its last activity timestamp. For replicated streams, each consumer has its own Raft group for state replication. This means every leaked consumer creates:

  • A Raft group with its own WAL, leader election, and heartbeat traffic (for R3 streams, that’s 3 Raft groups per consumer across the cluster)
  • Memory for state tracking — pending messages, ack bitmaps, delivery tracking
  • Metadata overhead — the consumer definition is replicated via the meta Raft group

At tens of consumers, this is negligible. At thousands, it’s a measurable load on the cluster. The Raft heartbeat traffic alone — each consumer’s Raft group sending heartbeats every 200ms — can saturate internal routes. The meta leader, which tracks all JetStream assets, slows down as the asset count grows. Stream operations that enumerate consumers (like nats consumer list) become slow.

The most common source is ephemeral consumers. An ephemeral consumer is designed to be temporary — it has no durable name and is automatically cleaned up after inactive_threshold (default: 5 seconds of inactivity). But if the application creates new ephemeral consumers faster than old ones expire, or if inactive_threshold is set too high, the consumer count grows monotonically.

The pattern often looks like this: a service restarts and creates new ephemeral consumers. The old consumers from the previous instance haven’t expired yet (or inactive_threshold is set to hours or days). Each restart adds another batch of consumers. Over a deployment cycle with frequent restarts — rolling updates, autoscaling events, crash loops — the consumer count balloons.

Common causes

  • Ephemeral consumers without proper inactive_threshold. The application creates ephemeral consumers with a long inactive_threshold (or relies on the default for older server versions). Old consumers linger long after the creating client disconnects.

  • Application restart creating new consumers. On each startup, the application calls CreateConsumer() or Subscribe() with a new or no durable name. The previous instance’s consumers are still alive, and now there are duplicates.

  • Auto-scaling creating consumer instances. Each new pod/instance in an auto-scaled service creates its own ephemeral consumer. Scale-down events destroy the pod but the consumer persists until inactive_threshold expires.

  • Missing durable names. The application intends for consumers to be persistent but omits the durable name, creating a new ephemeral consumer on each connection. Using a durable name would reuse the existing consumer instead.

  • Client library creating implicit consumers. Some NATS client library convenience methods (like js.Subscribe() in Go) create new consumers implicitly. If called repeatedly without understanding the consumer lifecycle, each call adds a consumer.

  • Error handling creating retry consumers. A retry loop creates a new consumer on each attempt; if the error is persistent, the loop generates hundreds of consumers before the issue is resolved (see the sketch after this list).
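To make the last two causes concrete, here is a minimal Go sketch (legacy nats.go JetStream API; js is assumed to be a JetStream context from nc.JetStream(), and handleOrder, verifyDownstream, maxRetries, and backoff are hypothetical names). Each failed attempt leaves behind the ephemeral consumer it just created; binding to a durable name avoids the accumulation.

// Anti-pattern: each retry creates a brand-new ephemeral consumer, and the
// consumer from the failed attempt is never unsubscribed, so a persistent
// error leaks one consumer per attempt.
for attempt := 0; attempt < maxRetries; attempt++ {
    sub, err := js.Subscribe("orders.>", handleOrder) // new ephemeral consumer
    if err != nil {
        time.Sleep(backoff)
        continue
    }
    if err := verifyDownstream(); err != nil { // hypothetical post-subscribe check
        // BUG: missing sub.Unsubscribe() before retrying; this consumer is now
        // orphaned until inactive_threshold reaps it.
        _ = sub
        time.Sleep(backoff)
        continue
    }
    break
}

// Safer: bind to one durable consumer that is reused across retries and restarts.
durSub, err := js.Subscribe("orders.>", handleOrder, nats.Durable("order-processor"))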

How to diagnose

Terminal window
# Current consumer counts per stream
nats stream list --json | jq -c '.[] | {name: .config.name, consumers: .state.consumer_count}' | sort

Run this periodically and compare. A stream where consumer count grows monotonically has a leak.

Identify the leaking stream

Terminal window
# Find streams with unusually high consumer counts
nats stream list --json | jq '.[] | select(.state.consumer_count > 50) | {name: .config.name, consumers: .state.consumer_count}'

Examine consumer details

Terminal window
# List consumers on the affected stream
nats consumer list MY_STREAM --json | jq '.[] | {
  name: .config.name,
  durable: .config.durable_name,
  created: .created,
  inactive_threshold: .config.inactive_threshold,
  last_activity: .ts,
  pending: .num_pending,
  ack_pending: .num_ack_pending
}'

Look for patterns (a programmatic version of this check follows the list):

  • Many consumers with no durable_name → ephemeral consumer leak
  • Many consumers with similar creation timestamps → restart-related leak
  • Many consumers with high inactive_threshold → slow cleanup
  • Many consumers with zero ack_pending and zero num_pending → inactive/orphaned consumers
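The same classification can be done programmatically against the management API. A minimal Go sketch, assuming the legacy nats.go JetStream API and a stream named MY_STREAM, that walks the stream's consumers and prints the fields used above:

package main

import (
    "fmt"
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, _ := nc.JetStream()

    // Walk every consumer on the stream and flag likely orphans:
    // no durable name, nothing pending, nothing awaiting acknowledgement.
    for ci := range js.Consumers("MY_STREAM") {
        ephemeral := ci.Config.Durable == ""
        idle := ci.NumPending == 0 && ci.NumAckPending == 0
        fmt.Printf("%-30s ephemeral=%-5t idle=%-5t created=%s inactive_threshold=%s\n",
            ci.Name, ephemeral, idle,
            ci.Created.Format("2006-01-02 15:04"), ci.Config.InactiveThreshold)
    }
}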

Check consumer creation rate

Monitor the JetStream advisory subject for consumer creation events:

Terminal window
nats sub '$JS.EVENT.ADVISORY.CONSUMER.CREATED.MY_STREAM.>'

A steady stream of creation events without corresponding deletion events confirms an active leak.
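To watch for this programmatically, here is a minimal Go sketch (the MY_STREAM placeholder and the one-minute reporting interval are assumptions) that subscribes to both the created and deleted consumer advisories and logs the net change; a net value that only climbs confirms the leak:

package main

import (
    "log"
    "sync/atomic"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    var created, deleted int64

    // JetStream emits an advisory for every consumer created and deleted.
    nc.Subscribe("$JS.EVENT.ADVISORY.CONSUMER.CREATED.MY_STREAM.>", func(_ *nats.Msg) {
        atomic.AddInt64(&created, 1)
    })
    nc.Subscribe("$JS.EVENT.ADVISORY.CONSUMER.DELETED.MY_STREAM.>", func(_ *nats.Msg) {
        atomic.AddInt64(&deleted, 1)
    })

    // Report the net change every minute.
    for range time.Tick(time.Minute) {
        c, d := atomic.LoadInt64(&created), atomic.LoadInt64(&deleted)
        log.Printf("consumers created=%d deleted=%d net=%+d", c, d, c-d)
    }
}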

Correlate with application deployments

If consumer count spikes correlate with deployment events (rolling updates, scaling events), the application lifecycle is the likely source:

Terminal window
# Check consumer creation times
nats consumer list MY_STREAM --json | jq '[.[] | .created] | sort | .[-10:]'

How to fix it

Clean up existing leaked consumers

Delete orphaned consumers that are no longer active:

Terminal window
# Delete a specific consumer
nats consumer delete MY_STREAM CONSUMER_NAME
# Delete all consumers with no pending and no in-flight messages (use with caution)
nats consumer list MY_STREAM --json | jq -r '.[] | select(.num_ack_pending == 0 and .num_pending == 0) | .config.name' | while read name; do
  nats consumer delete MY_STREAM "$name" -f
done
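If you prefer to drive the cleanup from code, the same selection works against the management API. A minimal Go sketch (assuming js is a JetStream context from nc.JetStream() and MY_STREAM is the affected stream) that deletes ephemeral consumers with no pending work:

// Delete consumers that look orphaned: ephemeral (no durable name),
// nothing pending, and nothing awaiting acknowledgement.
for ci := range js.Consumers("MY_STREAM") {
    if ci.Config.Durable == "" && ci.NumPending == 0 && ci.NumAckPending == 0 {
        if err := js.DeleteConsumer("MY_STREAM", ci.Name); err != nil {
            log.Printf("delete %s: %v", ci.Name, err)
        }
    }
}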

Set appropriate inactive_threshold on ephemeral consumers

For ephemeral consumers, set inactive_threshold to match the expected inactivity window — typically a few seconds to a few minutes:

Go
js, _ := nc.JetStream()
sub, err := js.Subscribe("orders.>",
    func(msg *nats.Msg) {
        processOrder(msg)
        msg.Ack()
    },
    nats.InactiveThreshold(30*time.Second),
)
Python
import nats
from nats.js.api import ConsumerConfig
from datetime import timedelta

nc = await nats.connect()
js = nc.jetstream()

await js.subscribe(
    "orders.>",
    cb=process_order,
    config=ConsumerConfig(
        inactive_threshold=timedelta(seconds=30).total_seconds(),
    ),
)

Use durable consumers where appropriate

If the consumer represents a logical processing identity that should survive restarts, use a durable name:

Go
sub, err := js.PullSubscribe("orders.>", "order-processor",
    nats.BindStream("ORDERS"),
)
// On restart, this reuses the existing "order-processor" consumer
// instead of creating a new one
Python
sub = await js.pull_subscribe(
    "orders.>",
    durable="order-processor",
    stream="ORDERS",
)

Fix application lifecycle management

Ensure consumers are cleaned up on application shutdown:

Go
// Create consumer
sub, _ := js.Subscribe("orders.>", handler)

// On shutdown, unsubscribe (destroys ephemeral consumer)
shutdown := make(chan os.Signal, 1)
signal.Notify(shutdown, syscall.SIGTERM, syscall.SIGINT)
<-shutdown
sub.Unsubscribe() // removes the ephemeral consumer
Python
import signal
import asyncio

sub = await js.subscribe("orders.>", cb=handler)

async def shutdown():
    await sub.unsubscribe()  # removes the ephemeral consumer
    await nc.close()

loop = asyncio.get_event_loop()
loop.add_signal_handler(signal.SIGTERM, lambda: asyncio.create_task(shutdown()))

Set consumer limits as a safety net

Configure a maximum consumer count on streams to prevent unbounded growth:

Terminal window
nats stream edit MY_STREAM --max-consumers 100

When the limit is hit, new consumer creation fails with an error, making the leak immediately visible rather than silently degrading the cluster.
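If streams are managed from code rather than the CLI, the same cap can be applied by updating the stream configuration. A minimal Go sketch (assuming js is a JetStream context and MY_STREAM already exists; the limit of 100 is an example):

// Fetch the current config, cap the consumer count, and apply the update.
// Consumer creation beyond the limit then fails loudly instead of leaking silently.
info, err := js.StreamInfo("MY_STREAM")
if err != nil {
    log.Fatal(err)
}
cfg := info.Config
cfg.MaxConsumers = 100
if _, err := js.UpdateStream(&cfg); err != nil {
    log.Fatal(err)
}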

Monitor with alerting

Set up alerting on consumer count growth.
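If you roll your own alert, the underlying signal is simple: sample the consumer count per stream and flag a stream whose count rises for several consecutive samples. A minimal Go sketch (legacy nats.go API; the five-minute interval and three-sample threshold are arbitrary choices):

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, _ := nc.JetStream()

    last := map[string]int{}   // previous consumer count per stream
    rising := map[string]int{} // consecutive samples where the count rose

    for range time.Tick(5 * time.Minute) {
        for si := range js.Streams() {
            name, n := si.Config.Name, si.State.Consumers
            if n > last[name] {
                rising[name]++
            } else {
                rising[name] = 0
            }
            last[name] = n

            // Three consecutive increases look like a leak rather than normal churn.
            if rising[name] >= 3 {
                log.Printf("ALERT: stream %s consumer count still rising: now %d", name, n)
            }
        }
    }
}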

Frequently asked questions

What’s the difference between consumer churn and consumer growth?

Consumer churn (CONSUMER_002) is high create/delete rate — consumers being created and destroyed frequently. Consumer growth is the net count increasing over time — creation rate exceeds deletion rate. Churn causes metadata overhead from frequent Raft operations. Growth causes persistent resource accumulation. They can occur together: high churn with a slight imbalance toward creation produces steady growth.

How many consumers is too many on a single stream?

There’s no hard limit, but performance degrades gradually. At 100 consumers on an R3 stream, you have 300 Raft groups just for consumers. At 1,000, you have 3,000. Each Raft group adds heartbeat traffic, memory, and WAL I/O. For most workloads, more than 50–100 consumers on a single stream warrants investigation. If you genuinely need many consumers, consider partitioning across multiple streams.

Will setting inactive_threshold too low cause problems?

If inactive_threshold is shorter than the gap between message deliveries, the consumer may be cleaned up while still logically active. For pull consumers that fetch periodically, set inactive_threshold longer than the maximum expected interval between fetches. For push consumers, the subscription heartbeat keeps the consumer alive — inactive_threshold is only relevant after the subscription is gone.
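For example, a pull consumer fetched every 30 seconds should keep its inactivity window well above that gap. A minimal Go sketch (legacy nats.go API, assuming a server version that supports ephemeral pull consumers; js, the subject, and the durations are illustrative):

// Ephemeral pull consumer fetched on a 30s cycle: the inactivity window is
// kept well above the fetch interval so the server does not reap the
// consumer between fetches.
sub, err := js.PullSubscribe("orders.>", "",
    nats.InactiveThreshold(5*time.Minute),
)
if err != nil {
    log.Fatal(err)
}

for range time.Tick(30 * time.Second) {
    msgs, err := sub.Fetch(100, nats.MaxWait(5*time.Second))
    if err != nil {
        continue // typically nats.ErrTimeout when nothing is pending
    }
    for _, m := range msgs {
        m.Ack()
    }
}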

Can I convert ephemeral consumers to durable ones?

Not in place. You need to delete the ephemeral consumer and create a new durable consumer with the same filter and deliver policy. If the ephemeral consumer has processing state (ack floor), you’ll need to set the new durable consumer’s deliver_policy to by_start_sequence matching the ephemeral consumer’s ack floor to avoid reprocessing.
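A minimal Go sketch of that migration (legacy nats.go API; the stream, consumer names, and filter are placeholders, and js is a JetStream context): read the ephemeral consumer's ack floor, create the durable consumer starting one past it, then delete the ephemeral one.

// Read the ephemeral consumer's position.
old, err := js.ConsumerInfo("ORDERS", "ephemeral-consumer-name")
if err != nil {
    log.Fatal(err)
}

// Create the durable consumer starting one past the last acknowledged stream
// sequence so already processed messages are not redelivered.
_, err = js.AddConsumer("ORDERS", &nats.ConsumerConfig{
    Durable:       "order-processor",
    FilterSubject: old.Config.FilterSubject,
    AckPolicy:     nats.AckExplicitPolicy,
    DeliverPolicy: nats.DeliverByStartSequencePolicy,
    OptStartSeq:   old.AckFloor.Stream + 1,
})
if err != nil {
    log.Fatal(err)
}

// Remove the ephemeral consumer once the durable one is in place.
if err := js.DeleteConsumer("ORDERS", "ephemeral-consumer-name"); err != nil {
    log.Fatal(err)
}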

Does Insights track the growth trend automatically?

Yes. Synadia Insights monitors consumer count per stream across the lookback window and flags streams where the count is growing steadily. The check distinguishes between legitimate scaling (correlated with connection growth) and leaks (growth without corresponding connection or workload changes), so it doesn’t alert on intentional consumer additions.

Proactive monitoring for NATS sustained consumer growth on stream with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial