NATS Consumer Churn High: What It Means and How to Fix It

Severity: Warning
Category: Errors
Applies to: JetStream
Check ID: JETSTREAM_006
Detection threshold: consumer count delta exceeds configured threshold (default: 5,000) per collection epoch

Consumer churn high means the JetStream consumer population changed significantly between collection epochs. The check fires when the absolute delta in total consumer count exceeds the configured threshold (default: 5,000) — meaning the cluster gained or lost more consumers than the threshold within a single observation interval. Rapid consumer lifecycle activity — ephemeral consumers being created and destroyed in tight loops, durable consumers being repeatedly deleted and recreated, or a sudden mass-deletion of consumers — puts unnecessary load on the Raft consensus layer and degrades cluster performance.

Why this matters

Every JetStream consumer creation and deletion is a Raft operation. For replicated consumers (R > 1), the operation must be proposed by the meta leader, committed by a quorum of peers, and applied to the meta group state. At low volumes this is invisible. At thousands of operations per epoch, it becomes a significant source of CPU, disk I/O, and network overhead on the meta group — the same coordination layer that handles stream management, leader elections, and all other JetStream API operations.

The impact radiates outward. As the meta leader spends more cycles processing consumer churn, other JetStream API operations slow down. Stream creation latency increases. Consumer info lookups take longer. If the churn is sustained, it can trigger JS API Pending High (JETSTREAM_005) and JS API Request Rate High (JETSTREAM_004) alerts simultaneously, signaling that the control plane is congested. In extreme cases, the meta leader falls behind on Raft heartbeats, triggering unnecessary leader elections that temporarily halt all JetStream API processing.

Beyond the immediate performance hit, high consumer churn is almost always a symptom of a design problem in the consuming application. Ephemeral consumers created per-request, durable consumers recreated on every reconnect, or CI/CD pipelines spinning up test consumers without cleanup — these patterns work at small scale but become cluster-level problems as traffic grows. The churn itself is the signal; the fix is in the application.

Common causes

  • Ephemeral consumers created per request. The application creates a new ephemeral consumer for each incoming request, processes messages, then lets the consumer be garbage collected. This pattern generates two Raft operations (create + delete) per request. At 100 requests per second, that’s 12,000 Raft operations per minute from consumer lifecycle alone (see the sketch after this list).

  • Durable consumers recreated on every connection. Client code that calls DeleteConsumer followed by CreateConsumer on each startup — often to “reset” consumer state or apply updated configuration — generates churn proportional to the number of client restarts. During a rolling deploy of 50 instances, this creates 100 unnecessary Raft operations.

  • Short inactive threshold on ephemeral consumers. Ephemeral consumers are deleted when they have no subscribers for longer than their inactive_threshold (default: 5 seconds). If the subscribing application has brief disconnects — network blips, GC pauses, load balancer health checks — the consumer may be garbage collected prematurely and then recreated, causing repeated churn. Setting inactive_threshold to a longer duration prevents deletion during brief inactivity gaps.

  • Deployment cycling recreating all consumers. CI/CD pipelines or orchestration systems that tear down and rebuild consumers as part of each deploy generate burst churn. If the pipeline runs frequently (canary deploys, blue-green switches), each cycle adds to the total.

  • Test or staging consumers not cleaned up. Automated test suites that create consumers for integration tests but don’t reliably delete them afterward leave behind consumer metadata. When a cleanup job eventually runs, the bulk deletion spikes churn.

  • Client library misconfiguration. Some client libraries create a new consumer if the existing one doesn’t exactly match the requested configuration. A subtle mismatch — different ack_wait, different max_deliver — causes the library to delete and recreate the consumer on every connection.
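
To make the first pattern concrete, here is a minimal anti-pattern sketch in Go, assuming the github.com/nats-io/nats.go/jetstream package is imported as jetstream; the handler, stream, and subject names are illustrative, not taken from any real application:

// Anti-pattern sketch: a hypothetical handler that creates a fresh
// ephemeral consumer for every request it serves.
func handleRequest(ctx context.Context, js jetstream.JetStream) error {
	// No Durable name, so this consumer is ephemeral: one Raft create here...
	cons, err := js.CreateConsumer(ctx, "ORDERS", jetstream.ConsumerConfig{
		FilterSubject: "ORDERS.new",
		AckPolicy:     jetstream.AckExplicitPolicy,
	})
	if err != nil {
		return err
	}
	msg, err := cons.Next() // fetch a single message, then abandon the consumer
	if err != nil {
		return err
	}
	return msg.Ack()
	// ...and one Raft delete later, when inactive_threshold expires.
	// A shared durable consumer (shown under "How to fix it") avoids both.
}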

How to diagnose

Confirm consumer churn rate

Check the total consumer count across the cluster:

Terminal window
nats server report jetstream

Compare the consumer count across consecutive runs. A large delta between epochs (the default threshold is 5,000) indicates churn. For a more granular view, list consumers on specific streams:

Terminal window
nats consumer ls ORDERS
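
To automate the comparison, a minimal sketch using the Go client (github.com/nats-io/nats.go/jetstream imported as jetstream); run it on a timer and diff successive totals to approximate what the check measures:

// Count every consumer across all streams; diff consecutive runs to
// estimate churn. Assumes js comes from jetstream.New(nc).
func totalConsumers(ctx context.Context, js jetstream.JetStream) (int, error) {
	total := 0
	streams := js.StreamNames(ctx)
	for name := range streams.Name() {
		s, err := js.Stream(ctx, name)
		if err != nil {
			return 0, err
		}
		for range s.ConsumerNames(ctx).Name() {
			total++
		}
	}
	return total, streams.Err()
}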

Watch consumer lifecycle events in real time

JetStream publishes advisory events for every consumer creation and deletion:

Terminal window
nats event --js-advisory

Look for rapid alternation between ConsumerCreated and ConsumerDeleted events on the same stream. The advisory includes the consumer name, stream, account, and client information — use this to identify which application is generating the churn.
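
If you want a running tally instead of scrolling output, here is a minimal sketch using the core Go client (github.com/nats-io/nats.go imported as nats, plus strings and sync). The subject layout is the server's documented advisory scheme; the aggregation logic is illustrative:

// Tally create/delete advisories per stream. Subjects look like
// $JS.EVENT.ADVISORY.CONSUMER.CREATED.<stream>.<consumer> (or DELETED).
counts := map[string]int{}
var mu sync.Mutex
nc.Subscribe("$JS.EVENT.ADVISORY.CONSUMER.>", func(m *nats.Msg) {
	tokens := strings.Split(m.Subject, ".")
	if len(tokens) < 6 {
		return // not a consumer lifecycle advisory
	}
	action, stream := tokens[4], tokens[5]
	mu.Lock()
	counts[stream+"/"+action]++
	mu.Unlock()
})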

Identify ephemeral vs durable churn

Ephemeral consumers have system-generated names (random strings). Durable consumers have explicit names set by the application. If the churn is all ephemeral consumers, the issue is likely a per-request consumer pattern. If it’s durable consumers being deleted and recreated, look for deployment or reconnection logic.

Terminal window
# List consumers with details to see which are durable
nats consumer ls ORDERS -j | jq '.[].config.durable_name'

Null values indicate ephemeral consumers. Named values are durable.

Check API request patterns

Consumer churn drives JetStream API traffic. Correlate with API metrics:

Terminal window
curl -s http://localhost:8222/jsz | jq '.api'

If total is climbing fast and consumer churn is the primary cause, the inflight count may also be elevated.
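
The same counters can be scraped programmatically. A minimal Go sketch, assuming net/http, encoding/json, fmt, and log are imported; the struct decodes only the fields used here, and the inflight field is an assumption that holds on newer servers exposing it in jsz:

// Fetch /jsz and print the JetStream API counters. URL is an example.
resp, err := http.Get("http://localhost:8222/jsz")
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

var jsz struct {
	API struct {
		Total    uint64 `json:"total"`
		Errors   uint64 `json:"errors"`
		Inflight uint64 `json:"inflight"` // newer servers only
	} `json:"api"`
}
if err := json.NewDecoder(resp.Body).Decode(&jsz); err != nil {
	log.Fatal(err)
}
fmt.Printf("api total=%d errors=%d inflight=%d\n", jsz.API.Total, jsz.API.Errors, jsz.API.Inflight)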

How to fix it

Immediate: stop the bleeding

Pause or throttle the churning application. If you can identify the application generating the churn from advisory events, scale it down or pause its deploys while you fix the root cause. This is especially important if churn is causing cascading effects on API latency.

Increase the inactive threshold for ephemeral consumers. If ephemeral consumers are being garbage collected too aggressively during brief disconnects, increase inactive_threshold to give clients more time to reconnect:

// Go — set a longer inactive threshold for ephemeral consumers
cons, err := js.CreateConsumer(ctx, "ORDERS", jetstream.ConsumerConfig{
	FilterSubject:     "ORDERS.>",
	AckPolicy:         jetstream.AckExplicitPolicy,
	InactiveThreshold: 5 * time.Minute, // default is 5s
})
# Python (nats.py)
from nats.js.api import ConsumerConfig

config = ConsumerConfig(
    filter_subject="ORDERS.>",
    ack_policy="explicit",
    inactive_threshold=300,  # 5 minutes in seconds
)
await js.subscribe("ORDERS.>", config=config)

Short-term: fix the consumer lifecycle

Switch from ephemeral to durable consumers. Durable consumers persist across connections. The application connects, binds to an existing consumer, and starts processing — no create or delete needed:

// Go — bind to an existing durable consumer
cons, err := js.Consumer(ctx, "ORDERS", "order-processor")
if err != nil {
	// Consumer doesn't exist yet — create it once
	cons, err = js.CreateOrUpdateConsumer(ctx, "ORDERS", jetstream.ConsumerConfig{
		Durable:       "order-processor",
		FilterSubject: "ORDERS.>",
		AckPolicy:     jetstream.AckExplicitPolicy,
	})
}
# Python (nats.py) — use durable consumer
sub = await js.subscribe("ORDERS.>", durable="order-processor")

Stop deleting consumers on reconnection. If client code deletes and recreates durable consumers to “reset” state, remove that logic. Durable consumers track their delivery position across connections. If you need to replay messages, create a separate consumer with the desired start sequence rather than resetting the production consumer — consumer updates cannot change the deliver policy, and delete-and-recreate is exactly the churn this check flags.

Use CreateOrUpdateConsumer instead of delete-and-create. Most NATS client libraries support an idempotent create-or-update operation. If the consumer already exists with the same configuration, it’s a no-op. If the configuration has changed, it updates in place:

// Go — idempotent create/update
cons, err := js.CreateOrUpdateConsumer(ctx, "ORDERS", jetstream.ConsumerConfig{
	Durable:    "order-processor",
	AckPolicy:  jetstream.AckExplicitPolicy,
	MaxDeliver: 5,
})

Long-term: design for stable consumers

Provision consumers as infrastructure, not application logic. Create durable consumers as part of your infrastructure provisioning (Terraform, Helm, deployment manifests) — not in application startup code. Applications should bind to pre-existing consumers, not create them. This eliminates consumer lifecycle from the application entirely.

Use queue groups for scaling. Instead of creating a new consumer per application instance, use a single consumer with multiple subscribers via queue groups. NATS distributes messages across all subscribers in the group without requiring additional consumers:

Terminal window
# Multiple instances subscribe to the same consumer
nats consumer next ORDERS order-processor --count 10
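
In application code, the pull-consumer equivalent is every instance binding to the same durable and consuming from it. A minimal sketch, assuming js is a connected jetstream.JetStream and the order-processor durable already exists:

// Every instance runs this same code and shares the one durable consumer,
// so scaling out adds no consumer-lifecycle churn.
cons, err := js.Consumer(ctx, "ORDERS", "order-processor")
if err != nil {
	log.Fatal(err)
}
cc, err := cons.Consume(func(msg jetstream.Msg) {
	// Process, then ack; unacked messages are redelivered to the group.
	msg.Ack()
})
if err != nil {
	log.Fatal(err)
}
defer cc.Stop()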

Clean up test consumers in CI. If CI/CD pipelines create test consumers, ensure they’re deleted in the test teardown — not by a separate cleanup job. Better yet, use a dedicated test account with aggressive JetStream limits so test consumers can’t accumulate.
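
For Go test suites, one way to make that teardown reliable, assuming context, testing, and the jetstream package are imported; the stream and durable name here are illustrative assumptions:

// Create a per-suite consumer and guarantee deletion via t.Cleanup,
// which runs even when the test fails partway through.
func setupTestConsumer(t *testing.T, js jetstream.JetStream) jetstream.Consumer {
	t.Helper()
	ctx := context.Background()
	name := "it-order-processor" // hypothetical test consumer name
	cons, err := js.CreateOrUpdateConsumer(ctx, "ORDERS", jetstream.ConsumerConfig{
		Durable:   name,
		AckPolicy: jetstream.AckExplicitPolicy,
	})
	if err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() {
		_ = js.DeleteConsumer(ctx, "ORDERS", name)
	})
	return cons
}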

Frequently asked questions

What counts as consumer churn?

Consumer churn is the absolute change in total consumer count between two collection epochs — the net delta in the consumer count metric. If 100 consumers are created and 80 are deleted in the same epoch, the delta is 20 and the check does not fire. The check is watching for population-level swings, not raw lifecycle volume — meaning a sudden surge of new consumers, a sudden mass-deletion, or a sustained imbalance large enough to move the count by the configured threshold (default: 5,000).

Are ephemeral consumers always bad?

No. Ephemeral consumers are designed for short-lived, ad-hoc consumption — one-off queries, debugging, temporary processing. The problem is using them for steady-state workloads. If an ephemeral consumer is created and destroyed hundreds or thousands of times per epoch as part of normal application operation, that’s a design problem. For persistent workloads, durable consumers avoid the Raft overhead entirely.

Does consumer churn affect message delivery?

Not directly — existing consumers continue to deliver messages regardless of how many new consumers are being created or deleted. But the Raft overhead from churn can slow down the meta leader, which increases latency on all JetStream API operations. If churn triggers API pending or error rate alerts, new consumer creation may start failing, which prevents applications from receiving messages until the control plane recovers.

How do I find which application is creating consumers?

Watch JetStream advisory events: nats event --js-advisory. Each consumer creation advisory includes the client name, account, and connection details of the client that issued the API call. Cross-reference with your application inventory to identify the source. If your applications don’t set client names at connect time, fix that first — it’s essential for operational visibility.

Can I rate-limit consumer creation?

There’s no built-in per-client rate limit for JetStream API calls. You can set account-level limits on total consumer count (max_consumers in the account JWT), which caps the absolute number but doesn’t limit creation rate. The most effective approach is fixing the application pattern — switching to durable consumers eliminates the need for repeated creation entirely.

Proactive monitoring for NATS consumer churn high with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial