JetStream API error rate high means the ratio of failed JetStream API requests to total requests has exceeded the configured threshold. These errors indicate that a significant fraction of control-plane operations — stream creation, consumer management, info lookups — are failing, pointing to systemic misconfiguration, resource exhaustion, or permission issues.
JetStream API errors are not transient hiccups. Each error represents a failed operation: a stream that wasn’t created, a consumer that wasn’t provisioned, a metadata lookup that returned nothing useful. When the error rate exceeds the threshold, it means a meaningful percentage of your cluster’s control-plane traffic is failing — and the applications making those calls are either silently degraded or actively broken.
The damage depends on how applications handle failures. Well-written clients retry with backoff. Poorly written clients retry immediately, amplifying the error rate and adding load to an already stressed meta leader. This feedback loop is common: a burst of errors triggers retries, retries increase API load, increased load causes more errors. What started as a configuration problem or a transient resource issue becomes a sustained control-plane outage.
API errors also mask real problems. If your monitoring queries StreamInfo to track stream health and those queries are failing, you’ve lost visibility into the data plane. If consumer creation fails silently, messages accumulate in streams with no one processing them. The JetStream API is the management interface for your entire messaging infrastructure — when it’s returning errors at scale, you’re operating blind.
Applications referencing deleted or renamed streams. Client code hardcodes a stream name that was deleted, renamed, or moved to a different account. Every API call against that stream returns “stream not found.” This is especially common after infrastructure refactors where stream names change but not all client configurations are updated.
Permission denials on JetStream operations. The account or user making the API call lacks JetStream permissions. This surfaces as errors on every attempt to create, query, or modify streams and consumers. Common when a new account is provisioned without JetStream access or when JWT permissions are updated but not propagated to all clients.
Resource limits exceeded. The account has hit its max_streams, max_consumers, or JetStream storage limit. Every subsequent create operation fails. Creations succeed right up to the limit; it’s the 101st stream (if the limit is 100) that fails, making this hard to catch without monitoring.
Race conditions in consumer creation. Multiple application instances simultaneously try to create the same durable consumer with different configurations. One succeeds; the others get “consumer already exists with different configuration.” This is common during rolling deploys where old and new versions coexist briefly.
API calls during meta leader elections. When the meta leader steps down or a new election occurs, JetStream API requests submitted during the election window return errors. If leader elections are frequent (META_003), this creates periodic error spikes that push the error rate above the threshold.
Stale client configuration. Applications deployed with outdated configuration — wrong stream names, wrong consumer settings, wrong account credentials — generate a steady stream of errors on every API interaction. This accumulates across instances: 20 misconfigured instances, each making 10 API calls per minute, produce 200 errors per minute.
View JetStream API totals and errors per server:
```
nats server report jetstream
```

The report shows API request and error counts per server. For raw numbers:
```
curl -s http://localhost:8222/jsz | jq '.api'
```

This returns the total request and error counts. The check fires when errors / total exceeds the threshold (default: 1%) with at least the minimum request volume (default: 100 requests).
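The same calculation can be scripted. A minimal Go sketch against a single server’s monitoring endpoint, assuming the default monitoring port 8222 and the check’s default thresholds:

```go
// Fetch JetStream API stats from the monitoring endpoint and compute
// the error rate this check evaluates. Port and thresholds are the
// documented defaults.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:8222/jsz")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var jsz struct {
		API struct {
			Total  int64 `json:"total"`
			Errors int64 `json:"errors"`
		} `json:"api"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&jsz); err != nil {
		log.Fatal(err)
	}

	if jsz.API.Total < 100 { // minimum request volume
		fmt.Println("not enough API traffic to evaluate")
		return
	}
	rate := float64(jsz.API.Errors) / float64(jsz.API.Total)
	fmt.Printf("API error rate: %.2f%%\n", rate*100)
	if rate > 0.01 { // 1% default threshold
		fmt.Println("error rate above threshold")
	}
}
```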
JetStream advisory events include error details. Watch them in real time:
```
nats event --js-advisory
```

Each API error advisory includes the error code, description, account, and the client that triggered it. Common error messages include:
- stream not found — the stream doesn’t exist or is in a different account
- consumer already exists — a durable consumer exists with a different configuration
- insufficient resources — account or server limits reached
- not authorized — missing JetStream permissions

If errors are resource-related, verify account-level JetStream limits:
```
nats account info
```

This shows current usage vs. limits for streams, consumers, memory, and storage. If any are at 100%, all subsequent create operations will fail.
List all streams accessible to the current account and compare against what your applications expect:
```
nats stream ls -a
```

If a stream your applications reference doesn’t appear in this list, it’s been deleted, renamed, or is in a different account.
If errors spike periodically, check whether they coincide with leader elections:
```
nats event --js-advisory
```

Look for leader election events interleaved with API errors. If the pattern matches, the root cause is leader instability (META_003), not application misconfiguration.
Identify the specific API error categories. Check server logs for the error types driving the rate up — common categories include permission denials (403), stream/consumer not found (404), and resource exhaustion (503). High error rates often correlate with client misconfiguration (wrong stream names, insufficient permissions) rather than server issues. The advisory stream tells you which errors are occurring and which clients are generating them. Fix the highest-volume error first — updating one misconfigured application can cut the error rate dramatically.
Add error handling and backoff in clients. If applications retry JetStream API calls on failure, ensure they use exponential backoff. Immediate retries on a congested or misconfigured system make everything worse:
```go
// Go — use retry with backoff for JetStream operations
var stream jetstream.Stream
var err error
for attempt := 0; attempt < 5; attempt++ {
	stream, err = js.Stream(ctx, "ORDERS")
	if err == nil {
		break
	}
	// exponential backoff: 100ms, 200ms, 400ms, 800ms, 1.6s
	time.Sleep(time.Duration(1<<attempt) * 100 * time.Millisecond)
}
```

Update stale stream and consumer references. Audit all applications’ NATS configuration against the actual stream/consumer inventory. Automate this check in CI: compare expected stream names against the live cluster and fail the build if they don’t match.
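A minimal sketch of such a CI gate, assuming the expected stream names are known at build time (the connection URL and stream names below are placeholders):

```go
// ci-check: fail the build when a stream the applications expect is
// missing from the live cluster. URL and names are placeholders.
package main

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect("nats://nats.internal:4222") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Collect the streams that actually exist in the cluster.
	live := map[string]bool{}
	lister := js.StreamNames(ctx)
	for name := range lister.Name() {
		live[name] = true
	}
	if err := lister.Err(); err != nil {
		log.Fatal(err)
	}

	// Names the applications expect; in practice, read from config.
	for _, name := range []string{"ORDERS", "PAYMENTS", "AUDIT"} {
		if !live[name] {
			log.Fatalf("expected stream %q not found in cluster", name)
		}
	}
}
```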
Fix consumer creation race conditions. Use CreateOrUpdateConsumer instead of plain CreateConsumer. The idempotent variant succeeds if the consumer already exists with a compatible configuration, eliminating races during rolling deploys:
```go
// Go — idempotent consumer creation
cons, err := js.CreateOrUpdateConsumer(ctx, "ORDERS", jetstream.ConsumerConfig{
	Durable:   "order-processor",
	AckPolicy: jetstream.AckExplicitPolicy,
})
```

Increase account resource limits. If errors are caused by hitting max_streams or max_consumers limits, increase them in the account JWT or operator configuration. Set limits with headroom — if you have 95 streams and a limit of 100, the next deploy that adds streams will fail.
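To keep that headroom visible, a deploy step or monitor can compare usage to limits. A minimal sketch using the account info API from the nats.go jetstream package (the 90% warning threshold is an arbitrary choice):

```go
// Warn when stream usage approaches the account limit. Assumes an
// existing jetstream.JetStream handle js; 0.9 is an arbitrary threshold.
info, err := js.AccountInfo(ctx)
if err != nil {
	log.Fatal(err)
}
if info.Limits.MaxStreams > 0 { // -1 means unlimited
	used := float64(info.Streams) / float64(info.Limits.MaxStreams)
	if used >= 0.9 {
		log.Printf("WARNING: %d of %d streams in use (%.0f%%)",
			info.Streams, info.Limits.MaxStreams, used*100)
	}
}
```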
Grant JetStream permissions to the correct accounts. If errors are permission denials, update the account or user JWT to include JetStream access. Verify with:
```
nats account info
```

The output shows whether JetStream is enabled and what limits apply.
Centralize stream and consumer provisioning. Define streams and consumers in infrastructure-as-code (Terraform, Helm values, declarative config). Applications bind to pre-existing resources rather than creating them at startup. This eliminates entire categories of errors: “stream not found” (because the stream is always created before the app deploys), race conditions (because consumers are created once, not per-instance), and permission issues (because provisioning runs with admin credentials).
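A sketch of such a provisioning step, run once per deploy with admin credentials rather than at application startup (the stream name and settings are illustrative):

```go
// Idempotent stream provisioning using the nats.go jetstream package.
// Safe to re-run: it creates the stream or updates it to match.
_, err := js.CreateOrUpdateStream(ctx, jetstream.StreamConfig{
	Name:     "ORDERS",             // illustrative name
	Subjects: []string{"orders.>"}, // illustrative subjects
	Storage:  jetstream.FileStorage,
	Replicas: 3,
})
if err != nil {
	log.Fatalf("provisioning ORDERS failed: %v", err)
}
```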
Monitor the advisory stream. Set up persistent monitoring on $JS.EVENT.ADVISORY.> to track error patterns over time. Alert on new error types — a new “stream not found” error likely means a recent deployment broke a reference, and catching it early prevents the error from propagating to all instances.
Implement pre-deploy validation. Before deploying application changes, validate that all referenced streams and consumers exist and are accessible. This catches stale references before they hit production and generate API errors.
Every JetStream API response that returns an error status counts. This includes stream and consumer not found, permission denied, insufficient resources, duplicate name conflicts, invalid configuration, and transient errors during leader elections. The check doesn’t distinguish between error types — any sustained error rate above the threshold fires the alert.
Brief error spikes during leader elections are expected and usually resolve within seconds. The check uses a minimum request volume threshold (default: 100) to avoid false positives on low-traffic clusters. However, if leader elections are frequent (META_003), the cumulative error volume can push the rate above the threshold. In that case, fix the leader stability issue first.
Watch the JetStream advisory stream: nats event --js-advisory. Each error advisory includes the operation type (stream create, consumer create, info lookup), the error code, the account, and the client connection details. For historical analysis, subscribe a durable consumer to $JS.EVENT.ADVISORY.API and persist events to a log or monitoring system.
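A sketch of that pattern with the nats.go jetstream package: capture the advisory subject in a stream, then read it with a durable consumer (stream and consumer names are illustrative):

```go
// Persist API advisories in a stream so a durable consumer can
// replay them later. Names are illustrative.
stream, err := js.CreateOrUpdateStream(ctx, jetstream.StreamConfig{
	Name:     "JS_ADVISORIES",
	Subjects: []string{"$JS.EVENT.ADVISORY.API"},
})
if err != nil {
	log.Fatal(err)
}
cons, err := stream.CreateOrUpdateConsumer(ctx, jetstream.ConsumerConfig{
	Durable:   "advisory-audit",
	AckPolicy: jetstream.AckExplicitPolicy,
})
if err != nil {
	log.Fatal(err)
}
// Forward each advisory to a log or monitoring pipeline.
cc, err := cons.Consume(func(msg jetstream.Msg) {
	log.Printf("advisory: %s", string(msg.Data()))
	_ = msg.Ack()
})
if err != nil {
	log.Fatal(err)
}
defer cc.Stop()
```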
The 1% default catches systemic issues without alerting on occasional transient errors. For critical production systems where any control-plane error is concerning, lowering to 0.5% or even 0.1% provides earlier warning. For development or staging clusters where errors during testing are expected, raising to 5% reduces noise.
Not directly. API errors affect the control plane — stream and consumer management — not the data plane. Messages already stored in streams are safe. However, if consumer creation fails, no application can consume those messages until the consumer is successfully created. And if publish operations fail because a stream can’t be found, messages are rejected at the publish side, which the publishing application must handle.
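On the publish side, a sketch of handling that rejection, assuming the nats.go jetstream package (where a publish with no matching stream surfaces as ErrNoStreamResponse):

```go
// A publish to a subject no stream covers is rejected, not silently
// dropped; the publisher decides whether to retry, queue, or alert.
ack, err := js.Publish(ctx, "orders.created", []byte(`{"id":42}`))
switch {
case errors.Is(err, jetstream.ErrNoStreamResponse):
	// No stream bound to this subject: likely deleted or renamed.
	log.Printf("no stream for subject: %v", err)
case err != nil:
	log.Printf("publish failed: %v", err)
default:
	log.Printf("stored in stream %s at seq %d", ack.Stream, ack.Sequence)
}
```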