A stalled client is a NATS client for which the server's outbound data is accumulating faster than the client can read it. Stalled clients indicate fast producers blocked by downstream backpressure — the server throttles the producer because a consumer cannot keep up. The server detects that the write to the client's connection is taking too long — approaching the write_deadline — and flags the client as stalled. This is normal flow control and an early warning that the client is on the path to slow consumer eviction if the condition persists.
Stalled client events are your intervention window. A slow consumer disconnection (SERVER_004) is the final action — the server gives up on the client and drops the connection, losing all buffered and in-flight messages for core NATS subscribers. A stalled client event fires before that happens, when the server’s write to the client is taking longer than expected but hasn’t yet exceeded the write_deadline. This is the difference between catching a problem and cleaning up after one.
The gap between “stalled” and “disconnected” is often measured in seconds. The server’s write_deadline defaults to 10 seconds — once a write to a client’s socket takes longer than that, the connection is closed. Stalled client events fire when the write is slow but hasn’t timed out yet. In a system doing 100,000 msg/s, those few seconds represent hundreds of thousands of messages. If you can react to the stalled client warning — by scaling up consumers, reducing publish rates, or fixing a downstream bottleneck — you prevent the disconnection and the associated data loss.
Stalled client events also serve as a leading indicator of broader system health. A single stalled client might be an application problem — one slow consumer in a sea of healthy ones. But multiple clients stalling simultaneously points to a systemic issue: a network degradation, server resource pressure, or a traffic spike that’s overwhelming a class of consumers. Catching the pattern early, before it cascades into mass disconnections, is the difference between a minor operational event and a production incident.
Synchronous processing in the message handler. The subscriber does blocking work — database writes, HTTP calls, file I/O — inside the message callback. The NATS client library can't read data from its socket while the callback is blocked, so pending data accumulates on the server side. This is by far the most common cause; a sketch of the anti-pattern follows this list.
Downstream dependency slowdown. The subscriber is fast, but something it depends on — a database, an API, a cache — is responding slowly. Each message takes longer to process, the internal queue fills, and the client falls behind the inbound rate. The stall appears to be in the NATS client, but the root cause is elsewhere.
Network congestion between server and client. The server writes data to the client’s TCP socket, but the network path is congested. TCP send buffers fill, the server’s write blocks, and the write approaches the deadline. This is distinct from application-level slowness — the client may be perfectly fast, but the pipe between server and client is the bottleneck.
High fan-out overwhelming individual subscribers. A subscriber on a wildcard subject like events.> receives the aggregate traffic of thousands of specific subjects. No individual subject is particularly hot, but the combined rate exceeds what a single subscriber can process. The client stalls because the inbound rate is simply too high for a single reader.
Garbage collection pauses. Runtime GC pauses (JVM, Go, .NET) freeze the client’s read loop. During the pause, the server continues writing data to the socket, and the OS receive buffer fills. If the pause is long enough — and the message rate high enough — the server’s write blocks, triggering a stalled client event.
Insufficient client-side buffer capacity. The NATS client library maintains a pending message buffer between the network reader and the application callback. If this buffer is too small, it fills quickly, the network reader blocks, the TCP receive window closes, and the server’s write stalls. Many client libraries default to 64MB, which may be inadequate for high-throughput subjects.
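To make the most common cause concrete, here is a minimal sketch of the blocking-handler anti-pattern in Go (saveOrderToDB is a hypothetical synchronous helper):

```go
// Anti-pattern — blocking I/O inside the message callback.
// While saveOrderToDB blocks, messages pile up in the client's
// pending buffer, the reader backs up, and the server's write
// to this connection eventually stalls.
nc.Subscribe("orders.>", func(msg *nats.Msg) {
	if err := saveOrderToDB(msg.Data); err != nil { // synchronous DB write
		log.Printf("save failed: %v", err)
	}
})
```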
Check per-connection pending bytes on the monitoring endpoint:

```bash
curl -s 'http://localhost:8222/connz?sort=pending&limit=20' | jq '.connections[]'
```

Connections with high pending bytes are either currently stalled or at risk of stalling. The pending value shows how much data the server has buffered for the client but hasn't been able to write yet.
Track the server-wide stalled client counter:

```bash
curl -s http://localhost:8222/varz | jq '.stalled_clients'
```

This counter increments each time a client write stalls. A rising count indicates ongoing stalled client events. Compare it against the slow_consumers counter to see whether stalled clients are progressing to full disconnections.
Server logs record stalled client events with connection details:
```bash
grep -i "stalled" /var/log/nats/nats-server.log
```

The log entry includes the connection ID and client name, allowing you to map stalled events to specific applications.
Confirm the configured write deadline:

```bash
curl -s http://localhost:8222/varz | jq '.write_deadline'
```

The write_deadline value (in nanoseconds) determines how long the server waits before disconnecting a slow writer. The default is 10 seconds (10,000,000,000 ns). Stalled client events fire when writes approach this deadline.
Compare inbound rates against consumer capacity:

```bash
# Check account-level message rates
nats server report accounts
```

If the publish rate on the affected subjects is significantly higher than what a single subscriber can process, the stall is a throughput mismatch — not a transient issue.
Raise the per-subscription pending buffer. In nats.go, pending limits are set on each subscription, not on the connection. A larger limit gives the client more room to buffer messages and absorb temporary slowdowns:
```go
// Go client — set pending limits on the subscription
sub, _ := nc.Subscribe("orders.>", handler)
sub.SetPendingLimits(1_000_000, 256*1024*1024) // 1M msgs, 256MB
```

This is a buffer, not a fix. It buys time for transient slowdowns but doesn't address a sustained throughput mismatch.
Temporarily increase the write deadline to prevent disconnection while you investigate:
```
write_deadline: "30s"
```

Then reload the server configuration:

```bash
nats-server --signal reload
```

Only do this as a stopgap. A 30-second write deadline means the server holds memory for slow clients much longer, increasing memory pressure.
Decouple message reading from processing. The message callback should enqueue work, not perform it. A separate worker pool processes the queue, letting the NATS reader run at full speed:
```go
// Go — async processing pattern
work := make(chan *nats.Msg, 50_000)

nc.Subscribe("orders.>", func(msg *nats.Msg) {
	select {
	case work <- msg:
	default:
		// Queue full — apply backpressure or log
	}
})

// Worker pool
for i := 0; i < 20; i++ {
	go func() {
		for msg := range work {
			processOrder(msg)
		}
	}()
}
```

```python
# Python — async processing with queue
import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")
    queue = asyncio.Queue(maxsize=50_000)

    async def handler(msg):
        await queue.put(msg)

    await nc.subscribe("orders.>", cb=handler)

    # Worker tasks
    async def worker():
        while True:
            msg = await queue.get()
            await process_order(msg)

    workers = [asyncio.create_task(worker()) for _ in range(20)]
    await asyncio.gather(*workers)

if __name__ == "__main__":
    asyncio.run(main())
```

Add queue group subscribers to distribute the load across multiple consumer instances:
```bash
# Each instance joins the same queue group — NATS distributes messages
nats sub "orders.>" --queue order-processors
```

Adding queue group members (consumer instances) increases throughput roughly linearly: if one subscriber handles 10,000 msg/s, four subscribers handle ~40,000 msg/s.
Improve consumer processing speed. Profile your message handlers to find bottlenecks. Ensure callbacks are non-blocking and offload heavy work to worker pools.
Fix the downstream dependency. If the stall is caused by a slow database or API, profile the dependency separately. Common fixes: add connection pooling, increase database replica count, add caching, or batch writes.
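As one illustration of batching, here is a minimal sketch that drains the worker queue from the pattern above and writes downstream in batches (flushBatch is a hypothetical helper that performs one bulk database write):

```go
// runBatcher groups messages into batches to cut per-message
// round trips to the downstream dependency.
func runBatcher(work <-chan *nats.Msg) {
	const batchSize = 500
	batch := make([]*nats.Msg, 0, batchSize)
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case msg := <-work:
			batch = append(batch, msg)
			if len(batch) >= batchSize {
				flushBatch(batch) // hypothetical bulk write
				batch = batch[:0]
			}
		case <-ticker.C:
			if len(batch) > 0 { // flush partial batches on a timer
				flushBatch(batch)
				batch = batch[:0]
			}
		}
	}
}
```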
Use JetStream pull consumers for flow control. Pull consumers let the client request messages at its own pace — the server never pushes faster than the client can handle, eliminating the stall condition entirely:
```go
// Go — JetStream pull consumer
js, _ := nc.JetStream()

sub, _ := js.PullSubscribe("orders.>", "order-processor",
	nats.MaxAckPending(1000),
)

for {
	msgs, _ := sub.Fetch(100) // Client controls the rate
	for _, msg := range msgs {
		processOrder(msg)
		msg.Ack()
	}
}
```

Partition high-fan-out subjects. Instead of one subscriber for events.>, split by prefix:
```go
// Partition consumers by subject prefix
nc.QueueSubscribe("events.orders.>", "processors", orderHandler)
nc.QueueSubscribe("events.inventory.>", "processors", inventoryHandler)
nc.QueueSubscribe("events.shipping.>", "processors", shippingHandler)
```

Each consumer handles a fraction of the total traffic, reducing per-consumer load.
Implement application-level monitoring. Track the depth of your internal processing queue and the time-per-message in your handlers. Alert when queue depth grows or processing time increases — these are leading indicators that a stall is coming.
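In nats.go, the subscription already exposes the relevant counters; here is a minimal sampling sketch (the thresholds are illustrative):

```go
// Sample subscription health periodically. Pending() reports messages
// and bytes buffered but not yet handled by the callback; Dropped()
// reports messages discarded after pending limits were exceeded.
go func() {
	for range time.Tick(5 * time.Second) {
		msgs, bytes, _ := sub.Pending()
		dropped, _ := sub.Dropped()
		if msgs > 100_000 || dropped > 0 { // illustrative thresholds
			log.Printf("backlog: %d msgs / %d bytes pending, %d dropped",
				msgs, bytes, dropped)
		}
	}
}()
```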
A stalled client (SERVER_013) is the early warning. The server’s write to the client is slow but hasn’t exceeded the write_deadline yet — the client is still connected. A slow consumer (SERVER_004) is the final action — the server has exceeded the write deadline and disconnected the client. If you see stalled client events progressing to slow consumer events, the stall is sustained and the client can’t recover fast enough. If you see stalled client events without corresponding slow consumer events, the stalls are transient — the client recovers before the deadline.
Does increasing the write_deadline prevent stalled client events? No. Increasing the write_deadline prevents slow consumer disconnections, not stalled client events. A stalled client event fires when the server detects that a write to the client is taking longer than expected — this detection happens before the deadline is reached. A longer deadline gives the client more time to recover before being disconnected, but the stall itself is still detected and logged.
Can JetStream consumers stall too? Yes. JetStream push consumers use the same underlying NATS connection and are subject to the same write mechanics. The difference is what happens next: if a JetStream consumer gets disconnected as a slow consumer, it can resume from its last acknowledged position on reconnect. A core NATS subscriber loses all buffered messages permanently. For workloads where stalling is a risk, JetStream pull consumers are strongly preferred because they eliminate server-side push pressure entirely.
Monitor the stalled_clients counter from the /varz endpoint. Alert when the delta between successive samples is greater than zero.
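A minimal polling sketch in Go, assuming the monitoring endpoint is reachable at localhost:8222 and that your server version exposes the stalled_clients field in /varz:

```go
// Watch /varz and alert when the stalled_clients counter advances.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

func main() {
	var last int
	for range time.Tick(30 * time.Second) {
		resp, err := http.Get("http://localhost:8222/varz")
		if err != nil {
			log.Printf("varz poll failed: %v", err)
			continue
		}
		var v struct {
			StalledClients int `json:"stalled_clients"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
			log.Printf("decode failed: %v", err)
		}
		resp.Body.Close()
		if delta := v.StalledClients - last; delta > 0 {
			log.Printf("ALERT: %d new stalled client events", delta)
		}
		last = v.StalledClients
	}
}
```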
Synadia Insights evaluates this automatically every collection epoch and correlates stalled client events with slow consumer events across your deployment, giving you a unified view of consumer health.
It depends on the pattern. A single stalled client event that doesn’t progress to a slow consumer disconnection is informational — the client recovered. Repeated stalled client events for the same client indicate a persistent throughput mismatch that will eventually result in disconnection. Multiple clients stalling simultaneously is a systemic issue that warrants immediate investigation — the next step is mass disconnections and potential data loss.