
NATS Stale Connections: What They Mean and How to Fix Them

Severity: Warning
Category: Errors
Applies to: Server
Check ID: SERVER_012
Detection threshold: stale_connections counter increases between collection intervals

A stale connection is a NATS client that stopped responding to the server’s PING keepalive requests. The server sends periodic PINGs to every client; when a client fails to reply with a PONG within the configured interval (default: 2 minutes with up to 2 outstanding pings), the server considers the connection stale and closes it. Unlike slow consumer evictions, stale connections indicate that the client itself is unresponsive — not that it can’t keep up with message flow.
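
On the wire, the keepalive exchange is nothing more than two protocol control lines:

# server → client
PING\r\n
# client → server
PONG\r\n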

Why this matters

Stale connections are a signal that something is fundamentally wrong with the client process or the network path between client and server. The NATS ping/pong mechanism is deliberately lightweight — a client only needs to respond to a small control message, not process application data. If a client can’t even manage that, it’s either dead (process crash, OOM kill, frozen), unreachable (network partition, firewall timeout, NAT table expiration), or so overloaded that it can’t service its TCP socket at all.

The problem with stale connections is the delay between the client becoming unresponsive and the server detecting it. With default settings (ping_interval: 2m, ping_max: 2), a server won’t close a stale connection for up to 6 minutes. During that window, the server continues to buffer outbound messages for the dead client, consuming memory. If the client was a subscriber on high-throughput subjects, the buffered data can be substantial — potentially triggering slow consumer conditions on top of the stale connection.

In deployments where connection counts matter — approaching max_connections, or where per-account connection limits are enforced — stale connections waste slots. A server holding 500 stale connections from crashed pods is 500 connections unavailable for live clients. If the stale connection detection is slow and a deployment reschedules those pods, the new instances may be rejected for exceeding connection limits while their predecessors’ zombie connections are still sitting on the server.

Common causes

  • Client process crash or OOM kill. The client process terminates abruptly without closing its NATS connection. The TCP connection remains in an established state from the server’s perspective until the OS TCP keepalive or NATS ping timeout detects the failure. This is the most common cause in containerized environments where pods are killed without graceful shutdown.

  • Network partition or firewall dropping idle connections. A network device between client and server — firewall, NAT gateway, load balancer — drops the connection’s state without sending a TCP RST. The server sends PING but the client never receives it (or the PONG never makes it back). Cloud provider NAT gateways commonly expire idle TCP connections after 5-10 minutes, silently severing connections that haven’t exchanged data. Check firewall idle-timeout settings and ensure they exceed the NATS ping interval.

  • Client event loop blocked and unable to process PINGs. The client process is alive but frozen — deadlocked, stuck in a long GC pause, or blocked on a system call that never returns. The process can’t respond to pings because no code is executing. This is common with JVM applications experiencing full GC pauses or Go applications blocked on a mutex.

  • Load balancer or proxy in the path. TCP proxies and load balancers may absorb or delay NATS protocol messages, including pings and pongs. Some proxies terminate idle connections without notifying both sides. If the load balancer sits between client and server, the server pings the proxy, and the proxy may not forward the ping to the actual client.

  • Client library misconfiguration. Client libraries answer server PINGs automatically at the protocol layer, but some allow tuning or disabling parts of the ping/pong handling, and custom or minimal protocol implementations occasionally fail to reply at all. A client that does not return a PONG within the server’s window is marked stale even though the process may be perfectly healthy.
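
Whichever cause applies, registering client-side connection event handlers makes these disconnects far easier to attribute after the fact. A minimal sketch using nats.go’s standard callbacks (the URL, connection name, and log messages are illustrative):

// Go — log connection lifecycle events for attribution
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222",
        nats.Name("order-processor-v2"),
        // Fires when the connection drops, with the triggering error.
        nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
            log.Printf("NATS disconnected: %v", err)
        }),
        // Fires after a successful reconnect.
        nats.ReconnectHandler(func(nc *nats.Conn) {
            log.Printf("NATS reconnected to %s", nc.ConnectedUrl())
        }),
        // Fires when the connection is permanently closed.
        nats.ClosedHandler(func(_ *nats.Conn) {
            log.Print("NATS connection closed")
        }),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    // ... application work ...
}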

How to diagnose

Confirm stale connection events are occurring

Check the server logs for stale connection warnings:

Terminal window
# In server logs, look for:
# [WRN] Stale Client Connection - ...
grep -i "stale" /var/log/nats/nats-server.log

The log entry includes the connection ID (cid), client name, and the connection’s last known state.

Identify connections with high idle time

Terminal window
nats server request connections --sort idle

Connections with idle times approaching or exceeding the ping_interval × (ping_max + 1) window are at risk of being marked stale. Connections already marked as stale will have been disconnected and won’t appear in this list — look at the server’s stale_connections counter instead.
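
The same information is available from the monitoring endpoint. As a sketch, the following polls /connz and flags connections whose idle time is approaching the window (the 4-minute threshold assumes the default ping settings; the URL is an assumption — adjust both for your deployment):

// Go — flag connections idling near the stale-detection window
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "time"
)

type connz struct {
    Connections []struct {
        Cid  uint64 `json:"cid"`
        Name string `json:"name"`
        Idle string `json:"idle"` // duration string, e.g. "1m5s"
    } `json:"connections"`
}

func main() {
    resp, err := http.Get("http://localhost:8222/connz?sort=idle")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var cz connz
    if err := json.NewDecoder(resp.Body).Decode(&cz); err != nil {
        log.Fatal(err)
    }

    // Assumed defaults: ping_interval 2m, ping_max 2 → ~6m detection window.
    threshold := 4 * time.Minute
    for _, c := range cz.Connections {
        idle, err := time.ParseDuration(c.Idle)
        if err != nil {
            continue // skip idle values that don't parse as a Go duration
        }
        if idle >= threshold {
            fmt.Printf("cid=%d name=%q idle=%s\n", c.Cid, c.Name, idle)
        }
    }
}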

Check server ping configuration

Terminal window
curl -s http://localhost:8222/varz | jq '{ping_interval: .ping_interval, ping_max: .ping_max}'

The ping_interval is in nanoseconds. The default is 120,000,000,000 (2 minutes). The server will close a connection after ping_interval × (ping_max + 1) of no response — by default, ~6 minutes.

Monitor the stale connections counter

Terminal window
curl -s http://localhost:8222/varz | jq '.stale_connections'

This is a cumulative counter. A rising count over your monitoring interval indicates ongoing stale connection events.
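
To turn the cumulative counter into a rate, sample it twice and take the difference. A minimal sketch (the stale_connections field name follows this check’s detection threshold; the address and sampling interval are assumptions):

// Go — derive a stale-connection rate from the cumulative counter
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "time"
)

func staleCount(url string) (int64, error) {
    resp, err := http.Get(url)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()

    var varz struct {
        StaleConnections int64 `json:"stale_connections"`
    }
    err = json.NewDecoder(resp.Body).Decode(&varz)
    return varz.StaleConnections, err
}

func main() {
    const url = "http://localhost:8222/varz" // assumed monitoring address
    before, err := staleCount(url)
    if err != nil {
        log.Fatal(err)
    }
    time.Sleep(60 * time.Second) // sampling interval
    after, err := staleCount(url)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("stale connections in the last minute: %d\n", after-before)
}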

Check for network path issues

If stale connections correlate with specific client locations or networks:

Terminal window
# Per-connection round-trip time appears in the rtt field of /connz
curl -s 'http://localhost:8222/connz' | jq '.connections[] | {cid, name, rtt}'

Clients with high or unstable RTT are more susceptible to stale connection detection, especially with aggressive ping settings.
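
From the client side, nats.go can measure the live round-trip time directly, which helps correlate unstable network paths with stale disconnects (the URL and 30-second cadence are illustrative):

// Go — periodically measure client-to-server RTT
package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222")
    if err != nil {
        log.Fatal(err)
    }

    for range time.Tick(30 * time.Second) {
        // RTT sends a PING and times the matching PONG.
        rtt, err := nc.RTT()
        if err != nil {
            log.Printf("RTT check failed: %v", err)
            continue
        }
        log.Printf("NATS RTT: %s", rtt)
    }
}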

How to fix it

Immediate: detect stale connections faster

Tighten ping interval and max missed pongs. The default 2-minute interval with 2 missed pongs means up to 6 minutes before detection. For environments where fast detection matters:

# Server configuration
ping_interval: "30s"
ping_max: 2

This detects unresponsive clients within ~90 seconds instead of ~6 minutes. The trade-off is slightly more control traffic — each client receives a ping every 30 seconds instead of every 2 minutes.
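For example, with 10,000 connected clients a 30-second interval works out to roughly 333 server-originated PINGs per second, versus about 83 per second at the 2-minute default — noticeable, but small next to typical message traffic.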

Reload the server configuration:

Terminal window
nats-server --signal reload

Ensure clients handle pings correctly. Most NATS client libraries handle server pings automatically. Verify your client library is configured to respond to pings — this is default behavior, but some advanced configurations may inadvertently disable it:

// Go — ping settings (defaults are usually fine)
nc, err := nats.Connect(url,
    nats.PingInterval(60*time.Second), // Client-side ping interval
    nats.MaxPingsOutstanding(3),
)

Short-term: fix the underlying disconnection causes

Implement graceful shutdown in client applications. When a process terminates, close the NATS connection before exiting. This sends a proper disconnect to the server, freeing the connection slot immediately:

// Go — graceful shutdown
package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222")
    if err != nil {
        log.Fatal(err)
    }

    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
    <-quit

    nc.Drain() // Gracefully drain and close
}
# Python — graceful shutdown
import asyncio
import signal

import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")

    loop = asyncio.get_running_loop()
    loop.add_signal_handler(signal.SIGTERM, lambda: asyncio.ensure_future(nc.drain()))

    await asyncio.Future()  # Run forever

if __name__ == "__main__":
    asyncio.run(main())

In Kubernetes, add a preStop hook — and a terminationGracePeriodSeconds long enough for the drain to complete — to give the application time to drain its NATS connection before the container is killed.

Configure NAT gateway and firewall keepalives. If clients connect through NAT gateways or firewalls, ensure TCP keepalive intervals are shorter than the device’s connection timeout. NATS client libraries typically set TCP keepalive, but the interval may be longer than the NAT gateway’s idle timeout:

// Go — set TCP keepalive via custom dialer (if defaults aren't sufficient)
nc, err := nats.Connect(url,
    nats.SetCustomDialer(&net.Dialer{
        KeepAlive: 30 * time.Second,
    }),
)

Long-term: design for connection hygiene

Monitor stale connection trends. A rising stale connection rate indicates a systemic issue — deployment patterns, network infrastructure, or client lifecycle problems that need architectural attention.

Set per-account connection limits. This ensures that even if one account’s clients are producing stale connections, they can’t exhaust connection capacity for other accounts:

accounts {
    ORDERS {
        users: [{user: orders, password: secret}]
        limits {
            conn: 500
        }
    }
}

Use connection names for attribution. Every NATS client should set a connection name at connect time. Without names, stale connections in logs and reports are anonymous — you can’t tell which application or team is responsible:

nc, err := nats.Connect(url, nats.Name("order-processor-v2"))

Frequently asked questions

What’s the difference between a stale connection and a slow consumer?

A stale connection (SERVER_012) means the client stopped responding to pings entirely — the client is unreachable or frozen. A slow consumer (SERVER_004) means the client is alive and responding to pings but can’t keep up with the message delivery rate. The detection mechanisms are different: stale connections are detected by the ping/pong protocol, while slow consumers are detected by outbound buffer pressure. A frozen client may trigger both checks — first stalled client warnings (SERVER_013), then stale connection detection when pings fail.

How does NATS detect stale connections?

The server sends a PING control message to each client at the configured ping_interval (default: 2 minutes). The client must respond with a PONG. If the server sends ping_max consecutive PINGs (default: 2) without receiving a PONG, it closes the connection and increments the stale_connections counter. The total detection time is approximately ping_interval × (ping_max + 1).
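With the defaults, that works out to 2 minutes × (2 + 1) = 6 minutes.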

Do stale connections affect server performance?

Yes, in two ways. First, the server maintains memory for each connection’s read/write buffers and subscription state — stale connections consume these resources without providing value. Second, if the stale client was subscribed to active subjects, the server buffers outbound messages for the dead client until it detects the stale state, consuming additional memory. In high-throughput environments, this buffering can be significant during the detection window.

Can I force-close a specific connection?

Yes. Use the server’s /connz endpoint to find the connection ID, then use the admin API to close it:

Terminal window
# Find the connection
curl -s 'http://localhost:8222/connz?sort=idle&limit=5' | jq '.connections[] | {cid, name, idle}'
# The NATS CLI does not have a direct "kick" command, but you can
# close connections via the system account using request/reply
nats request '$SYS.REQ.SERVER.PING.CONNZ' '' --replies 1

For programmatic connection management, use account-level connection limits or the system account API.

Should I set a very aggressive ping interval to detect stale connections faster?

Be cautious. A very short interval (e.g., 5 seconds) generates significant control traffic in deployments with thousands of connections. It also increases the risk of false positives — a client that’s temporarily slow (not dead) may miss a ping deadline and get disconnected unnecessarily. A ping_interval of 20-30 seconds with ping_max: 2 is a reasonable balance between detection speed and reliability for most deployments.
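For example, ping_interval: 30s with ping_max: 2 gives a worst-case detection time of 30 × (2 + 1) = 90 seconds.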

Proactive monitoring for NATS stale connections with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial