A stale connection is a NATS client that stopped responding to the server’s PING keepalive requests. The server sends periodic PINGs to every client; when a client fails to reply with a PONG within the configured interval (default: 2 minutes with up to 2 outstanding pings), the server considers the connection stale and closes it. Unlike slow consumer evictions, stale connections indicate that the client itself is unresponsive — not that it can’t keep up with message flow.
Stale connections are a signal that something is fundamentally wrong with the client process or the network path between client and server. The NATS ping/pong mechanism is deliberately lightweight — a client only needs to respond to a small control message, not process application data. If a client can’t even manage that, it’s either dead (process crash, OOM kill, frozen), unreachable (network partition, firewall timeout, NAT table expiration), or so overloaded that it can’t service its TCP socket at all.
The problem with stale connections is the delay between the client becoming unresponsive and the server detecting it. With default settings (ping_interval: 2m, ping_max: 2), a server won’t close a stale connection for up to 6 minutes. During that window, the server continues to buffer outbound messages for the dead client, consuming memory. If the client was a subscriber on high-throughput subjects, the buffered data can be substantial — potentially triggering slow consumer conditions on top of the stale connection.
In deployments where connection counts matter — approaching max_connections, or where per-account connection limits are enforced — stale connections waste slots. A server holding 500 stale connections from crashed pods is 500 connections unavailable for live clients. If the stale connection detection is slow and a deployment reschedules those pods, the new instances may be rejected for exceeding connection limits while their predecessors’ zombie connections are still sitting on the server.
Client process crash or OOM kill. The client process terminates abruptly without closing its NATS connection. The TCP connection remains in an established state from the server’s perspective until the OS TCP keepalive or NATS ping timeout detects the failure. This is the most common cause in containerized environments where pods are killed without graceful shutdown.
Network partition or firewall dropping idle connections. A network device between client and server — firewall, NAT gateway, load balancer — drops the connection’s state without sending a TCP RST. The server sends PING but the client never receives it (or the PONG never makes it back). Cloud provider NAT gateways commonly expire idle TCP connections after 5-10 minutes, silently severing connections that haven’t exchanged data. Check firewall idle-timeout settings and ensure they exceed the NATS ping interval.
Client event loop blocked and unable to process PINGs. The client process is alive but frozen — deadlocked, stuck in a long GC pause, or blocked on a system call that never returns. The process can’t respond to pings because no code is executing. This is common with JVM applications experiencing full GC pauses or Go applications blocked on a mutex.
Load balancer or proxy in the path. TCP proxies and load balancers may absorb or delay NATS protocol messages, including pings and pongs. Some proxies terminate idle connections without notifying both sides. If the load balancer sits between client and server, the server pings the proxy, and the proxy may not forward the ping to the actual client.
Client library misconfiguration. Some NATS client libraries allow disabling the ping/pong mechanism or setting excessively long intervals. If the client’s ping interval is much longer than the server’s, the server may consider the client stale before the client has a chance to send its own ping.
Check the server logs for stale connection warnings:
```shell
# In server logs, look for:
# [WRN] Stale Client Connection - ...
grep -i "stale" /var/log/nats/nats-server.log
```

The log entry includes the connection ID (cid), client name, and the connection’s last known state.
```shell
nats server request connections --sort idle
```

Connections with idle times approaching or exceeding the ping_interval × (ping_max + 1) window are at risk of being marked stale. Connections already marked stale have been disconnected and won’t appear in this list — look at the server’s stale_connections counter instead.
```shell
curl -s http://localhost:8222/varz | jq '{ping_interval: .ping_interval, ping_max: .ping_max}'
```

The ping_interval is reported in nanoseconds. The default is 120,000,000,000 (2 minutes). The server will close a connection after ping_interval × (ping_max + 1) of no response — by default, ~6 minutes.
```shell
curl -s http://localhost:8222/varz | jq '.stale_connections'
```

This is a cumulative counter. A rising count over your monitoring interval indicates ongoing stale connection events.
If stale connections correlate with specific client locations or networks:
```shell
# Check RTT to clients that are going stale
nats server list
```

Clients with high or unstable RTT are more susceptible to stale connection detection, especially with aggressive ping settings.
Tighten ping interval and max missed pongs. The default 2-minute interval with 2 missed pongs means up to 6 minutes before detection. For environments where fast detection matters:
```
# Server configuration
ping_interval: "30s"
ping_max: 2
```

This detects unresponsive clients within ~90 seconds instead of ~6 minutes. The trade-off is slightly more control traffic — each client receives a ping every 30 seconds instead of every 2 minutes.
Reload the server configuration:
```shell
nats-server --signal reload
```

Ensure clients handle pings correctly. Most NATS client libraries handle server pings automatically. Verify your client library is configured to respond to pings — this is default behavior, but some advanced configurations may inadvertently disable it:
```go
// Go — ping settings (defaults are usually fine)
nc, err := nats.Connect(url,
	nats.PingInterval(60*time.Second), // Client-side ping interval
	nats.MaxPingsOutstanding(3),
)
```

Implement graceful shutdown in client applications. When a process terminates, close the NATS connection before exiting. This sends a proper disconnect to the server, freeing the connection slot immediately:
```go
// Go — graceful shutdown
package main

import (
	"os"
	"os/signal"
	"syscall"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, _ := nats.Connect("nats://localhost:4222")

	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
	<-quit

	nc.Drain() // Gracefully drain and close
}
```

```python
# Python — graceful shutdown
import asyncio
import signal

import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")

    loop = asyncio.get_running_loop()
    loop.add_signal_handler(signal.SIGTERM, lambda: asyncio.ensure_future(nc.drain()))

    await asyncio.Future()  # Run forever

if __name__ == "__main__":
    asyncio.run(main())
```

In Kubernetes, add a preStop hook to give the application time to drain its NATS connection before the container is killed.
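One way to wire that up, sketched as a pod spec fragment (the container name and durations are illustrative assumptions, not from the original): preStop runs before Kubernetes sends SIGTERM, so a short sleep delays the signal slightly, while terminationGracePeriodSeconds bounds how long the drain handler may take before a hard kill.

```yaml
# Illustrative fragment: names and durations are assumptions.
spec:
  terminationGracePeriodSeconds: 30     # upper bound for drain + exit
  containers:
    - name: order-processor             # hypothetical container name
      lifecycle:
        preStop:
          exec:
            # Brief pause before SIGTERM so in-flight traffic settles;
            # the app's SIGTERM handler then drains the NATS connection.
            command: ["sleep", "5"]
```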
Configure NAT gateway and firewall keepalives. If clients connect through NAT gateways or firewalls, ensure TCP keepalive intervals are shorter than the device’s connection timeout. NATS client libraries typically set TCP keepalive, but the interval may be longer than the NAT gateway’s idle timeout:
```go
// Go — set TCP keepalive via custom dialer (if defaults aren't sufficient)
nc, err := nats.Connect(url,
	nats.SetCustomDialer(&net.Dialer{
		KeepAlive: 30 * time.Second,
	}),
)
```

Monitor stale connection trends. A rising stale connection rate indicates a systemic issue — deployment patterns, network infrastructure, or client lifecycle problems that need architectural attention.
Set per-account connection limits. This ensures that even if one account’s clients are producing stale connections, they can’t exhaust connection capacity for other accounts:
```
accounts {
  ORDERS {
    users: [{user: orders, password: secret}]
    limits {
      conn: 500
    }
  }
}
```

Use connection names for attribution. Every NATS client should set a connection name at connect time. Without names, stale connections in logs and reports are anonymous — you can’t tell which application or team is responsible:
```go
nc, err := nats.Connect(url, nats.Name("order-processor-v2"))
```

A stale connection (SERVER_012) means the client stopped responding to pings entirely — the client is unreachable or frozen. A slow consumer (SERVER_004) means the client is alive and responding to pings but can’t keep up with the message delivery rate. The detection mechanisms are different: stale connections are detected by the ping/pong protocol, while slow consumers are detected by outbound buffer pressure. A frozen client may trigger both checks — first stalled client warnings (SERVER_013), then stale connection detection when pings fail.
How does the server detect a stale connection?
The server sends a PING control message to each client at the configured ping_interval (default: 2 minutes). The client must respond with a PONG. If the server sends ping_max consecutive PINGs (default: 2) without receiving a PONG, it closes the connection and increments the stale_connections counter. The total detection time is approximately ping_interval × (ping_max + 1).
Do stale connections consume server resources?
Yes, in two ways. First, the server maintains memory for each connection’s read/write buffers and subscription state — stale connections consume these resources without providing value. Second, if the stale client was subscribed to active subjects, the server buffers outbound messages for the dead client until it detects the stale state, consuming additional memory. In high-throughput environments, this buffering can be significant during the detection window.
Can I manually close a stale connection?
Yes. Use the server’s /connz endpoint to find the connection ID, then use the admin API to close it:
```shell
# Find the connection (quote the URL so the shell doesn't treat & as a job separator)
curl -s "http://localhost:8222/connz?sort=idle&limit=5" | jq '.connections[] | {cid, name, idle}'
```
```shell
# The NATS CLI does not have a direct "kick" command, but you can
# close connections via the system account using request/reply
nats request '$SYS.REQ.SERVER.PING.CONNZ' '' --replies 1
```

For programmatic connection management, use account-level connection limits or the system account API.
Should I set a very short ping_interval for faster detection?
Be cautious. A very short interval (e.g., 5 seconds) generates significant control traffic in deployments with thousands of connections. It also increases the risk of false positives — a client that’s temporarily slow (not dead) may miss a ping deadline and get disconnected unnecessarily. A ping_interval of 20-30 seconds with ping_max: 2 is a reasonable balance between detection speed and reliability for most deployments.