
NATS Connection Stopped: What It Means and How to Fix It

Severity: Info
Category: Errors
Applies to: Connection
Check ID: CONN_003
Detection threshold: connection closed with a non-empty stop reason

A connection stopped event means a NATS client connection was closed and the server recorded a specific reason for the disconnection. While individual disconnections are normal in any distributed system, patterns of unexpected stops — slow consumer evictions, authentication failures, or repeated timeouts — signal problems that need investigation.

Why this matters

Not all disconnections are equal. A graceful client shutdown is expected. A slow consumer eviction means messages were lost. An authentication failure means a credential is wrong or expired. The stop reason is the diagnostic signal that tells you which category you’re dealing with — without it, all disconnections look the same in your monitoring dashboards.

Connection stopped events become critical when they indicate a pattern. A single auth failure is a typo. Fifty auth failures from the same client name in an hour is a misconfigured deployment or a credential that was rotated without updating all consumers. A steady stream of slow consumer stops means your subscriber architecture can’t keep up with your publish rate. The stop reason transforms a generic “connection count dropped” alert into an actionable diagnosis.
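The rate-based pattern described above can be sketched as a small script over closed-connection records. The record shape mirrors the connz output shown later on this page; the data and the per-hour threshold are illustrative:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Illustrative closed-connection records, shaped like connz "connections" entries
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
closed = [
    {"name": "billing-service", "reason": "Authentication Failure",
     "stop": (now - timedelta(minutes=i)).isoformat()}
    for i in range(50)
] + [
    {"name": "audit-service", "reason": "Client Closed",
     "stop": (now - timedelta(minutes=5)).isoformat()},
]

AUTH_SPIKE_THRESHOLD = 10  # illustrative: auth failures per client per hour

# Count auth failures per client name within the last hour
window_start = now - timedelta(hours=1)
auth_failures = Counter(
    c["name"] for c in closed
    if c["reason"] == "Authentication Failure"
    and datetime.fromisoformat(c["stop"]) >= window_start
)
suspects = {name: n for name, n in auth_failures.items() if n >= AUTH_SPIKE_THRESHOLD}
print(suspects)  # {'billing-service': 50}
```

A single failure passes under the threshold; a misconfigured deployment shows up as a spike attributed to one client name.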

In multi-tenant deployments using accounts, connection stopped tracking is essential for tenant-level SLA monitoring. A tenant whose connections are repeatedly stopped for slow consumer reasons is experiencing data loss that they may not even be aware of. Proactively surfacing these events — with the specific reason — demonstrates operational maturity and prevents surprise escalations.

Common causes

  • Slow consumer eviction. The client couldn’t read messages fast enough, the server’s outbound buffer filled, and the server disconnected the client to protect system throughput. This is the most operationally significant stop reason because it implies message loss for core NATS subscribers.

  • Authentication or authorization failure. The client presented invalid credentials, an expired JWT, or attempted to subscribe to or publish on a subject it doesn’t have permission for. The server closes the connection immediately with a reason like Authentication Failure or Authorization Violation.

  • Client ping/pong timeout. NATS uses periodic ping/pong frames to detect dead connections. If a client fails to respond to pings within the configured interval (default: two missed pings at 2-minute intervals), the server closes the connection. This typically indicates the client process is hung, frozen by GC, or the network path has failed.

  • Server-initiated shutdown or lame duck mode. When a server enters lame duck mode for graceful shutdown, it sends a lame duck advisory and stops accepting new connections. Existing connections drain over the lame duck grace period. Connections still active when the grace period expires are forcibly closed.

  • Max connections or max payload exceeded. The server rejected the connection because the configured maximum was reached, or a client attempted to publish a message exceeding the max payload size. The stop reason identifies which limit was hit.

  • TLS handshake failure. The client and server couldn’t negotiate a TLS connection — expired certificate, untrusted CA, protocol version mismatch. The connection is closed during the handshake phase with a TLS-related reason.
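Several of the limits behind these stop reasons are tunable in the server configuration. A sketch of the relevant options (values are illustrative, not recommendations):

```
# nats-server.conf
ping_interval: "2m"         # how often the server pings idle clients
ping_max: 2                 # unanswered pings before the connection is closed
lame_duck_duration: "2m"    # drain window when entering lame duck mode
lame_duck_grace_period: "10s"
max_connections: 64K        # exceeding this is a stop reason
max_payload: 1MB            # exceeding this is a stop reason
```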

How to diagnose

View recently stopped connections

The NATS server tracks recently closed connections. Query them via the monitoring endpoint:

Terminal window
curl -s "http://localhost:8222/connz?state=closed&sort=stop&limit=20" | \
  jq '.connections[] | {cid, name, ip, reason, stop}'

Key fields in the response:

  • reason — Why the connection was closed (e.g., Slow Consumer, Authentication Failure)
  • stop — Timestamp of the disconnection
  • name — Client name (if set at connect time)
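A minimal sketch of consuming those fields programmatically, using an illustrative response body in place of a live /connz query:

```python
import json

# Illustrative /connz?state=closed response body
body = json.loads("""
{
  "connections": [
    {"cid": 41, "name": "billing-service", "ip": "10.0.0.7",
     "reason": "Slow Consumer", "stop": "2024-01-01T12:00:00Z"},
    {"cid": 42, "name": "", "ip": "10.0.0.9",
     "reason": "Client Closed", "stop": "2024-01-01T12:01:00Z"}
  ]
}
""")

# Fall back to the ip when the client did not set a connection name
summaries = [
    f'{conn["name"] or conn["ip"]}: {conn["reason"]} at {conn["stop"]}'
    for conn in body["connections"]
]
for line in summaries:
    print(line)
```

The fallback to `ip` matters in practice: unnamed clients are otherwise anonymous entries (see the note on naming conventions below).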

Watch for disconnect events in real time

The NATS server publishes system events for connection lifecycle changes:

Terminal window
nats sub '$SYS.ACCOUNT.*.DISCONNECT'

This shows disconnect events as they happen, including the disconnect reason. The corresponding $SYS.ACCOUNT.*.CONNECT subject carries connect events.

Categorize stops by reason

If you’re seeing many stopped connections, group them by reason to find the dominant pattern:

Terminal window
curl -s "http://localhost:8222/connz?state=closed&limit=1000" | \
  jq '[.connections[].reason] | group_by(.) | map({reason: .[0], count: length}) | sort_by(-.count)'

A cluster of Slow Consumer reasons points to throughput issues (see SERVER_004). A cluster of Authentication Failure reasons points to credential problems. Mixed reasons at high volume suggest connection churn (see CLUSTER_006).

Check server logs for context

Server logs provide additional context around disconnection events that the API alone may not capture:

Terminal window
# Look for disconnect-related warnings
grep -E "Slow Consumer|Authentication|Authorization|TLS" nats-server.log | tail -20

How to fix it

Immediate: understand the pattern

Review the stop reason recorded for each closed connection. Common reasons and their implications:

  • Slow Consumer - Loss — the client could not keep up with the message rate; messages in the server’s outbound buffer were lost
  • Authentication Failure — invalid or expired credentials presented at connect time
  • Server Shutdown — planned maintenance or lame duck transition
  • Maximum Connections Exceeded — the server or account connection limit was reached

Graceful client shutdowns, planned server restarts, and lame duck transitions are normal operational events. Focus investigation on unexpected stop reasons.

Check for credential issues. If the stop reason is authentication or authorization related, verify the client’s credentials:

Terminal window
# Verify a user credential file
nats server check credential --credential /path/to/user.creds

// Go — connect with explicit error handling for auth failures
nc, err := nats.Connect(url,
	nats.UserCredentials("/path/to/user.creds"),
	nats.ErrorHandler(func(nc *nats.Conn, sub *nats.Subscription, err error) {
		log.Printf("NATS error: %v", err)
	}),
)
if err != nil {
	log.Fatalf("Connection failed: %v", err)
}

Short-term: handle disconnections gracefully

Implement robust reconnection logic with disconnect callbacks. Every production client should handle disconnections explicitly:

// Go — full lifecycle callbacks
nc, err := nats.Connect(url,
	nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
		log.Printf("Disconnected: %v", err)
	}),
	nats.ReconnectHandler(func(nc *nats.Conn) {
		log.Printf("Reconnected to %s", nc.ConnectedUrl())
	}),
	nats.ClosedHandler(func(nc *nats.Conn) {
		log.Printf("Connection closed: %v", nc.LastError())
	}),
	nats.MaxReconnects(-1),            // infinite reconnect attempts
	nats.ReconnectWait(2*time.Second), // wait between attempts
)
# Python (nats.py) — reconnection handlers
async def disconnected_cb():
    print("Disconnected from NATS")

async def reconnected_cb():
    print("Reconnected to NATS")

nc = await nats.connect(
    servers=["nats://localhost:4222"],
    disconnected_cb=disconnected_cb,
    reconnected_cb=reconnected_cb,
    max_reconnect_attempts=-1,
    reconnect_time_wait=2,
)

For slow consumer stops, address the throughput issue. If the dominant stop reason is slow consumer eviction, see the Slow Consumers (SERVER_004) check page for detailed remediation — the fix is in the subscriber’s processing throughput, not in reconnection logic.

Long-term: prevent unexpected disconnections

Use JetStream for data flows where disconnection means data loss. Core NATS subscribers lose all in-flight messages when disconnected. JetStream consumers resume from their last acknowledged position on reconnect, making disconnections a latency event rather than a data loss event.

Synadia Insights tracks every connection stop event with its reason, automatically attributing disconnections to specific accounts, users, and servers across your entire deployment — no manual log parsing required.

Enforce client naming conventions. Every production client should set a descriptive connection name. Without names, stopped connections in connz output are anonymous IP:port entries that require network-level correlation to identify:

nc, _ := nats.Connect(url,
	nats.Name("billing-service-pod-3a7f"),
)

Frequently asked questions

How long does the server keep closed connection records?

The NATS server retains the last 10,000 closed connections by default. This is controlled by the max_closed_clients configuration option. Once the limit is reached, the oldest entries are evicted. For high-connection-count deployments, you may need to increase this or export connection events to an external system for longer-term retention.
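If you need a longer closed-connection history, the limit can be raised in the server configuration (value illustrative):

```
# nats-server.conf
max_closed_clients: 50000  # default is 10000
```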

Is a connection stopped event the same as a connection error?

Not necessarily. A stopped connection has a recorded reason, which may be informational (graceful shutdown) or problematic (slow consumer eviction, auth failure). The stop reason is what distinguishes expected lifecycle events from actual errors. Connections that disappear without a stop record — network partitions, client crashes — show up differently as stale connections (SERVER_012) detected via ping/pong timeout.

How do I distinguish planned disconnections from unexpected ones?

Planned disconnections have predictable reasons: Client Closed for graceful shutdowns, lame duck events during rolling upgrades. Unexpected disconnections have reasons like Slow Consumer, Authentication Failure, or Maximum Connections Exceeded. Filter your monitoring on the unexpected reasons to reduce noise from normal operations.
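That filtering step can be sketched as a small classifier. The reason strings below follow the examples on this page; treat the exact set as an assumption to adapt to your deployment:

```python
# Reasons treated as normal lifecycle events (illustrative set)
EXPECTED_REASONS = {"Client Closed", "Server Shutdown", "Lame Duck Mode"}

def is_unexpected(reason: str) -> bool:
    """True when a stop reason should surface in alerting."""
    return reason not in EXPECTED_REASONS

stops = ["Client Closed", "Slow Consumer", "Authentication Failure", "Client Closed"]
alerts = [r for r in stops if is_unexpected(r)]
print(alerts)  # ['Slow Consumer', 'Authentication Failure']
```

An allowlist of expected reasons is safer than a denylist of bad ones: a reason string you have never seen before surfaces as an alert instead of being silently ignored.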

Can I get notified in real time when a specific client disconnects?

Yes. Subscribe to the system event subjects for connection events. The server publishes disconnect advisories to $SYS.ACCOUNT.<account_id>.DISCONNECT which include the client name, reason, and connection metadata. Your monitoring service can subscribe to these subjects to trigger alerts for specific client names or patterns.

Why do I see connection stopped events during a rolling upgrade?

During a rolling upgrade, each server enters lame duck mode before shutting down. Connected clients receive a lame duck notification and begin reconnecting to other servers. Connections that don’t drain before the lame duck grace period expires are forcibly closed. These stops are expected — the number should match approximately the connection count on the upgraded server, and all clients should successfully reconnect to remaining cluster members.

Proactive monitoring for NATS connection stopped with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial