
NATS Connection Readiness Failure: What It Means and How to Fix It

Severity: Critical
Category: Health
Applies to: Server
Check ID: SERVER_001
Detection threshold: healthz endpoint reports connection readiness failure

A connection readiness failure means a NATS server’s healthz endpoint is reporting that the server cannot accept client connections. The server process may be running, but it is not ready to serve traffic — making it effectively offline from the perspective of every client and orchestration system that depends on it.

Why this matters

Health checks are the foundation of automated infrastructure. Kubernetes liveness and readiness probes, load balancer health checks, and monitoring systems all rely on the healthz endpoint to determine whether a server should receive traffic. When a server reports a connection readiness failure, orchestrators stop routing clients to it. If multiple servers in a cluster fail readiness simultaneously, client capacity drops proportionally — or disappears entirely.

The failure is often silent from the client’s perspective. Clients that were already connected may continue operating normally (existing TCP connections are unaffected), but no new connections can be established. In a rolling restart scenario, this means the server that just restarted never becomes ready, and the next server in the rotation cannot be safely drained. The rolling restart stalls, leaving the cluster in a partially upgraded state.

Connection readiness failures frequently indicate a configuration problem that will not resolve on its own. A port conflict with another process, an expired or malformed TLS certificate, or a filesystem permission issue on the JetStream store directory will persist through restarts until the root cause is addressed. Blindly restarting the server — which is the natural first instinct — often makes things worse by adding restart churn without fixing anything.

Common causes

  • Listener port conflict. Another process (or another NATS instance) is already bound to the configured client port. The server starts, attempts to bind, fails, and reports not ready. This is especially common in containerized environments where port mappings collide or when running multiple NATS instances on the same host.

  • TLS certificate errors. The server cannot load its TLS certificate or key file — the file is missing, the permissions are wrong, the certificate has expired, or the certificate chain is incomplete. TLS errors prevent the listener from starting entirely, so the server never reaches a ready state.

  • Filesystem permission issues. The NATS process cannot write to the JetStream store directory, the PID file location, or the log directory. This happens after OS upgrades that change default permissions, or when running NATS under a different user than the one that owns the data directory.

  • JetStream store recovery failure. The server is attempting to recover JetStream state from disk but encounters corrupt WAL files, missing stream directories, or insufficient disk space. The healthz endpoint reports not ready until JetStream recovery completes, and it cannot complete if the store is damaged.

  • Account resolver timeout. In operator mode with a NATS-based account resolver, the server must resolve account JWTs before it can accept connections for those accounts. If the resolver is unreachable — because the account server is down, or the server’s own cluster connectivity isn’t established yet — the server remains in a not-ready state.

  • Cluster route establishment failure. The server is configured to join a cluster but cannot establish routes to any peers. Some configurations block readiness until at least one route is connected, especially when connect_retries is exhausted.
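
For reference, the connect_retries setting mentioned above lives in the cluster block of the server configuration. A minimal sketch, with illustrative cluster name and route addresses:

nats-server.conf
cluster {
  name: my-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats-route://nats-1.internal:6222
    nats-route://nats-2.internal:6222
  ]
  # Give up on a discovered route after this many failed connect attempts
  connect_retries: 5
}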

How to diagnose

Check the healthz endpoint directly

Terminal window
# From the server host or a machine with network access
curl http://localhost:8222/healthz

A healthy server returns {"status":"ok"}. A failing server returns a non-200 status code with a reason field describing the specific failure. The reason string is your primary diagnostic signal.
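
If you are scripting this check, curl can print the HTTP status code alongside the body, which is handy since the failure arrives as both a non-200 status and a JSON payload:

Terminal window
# Print the response body followed by the HTTP status code
curl -s -w "\nHTTP status: %{http_code}\n" http://localhost:8222/healthz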

Check server logs for startup errors

Terminal window
# Look for errors in the NATS server log
journalctl -u nats-server --since "10 minutes ago" --no-pager | grep -i "error\|fatal\|fail\|cannot"

The server logs the specific reason it cannot start the listener. Common log patterns:

  • listen tcp :4222: bind: address already in use — port conflict
  • TLS config error — certificate loading failure
  • JetStream store directory not accessible — permission issue

Check for port conflicts

Terminal window
# Find what's using the NATS client port
lsof -i :4222
# Or with ss
ss -tlnp | grep 4222

If another process holds the port, you’ve found the problem.

Verify TLS certificate validity

Terminal window
# Check certificate expiration
openssl x509 -in /path/to/server-cert.pem -noout -dates
# Verify the certificate chain
openssl verify -CAfile /path/to/ca.pem /path/to/server-cert.pem
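
If the chain verifies but the server still refuses to load TLS, also confirm that the certificate and private key belong together; comparing public-key digests is a common way to do this (paths illustrative):

Terminal window
# The two digests must match if the key pairs with the certificate
openssl x509 -in /path/to/server-cert.pem -noout -pubkey | openssl sha256
openssl pkey -in /path/to/server-key.pem -pubout | openssl sha256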

Check JetStream store directory

Terminal window
# Verify the directory exists and the NATS user can write to it
ls -la /path/to/jetstream/store/
# Check disk space
df -h /path/to/jetstream/store/
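
A directory listing can still mislead (ACLs or parent-directory permissions get in the way); the definitive test is to write as the user the server runs under, assumed here to be nats:

Terminal window
# Attempt a real write as the service user, then clean up
sudo -u nats touch /path/to/jetstream/store/.writetest
sudo -u nats rm /path/to/jetstream/store/.writetest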

Inspect the server’s overall health via NATS CLI

If the server is partially running (monitoring port up but client port down), first try the client port directly:

Terminal window
nats server info --server nats://localhost:4222

If the client port is down, use the monitoring endpoint:

Terminal window
curl http://localhost:8222/varz | jq '.health, .start'

How to fix it

Immediate: identify and resolve the blocking condition

Port conflict resolution:

Terminal window
# Find and stop the conflicting process
lsof -i :4222
kill <conflicting_pid>
# Then restart NATS
systemctl restart nats-server
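
If the conflicting process is one you need to keep, move NATS to a free port instead of killing it. A minimal config change (port number illustrative), remembering that clients must be updated to match:

nats-server.conf
# Client listener port; any free port works
port: 4223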

TLS certificate fix:

Terminal window
# If the certificate expired, replace it and reload
# (reload doesn't require a full restart)
nats-server --signal reload

For automated certificate rotation, use a tool like cert-manager (Kubernetes) or a cron job that renews certificates and sends a reload signal.
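
As a sketch of the cron approach, assuming certbot manages the certificate: the deploy hook runs only when a renewal actually occurs, so the reload signal fires exactly when needed.

Terminal window
# /etc/cron.d/nats-cert-renew
0 3 * * * root certbot renew --deploy-hook "nats-server --signal reload"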

Permission fix:

Terminal window
# Fix ownership on the JetStream store directory
chown -R nats:nats /path/to/jetstream/store/
chmod 750 /path/to/jetstream/store/

Short-term: add pre-flight validation

Write a pre-start script that validates the environment before launching the NATS server. This catches problems before they become health check failures:

preflight_check.go
package main

import (
    "crypto/tls"
    "fmt"
    "net"
    "os"
)

func main() {
    // Check port availability
    ln, err := net.Listen("tcp", ":4222")
    if err != nil {
        fmt.Fprintf(os.Stderr, "port 4222 not available: %v\n", err)
        os.Exit(1)
    }
    ln.Close()

    // Check TLS cert loading
    certFile := os.Getenv("NATS_TLS_CERT")
    keyFile := os.Getenv("NATS_TLS_KEY")
    if certFile != "" && keyFile != "" {
        _, err := tls.LoadX509KeyPair(certFile, keyFile)
        if err != nil {
            fmt.Fprintf(os.Stderr, "TLS cert error: %v\n", err)
            os.Exit(1)
        }
    }

    // Check JetStream store directory
    storeDir := os.Getenv("NATS_JETSTREAM_STORE")
    if storeDir != "" {
        info, err := os.Stat(storeDir)
        if err != nil || !info.IsDir() {
            fmt.Fprintf(os.Stderr, "store dir not accessible: %v\n", err)
            os.Exit(1)
        }
    }

    fmt.Println("pre-flight checks passed")
}
preflight_check.py
import socket
import ssl
import os
import sys

def check_port(port: int) -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("", port))
            return True
        except OSError as e:
            print(f"Port {port} not available: {e}", file=sys.stderr)
            return False

def check_tls(cert_path: str, key_path: str) -> bool:
    try:
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        ctx.load_cert_chain(cert_path, key_path)
        return True
    except OSError as e:  # ssl.SSLError subclasses OSError; also covers missing files
        print(f"TLS cert error: {e}", file=sys.stderr)
        return False

def check_store_dir(path: str) -> bool:
    if not os.path.isdir(path):
        print(f"Store dir not accessible: {path}", file=sys.stderr)
        return False
    if not os.access(path, os.W_OK):
        print(f"Store dir not writable: {path}", file=sys.stderr)
        return False
    return True

if __name__ == "__main__":
    ok = check_port(4222)
    cert = os.environ.get("NATS_TLS_CERT", "")
    key = os.environ.get("NATS_TLS_KEY", "")
    if cert and key:
        ok = check_tls(cert, key) and ok
    store = os.environ.get("NATS_JETSTREAM_STORE", "")
    if store:
        ok = check_store_dir(store) and ok
    sys.exit(0 if ok else 1)
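
To wire either script into startup under systemd, run it as an ExecStartPre step so a failed pre-flight prevents the server from launching at all. A sketch, assuming the check is installed as /usr/local/bin/nats-preflight (path illustrative); run systemctl daemon-reload after adding the drop-in:

preflight.conf
# /etc/systemd/system/nats-server.service.d/preflight.conf
[Service]
ExecStartPre=/usr/local/bin/nats-preflight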

Long-term: automate certificate management and monitoring

Set up certificate expiration alerts. Monitor TLS certificate expiration dates and alert at least 30 days before expiry; in Prometheus, the blackbox exporter’s TLS probe exposes the expiry timestamp you can alert on.
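
A sketch of such an alert rule, assuming the blackbox exporter is already probing the NATS TLS endpoints (probe_ssl_earliest_cert_expiry is the expiry timestamp it exports; rule and alert names illustrative):

cert-expiry-rules.yml
groups:
  - name: nats-tls
    rules:
      - alert: NATSTLSCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "A NATS TLS certificate expires within 30 days"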

Automate healthz monitoring. Configure your orchestrator to use the healthz endpoint as both a liveness and readiness probe:

# Kubernetes deployment snippet
livenessProbe:
  httpGet:
    path: /healthz
    port: 8222
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /healthz?js-enabled-only=true
    port: 8222
  initialDelaySeconds: 5
  periodSeconds: 10
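
Both probes target the monitoring port, which only exists if it is enabled in the server configuration; without it, the probes fail even when the server is healthy:

nats-server.conf
# Enable the HTTP monitoring endpoint that serves /healthz
http_port: 8222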

Synadia Insights evaluates the healthz endpoint every collection epoch and fires this check automatically — but local probes ensure faster remediation by your orchestrator.

Frequently asked questions

Why does my server process start but healthz returns a failure?

The NATS server binary can start and bind to the monitoring port (8222) before the client listener (4222) is ready. If the client listener fails — due to a port conflict, TLS error, or permission issue — the monitoring port continues serving, but healthz reports not ready. Check server logs for the specific listener error.

How is this different from the JetStream Subsystem Unhealthy check?

Connection Readiness Failure (SERVER_001) covers the server’s ability to accept any client connections at all — it’s a fundamental readiness check. JetStream Subsystem Unhealthy (SERVER_014) specifically flags JetStream-related health failures like meta leader contact loss or stream recovery issues. A server can pass SERVER_001 (accepting connections) while failing SERVER_014 (JetStream not ready), or fail both simultaneously if JetStream recovery is blocking overall readiness.

Should I restart the server when this check fires?

Not immediately. Restarting without identifying the root cause often produces the same failure again, adding unnecessary churn. First diagnose using the healthz response reason and server logs. Fix the underlying issue — port conflict, certificate, permissions — then restart if needed. Many issues (like TLS certificate replacement) can be resolved with a config reload (nats-server --signal reload) instead of a full restart.

Can connection readiness failures cascade across a cluster?

Yes. In a rolling restart scenario, if the restarted server never becomes ready, the cluster is operating with reduced capacity. If the orchestrator proceeds to restart the next server anyway (because the first is technically “running”), you can end up with multiple servers in a not-ready state simultaneously. Always gate rolling restarts on readiness — don’t restart the next server until the previous one passes healthz.

How do I monitor healthz in Prometheus?

Use the NATS Prometheus exporter or a blackbox probe against the healthz endpoint.
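
A minimal blackbox exporter module for the probe approach, assuming everything else is left at defaults; it simply requires a 200 from /healthz (module name illustrative):

blackbox.yml
modules:
  nats_healthz:
    prober: http
    http:
      valid_status_codes: [200]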

Proactive monitoring for NATS connection readiness failure with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial
Cancel anytime.