A connection readiness failure means a NATS server’s healthz endpoint is reporting that the server cannot accept client connections. The server process may be running, but it is not ready to serve traffic — making it effectively offline from the perspective of every client and orchestration system that depends on it.
Health checks are the foundation of automated infrastructure. Kubernetes liveness and readiness probes, load balancer health checks, and monitoring systems all rely on the healthz endpoint to determine whether a server should receive traffic. When a server reports a connection readiness failure, orchestrators stop routing clients to it. If multiple servers in a cluster fail readiness simultaneously, client capacity drops proportionally — or disappears entirely.
The failure is often silent from the client’s perspective. Clients that were already connected may continue operating normally (existing TCP connections are unaffected), but no new connections can be established. In a rolling restart scenario, this means the server that just restarted never becomes ready, and the next server in the rotation cannot be safely drained. The rolling restart stalls, leaving the cluster in a partially-upgraded state.
Connection readiness failures frequently indicate a configuration problem that will not resolve on its own. A port conflict with another process, an expired or malformed TLS certificate, or a filesystem permission issue on the JetStream store directory will persist through restarts until the root cause is addressed. Blindly restarting the server — which is the natural first instinct — often makes things worse by adding restart churn without fixing anything.
Listener port conflict. Another process (or another NATS instance) is already bound to the configured client port. The server starts, attempts to bind, fails, and reports not ready. This is especially common in containerized environments where port mappings collide or when running multiple NATS instances on the same host.
TLS certificate errors. The server cannot load its TLS certificate or key file — the file is missing, the permissions are wrong, the certificate has expired, or the certificate chain is incomplete. TLS errors prevent the listener from starting entirely, so the server never reaches a ready state.
Filesystem permission issues. The NATS process cannot write to the JetStream store directory, the PID file location, or the log directory. This happens after OS upgrades that change default permissions, or when running NATS under a different user than the one that owns the data directory.
JetStream store recovery failure. The server is attempting to recover JetStream state from disk but encounters corrupt WAL files, missing stream directories, or insufficient disk space. The healthz endpoint reports not ready until JetStream recovery completes, and it cannot complete if the store is damaged.
Account resolver timeout. In operator mode with a NATS-based account resolver, the server must resolve account JWTs before it can accept connections for those accounts. If the resolver is unreachable — because the account server is down, or the server’s own cluster connectivity isn’t established yet — the server remains in a not-ready state.
Cluster route establishment failure. The server is configured to join a cluster but cannot establish routes to any peers. Some configurations block readiness until at least one route is connected, especially when connect_retries is exhausted.
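The last two causes are configuration-driven. A minimal config sketch, assuming operator mode with the built-in full resolver and a three-node cluster (every path, name, and URL below is illustrative):

```
# nats-server.conf (sketch, not a complete configuration)
operator: /etc/nats/operator.jwt

# NATS-based account resolver: account JWTs must resolve from this
# directory before the server accepts connections for those accounts
resolver {
    type: full
    dir: /data/resolver
}

cluster {
    name: my-cluster
    port: 6222
    routes: [
        nats://nats-1:6222
        nats://nats-2:6222
    ]
    # Retry budget for re-establishing routes; readiness can stall
    # once this is exhausted
    connect_retries: 30
}
```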
```
# From the server host or a machine with network access
curl http://localhost:8222/healthz
```

A healthy server returns `{"status":"ok"}`. A failing server returns a non-200 status code with a reason field describing the specific failure. The reason string is your primary diagnostic signal.
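The exact body shape varies by server version, but a not-ready response looks roughly like this (the error text here is hypothetical):

```
{"status":"unavailable","error":"JetStream is not ready"}
```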
```
# Look for errors in the NATS server log
journalctl -u nats-server --since "10 minutes ago" --no-pager | grep -i "error\|fatal\|fail\|cannot"
```

The server logs the specific reason it cannot start the listener. Common log patterns:
- `listen tcp :4222: bind: address already in use` — port conflict
- `TLS config error` — certificate loading failure
- `JetStream store directory not accessible` — permission issue

```
# Find what's using the NATS client port
lsof -i :4222

# Or with ss
ss -tlnp | grep 4222
```

If another process holds the port, you've found the problem.
```
# Check certificate expiration
openssl x509 -in /path/to/server-cert.pem -noout -dates

# Verify the certificate chain
openssl verify -CAfile /path/to/ca.pem /path/to/server-cert.pem
```

```
# Verify the directory exists and the NATS user can write to it
ls -la /path/to/jetstream/store/

# Check disk space
df -h /path/to/jetstream/store/
```

If the server is partially running (monitoring port accessible but the client port is not):
```
nats server info --server nats://localhost:4222
```

If the client port is down, use the monitoring endpoint:
```
curl http://localhost:8222/varz | jq '.health, .start'
```

Port conflict resolution:
```
# Find and stop the conflicting process
lsof -i :4222
kill <conflicting_pid>

# Then restart NATS
systemctl restart nats-server
```

TLS certificate fix:
```
# If the certificate expired, replace it and reload
# (reload doesn't require a full restart)
nats-server --signal reload
```

For automated certificate rotation, use a tool like cert-manager (Kubernetes) or a cron job that renews certificates and sends a reload signal.
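Outside Kubernetes, the cron approach can be as small as one line. A sketch assuming certbot manages the certificate (the schedule and flags are illustrative):

```
# Weekly renewal attempt; on actual renewal, tell NATS to reload
0 3 * * 0 certbot renew --quiet --deploy-hook "nats-server --signal reload"
```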
Permission fix:
```
# Fix ownership on the JetStream store directory
chown -R nats:nats /path/to/jetstream/store/
chmod 750 /path/to/jetstream/store/
```

Write a pre-start script that validates the environment before launching the NATS server. This catches problems before they become health check failures:
```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"os"
)

func main() {
	// Check port availability
	ln, err := net.Listen("tcp", ":4222")
	if err != nil {
		fmt.Fprintf(os.Stderr, "port 4222 not available: %v\n", err)
		os.Exit(1)
	}
	ln.Close()

	// Check TLS cert loading
	certFile := os.Getenv("NATS_TLS_CERT")
	keyFile := os.Getenv("NATS_TLS_KEY")
	if certFile != "" && keyFile != "" {
		_, err := tls.LoadX509KeyPair(certFile, keyFile)
		if err != nil {
			fmt.Fprintf(os.Stderr, "TLS cert error: %v\n", err)
			os.Exit(1)
		}
	}

	// Check JetStream store directory
	storeDir := os.Getenv("NATS_JETSTREAM_STORE")
	if storeDir != "" {
		info, err := os.Stat(storeDir)
		if err != nil || !info.IsDir() {
			fmt.Fprintf(os.Stderr, "store dir not accessible: %v\n", err)
			os.Exit(1)
		}
	}

	fmt.Println("pre-flight checks passed")
}
```

The same checks in Python:

```python
import socket
import ssl
import os
import sys

def check_port(port: int) -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("", port))
            return True
        except OSError as e:
            print(f"Port {port} not available: {e}", file=sys.stderr)
            return False

def check_tls(cert_path: str, key_path: str) -> bool:
    try:
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        ctx.load_cert_chain(cert_path, key_path)
        return True
    except (ssl.SSLError, OSError) as e:
        # OSError also covers a missing cert or key file
        print(f"TLS cert error: {e}", file=sys.stderr)
        return False

def check_store_dir(path: str) -> bool:
    if not os.path.isdir(path):
        print(f"Store dir not accessible: {path}", file=sys.stderr)
        return False
    if not os.access(path, os.W_OK):
        print(f"Store dir not writable: {path}", file=sys.stderr)
        return False
    return True

if __name__ == "__main__":
    ok = check_port(4222)
    cert = os.environ.get("NATS_TLS_CERT", "")
    key = os.environ.get("NATS_TLS_KEY", "")
    if cert and key:
        ok = check_tls(cert, key) and ok
    store = os.environ.get("NATS_JETSTREAM_STORE", "")
    if store:
        ok = check_store_dir(store) and ok
    sys.exit(0 if ok else 1)
```

Set up certificate expiration alerts. Monitor TLS certificate expiration dates and alert at least 30 days before expiry. In Prometheus, the blackbox exporter can probe the TLS endpoint and expose the certificate's expiry time as a metric.
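With the blackbox exporter in place, a single alert rule covers the 30-day window. A sketch (the rule name and threshold are illustrative; `probe_ssl_earliest_cert_expiry` is the exporter's standard certificate-expiry metric):

```yaml
groups:
  - name: nats-tls
    rules:
      - alert: NatsTlsCertExpiringSoon
        # Fire when the probed certificate expires within 30 days
        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "NATS TLS certificate expires in under 30 days"
```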
Automate healthz monitoring. Configure your orchestrator to use the healthz endpoint as both a liveness and readiness probe:
```yaml
# Kubernetes deployment snippet
livenessProbe:
  httpGet:
    path: /healthz
    port: 8222
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /healthz?js-enabled-only=true
    port: 8222
  initialDelaySeconds: 5
  periodSeconds: 10
```

Synadia Insights evaluates the healthz endpoint every collection epoch and fires this check automatically — but local probes ensure faster remediation by your orchestrator.
The NATS server binary can start and bind to the monitoring port (8222) before the client listener (4222) is ready. If the client listener fails — due to a port conflict, TLS error, or permission issue — the monitoring port continues serving, but healthz reports not ready. Check server logs for the specific listener error.
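You can observe this split state directly from the host. A quick check, assuming `nc` is available (any TCP connect test works):

```
# Monitoring port still answers, and healthz explains why it's not ready
curl -s http://localhost:8222/healthz

# Client port: "connection refused" means the listener never started
nc -zv localhost 4222
```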
Connection Readiness Failure (SERVER_001) covers the server’s ability to accept any client connections at all — it’s a fundamental readiness check. JetStream Subsystem Unhealthy (SERVER_014) specifically flags JetStream-related health failures like meta leader contact loss or stream recovery issues. A server can pass SERVER_001 (accepting connections) while failing SERVER_014 (JetStream not ready), or fail both simultaneously if JetStream recovery is blocking overall readiness.
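To see the two checks' perspectives side by side, compare plain healthz with the JetStream-only variant used in the probe configuration above:

```
# Overall readiness (what SERVER_001 reflects)
curl -s "http://localhost:8222/healthz"

# JetStream-specific readiness (closer to what SERVER_014 reflects)
curl -s "http://localhost:8222/healthz?js-enabled-only=true"
```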
Not immediately. Restarting without identifying the root cause often produces the same failure again, adding unnecessary churn. First diagnose using the healthz response reason and server logs. Fix the underlying issue — port conflict, certificate, permissions — then restart if needed. Many issues (like TLS certificate replacement) can be resolved with a config reload (nats-server --signal reload) instead of a full restart.
Yes. In a rolling restart scenario, if the restarted server never becomes ready, the cluster is operating with reduced capacity. If the orchestrator proceeds to restart the next server anyway (because the first is technically “running”), you can end up with multiple servers in a not-ready state simultaneously. Always gate rolling restarts on readiness — don’t restart the next server until the previous one passes healthz.
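A sketch of that gate in shell, assuming SSH access to each node (hostnames are illustrative; add a timeout before using this in production):

```
# Restart one server at a time; block until it passes healthz
for host in nats-1 nats-2 nats-3; do
    ssh "$host" sudo systemctl restart nats-server
    until curl -fsS "http://$host:8222/healthz" > /dev/null; do
        sleep 5
    done
done
```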
Use the NATS Prometheus exporter or a blackbox probe against the healthz endpoint.
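A sketch of the blackbox approach, assuming a standard blackbox exporter deployment (job name, targets, and exporter address are all illustrative); alert on `probe_success == 0`:

```yaml
- job_name: nats-healthz
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - http://nats-1:8222/healthz
        - http://nats-2:8222/healthz
  relabel_configs:
    # Standard blackbox indirection: probe the target via the exporter
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:9115
```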