A NATS server restart is detected when the server’s start time changes between consecutive collection epochs. This indicates the server process was stopped and started again — either through a planned maintenance action or an unexpected crash. Restarts that come with a server version change are treated as planned upgrades and excluded from this check, so any restart Insights surfaces here is one that happened without a version bump and deserves investigation.
A server restart is not a minor event. When a NATS server stops, every client connected to it disconnects simultaneously. Even with automatic reconnect enabled in client libraries (the default), there’s a window where messages can be lost. Core NATS subscribers miss every message published while they’re disconnected. JetStream consumers pause delivery until they reconnect and resume. Request-reply chains break mid-flight, causing timeouts in upstream services.
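Client libraries expose that window through connection event callbacks. A minimal Go sketch (the cluster URLs and retry settings are placeholders) that makes disconnects and reconnects visible in service logs:

```go
// Sketch: make client disconnect and reconnect windows visible (Go client).
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(
		"nats://nats-1:4222,nats://nats-2:4222", // placeholder cluster URLs
		nats.MaxReconnects(-1),            // retry forever instead of giving up
		nats.ReconnectWait(2*time.Second), // pause between reconnect attempts
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("disconnected: %v", err) // fires when the server drops the connection
		}),
		nats.ReconnectHandler(func(c *nats.Conn) {
			log.Printf("reconnected to %s", c.ConnectedUrl()) // fires after landing on a healthy server
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	select {} // keep the process alive so the handlers can fire
}
```

Logging these events per service makes it straightforward to correlate application-side timeouts with a specific server restart.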
In a clustered deployment, the impact extends beyond the restarted server. Route connections between cluster members drop and re-establish. JetStream Raft groups that had replicas on the restarted server lose a voter — if the cluster was already running with a degraded replica count, the restart can cause quorum loss. The meta cluster may need to re-elect a leader. Stream leaders on the restarted server fail over to other nodes, causing a brief pause in writes. When the server comes back, it needs to catch up on Raft state, which generates additional replication traffic.
The critical severity reflects the fact that an unexplained restart is potentially the beginning of a larger problem. A server that restarted once due to OOM will restart again when memory pressure returns. A server that crashed due to a corrupt JetStream store may crash loop until the corruption is resolved. Catching the first restart and investigating it is how you prevent the next one.
Planned maintenance or upgrade. The most benign cause — an operator intentionally restarted the server to apply a configuration change that requires a restart, or to upgrade to a new NATS server version. In well-run environments, planned restarts use lame duck mode to drain connections gracefully before stopping.
OOM kill by the operating system. The server consumed more memory than the OS allows (cgroup limit in containers, system memory on bare metal). The OS kernel OOM killer terminates the process immediately — no graceful shutdown, no client notification. If the server is managed by systemd or Kubernetes, it restarts automatically, and the only evidence is a changed start time.
Disk full causing JetStream panic. When JetStream can’t write to its storage directory (write-ahead logs, snapshots, stream data), it can crash the server process. The server may restart successfully if disk space was freed in the interim, but will crash again if the underlying storage issue isn’t resolved.
Bad configuration change. A configuration reload introduced an error that the server accepted but that causes a crash under certain conditions — for example, an invalid TLS certificate path that only fails when a new client connects, or a JetStream store path that doesn’t exist.
Infrastructure event. The host machine rebooted (kernel update, hardware failure, cloud provider maintenance), the container was rescheduled (Kubernetes pod eviction, node drain), or the VM was live-migrated.
Unhandled runtime error. A bug in the NATS server or an edge case in the deployment triggers a panic. These are rare in stable releases but can occur with pre-release versions or unusual configurations.
```bash
# Show server start time and uptime
nats server info

# List all servers with uptime — recently restarted servers have short uptime
nats server list
```

Compare the start time against your maintenance schedule. If no maintenance was planned, this is an unexpected restart.
```bash
curl -s http://localhost:8222/varz | jq '{start: .start, uptime: .uptime, config_load_time: .config_load_time}'
```

The start field shows the exact timestamp. The config_load_time tells you if a config reload happened after the restart — useful for distinguishing “restart for config change” from “crash and auto-restart.”
The logs before the restart explain why it stopped. The logs after explain how it recovered:
```bash
# systemd-managed server
journalctl -u nats-server --since "2 hours ago" --no-pager

# Log file
tail -1000 /var/log/nats/nats-server.log
```

Look for:

- Initiating Shutdown — graceful stop (planned)
- Shutting down with signal info — operator-initiated
- FATAL or panic — crash
- OOM in system logs — killed by kernel
- Lame Duck Mode — graceful drain before stop

```bash
# Check for OOM kill events (Linux)
dmesg | grep -i "oom.*nats"
journalctl -k | grep -i "oom"

# System uptime — if this is also short, the host rebooted
uptime

# Last system boot time
who -b
```

If the host rebooted, the server restart is a consequence, not the root cause. Investigate the host reboot separately.
```bash
# Confirm the server is healthy after restart
curl -s http://localhost:8222/healthz

# Check JetStream caught up
nats server report jetstream

# Verify cluster routes are re-established
nats server list
```

After detecting a restart, confirm the server is back to healthy operation:

```bash
# Confirm healthy
nats server check connection

# Verify cluster membership
nats server list

# Check JetStream Raft groups recovered
nats server report jetstream
```

If the server is not healthy after restart, it may be heading toward a crash loop. Address the underlying issue before the next crash.
If OOM killed, increase memory limits or reduce memory usage:
```go
// Go client — monitor server stats to catch memory pressure early
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect as a system-account user so $SYS requests are permitted.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := nc.Request("$SYS.REQ.SERVER.PING", nil, time.Second)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("Server response: %s", string(resp.Data))
}
```

```python
# Python — request server info via system subject
import asyncio
import nats

async def check_server():
    nc = await nats.connect("nats://localhost:4222")
    resp = await nc.request("$SYS.REQ.SERVER.PING", b"", timeout=1)
    print(f"Server info: {resp.data.decode()}")
    await nc.close()

asyncio.run(check_server())
```

If disk full, add storage capacity and set JetStream retention limits to prevent unbounded growth:
```bash
# Check current JetStream storage usage
nats server report jetstream

# Add retention limits to streams without them
nats stream edit <stream_name> --max-bytes 10GB --max-age 72h
```

If config-related, validate configs before applying:

```bash
# Test configuration before applying
nats-server --config /etc/nats/nats-server.conf -t
```

Use lame duck mode for planned restarts. Lame duck mode tells the server to stop accepting new connections and wait for existing connections to migrate to other servers before shutting down:
```bash
# Initiate graceful drain before stopping
nats-server --signal ldm

# Wait for connections to drain, then stop
nats-server --signal quit
```

Set up restart alerting. Insights detects restarts automatically by comparing start_time across epochs. For custom alerting, monitor the start field from /varz.
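A minimal custom-alerting sketch in Go along those lines, assuming the default monitoring URL and a 30-second poll interval; the log line stands in for whatever alerting action you use:

```go
// Sketch: poll /varz and alert when the server's start time changes.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// varz captures only the field this check cares about.
type varz struct {
	Start time.Time `json:"start"`
}

func fetchStart(url string) (time.Time, error) {
	resp, err := http.Get(url)
	if err != nil {
		return time.Time{}, err
	}
	defer resp.Body.Close()
	var v varz
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		return time.Time{}, err
	}
	return v.Start, nil
}

func main() {
	const url = "http://localhost:8222/varz" // assumed monitoring endpoint
	last, _ := fetchStart(url)
	for range time.Tick(30 * time.Second) { // one poll per collection epoch
		cur, err := fetchStart(url)
		if err != nil {
			log.Printf("varz unreachable: %v", err) // server may be down or mid-restart
			continue
		}
		if !last.IsZero() && !cur.Equal(last) {
			log.Printf("ALERT: server restarted, start changed from %s to %s", last, cur)
		}
		last = cur
	}
}
```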
Automate post-restart validation. After any restart (planned or not), automatically verify cluster health, Raft convergence, and JetStream stream health before routing production traffic back to the server.
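A sketch of that gate, assuming the default monitoring port and a five-minute budget: it blocks until /healthz reports ok, after which deeper checks such as nats server report jetstream can run.

```go
// Sketch: wait for /healthz to report ok before running post-restart checks.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// healthz mirrors the status field of the NATS /healthz response.
type healthz struct {
	Status string `json:"status"`
}

func main() {
	const url = "http://localhost:8222/healthz" // assumed monitoring endpoint
	deadline := time.Now().Add(5 * time.Minute)

	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			var h healthz
			decodeErr := json.NewDecoder(resp.Body).Decode(&h)
			resp.Body.Close()
			if decodeErr == nil && resp.StatusCode == http.StatusOK && h.Status == "ok" {
				fmt.Println("server healthy; continue with cluster and JetStream checks")
				os.Exit(0)
			}
		}
		// /healthz can report errors while Raft groups re-establish quorum, so keep polling.
		time.Sleep(5 * time.Second)
	}
	fmt.Println("server did not become healthy in time; investigate before routing traffic")
	os.Exit(1)
}
```

Running this as a post-start hook (for example a systemd ExecStartPost step or a deployment gate in your orchestrator) keeps production traffic off the server until it is actually healthy.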
No. Planned restarts for upgrades, configuration changes, and maintenance are expected. The check fires because unplanned restarts look identical from a monitoring perspective — the start time changes either way. The purpose is to ensure every restart is accounted for. If it was planned, acknowledge it and move on. If it wasn’t, investigate immediately.
Most NATS client libraries have automatic reconnect enabled by default. The client detects the connection drop, buffers outgoing messages (up to a configured limit), and attempts to reconnect to the same or another server in its connection URL list. However, automatic reconnect doesn’t prevent message loss for core NATS subscribers — messages published while the client is disconnected are gone. JetStream consumers can resume from their last acknowledged position after reconnecting.
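That buffer limit is configurable on the client. A short Go sketch, with an arbitrary 16 MB example size rather than a recommendation:

```go
// Sketch: cap how much the client buffers while disconnected (Go client).
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(
		nats.DefaultURL,
		nats.ReconnectBufSize(16*1024*1024), // example: queue up to 16 MB of publishes during an outage
		nats.MaxReconnects(-1),              // keep retrying rather than erroring out
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// While disconnected, Publish writes into this buffer and the data is flushed
	// after reconnect; once the buffer is full, Publish returns an error.
	if err := nc.Publish("example.subject", []byte("payload")); err != nil {
		log.Printf("publish failed: %v", err)
	}
}
```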
Use lame duck mode (nats-server --signal ldm). The server stops accepting new connections and advertises to clients that they should reconnect to other servers. Once connections drain, stop the server with nats-server --signal quit. For JetStream streams with replication (R3), the remaining replicas continue serving reads and writes during the restart. For R1 streams, there is no redundancy — writes will fail during the restart window.
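Clients can also observe the lame duck announcement directly. A Go sketch using the client's lame duck callback (cluster URLs are placeholders):

```go
// Sketch: observe a server's lame duck announcement from the client (Go client).
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(
		"nats://nats-1:4222,nats://nats-2:4222", // placeholder cluster URLs
		nats.LameDuckModeHandler(func(c *nats.Conn) {
			// The connected server entered lame duck mode: it stops accepting new
			// connections and will shut down soon, so expect a reconnect shortly.
			log.Printf("server %s entering lame duck mode", c.ConnectedUrl())
		}),
		nats.ReconnectHandler(func(c *nats.Conn) {
			log.Printf("now connected to %s", c.ConnectedUrl())
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	select {} // keep the process alive so the handlers can fire
}
```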
It depends on the amount of JetStream state. A server with a few streams recovers in seconds. A server with thousands of streams and consumers, or large Raft state, can take minutes to fully catch up. During recovery, the health check (/healthz) may report errors as Raft groups re-establish quorum. Monitor nats server report jetstream to track when all streams return to their expected replica count and leadership is stable.