
NATS Server Restarted: What It Means and How to Fix It

Severity: Critical
Category: Change
Applies to: Server
Check ID: SERVER_008
Detection threshold: start_time changed between collection epochs (excludes restarts accompanied by a version change — those are treated as planned upgrades)

A NATS server restart is detected when the server’s start time changes between consecutive collection epochs. This indicates the server process was stopped and started again — either through a planned maintenance action or an unexpected crash. Restarts that come with a server version change are treated as planned upgrades and excluded from this check, so any restart Insights surfaces here is one that happened without a version bump and deserves investigation.
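
A minimal sketch of the detection rule in Go, assuming two snapshots of the /varz start and version fields captured on consecutive collection epochs (the timestamps and version below are example values): flag a restart only when the start time changed and the version did not.

package main

import "fmt"

// serverSnapshot holds the two /varz fields the rule compares across epochs
type serverSnapshot struct {
	StartTime string // "start" from /varz
	Version   string // "version" from /varz
}

// unexplainedRestart reports a restart that was not accompanied by a version change
func unexplainedRestart(prev, cur serverSnapshot) bool {
	return prev.StartTime != cur.StartTime && prev.Version == cur.Version
}

func main() {
	prev := serverSnapshot{StartTime: "2024-05-01T10:00:00Z", Version: "2.10.14"}
	cur := serverSnapshot{StartTime: "2024-05-01T14:32:07Z", Version: "2.10.14"}
	fmt.Println(unexplainedRestart(prev, cur)) // true: same version, new start time
}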

Why this matters

A server restart is not a minor event. When a NATS server stops, every client connected to it disconnects simultaneously. Even with automatic reconnect enabled in client libraries (the default), there’s a window where messages can be lost. Core NATS subscribers miss every message published while they’re disconnected. JetStream consumers pause delivery until they reconnect and resume. Request-reply chains break mid-flight, causing timeouts in upstream services.
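
As a rough illustration with the Go client (the connection URL, wait time, and buffer size are placeholder values), the options below make that disconnect window visible: reconnect is on by default, and the handlers log exactly when the gap opened and closed.

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222",
		nats.MaxReconnects(-1),             // keep trying indefinitely
		nats.ReconnectWait(2*time.Second),  // pause between attempts
		nats.ReconnectBufSize(8*1024*1024), // buffer up to 8MB of publishes while disconnected
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("disconnected: %v", err)
		}),
		nats.ReconnectHandler(func(c *nats.Conn) {
			log.Printf("reconnected to %s", c.ConnectedUrl())
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
	// application subscriptions and publishes go here; the handlers above fire
	// whenever the connected server drops and the client moves to another one
}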

In a clustered deployment, the impact extends beyond the restarted server. Route connections between cluster members drop and re-establish. JetStream Raft groups that had replicas on the restarted server lose a voter — if the cluster was already running with a degraded replica count, the restart can cause quorum loss. The meta cluster may need to re-elect a leader. Stream leaders on the restarted server fail over to other nodes, causing a brief pause in writes. When the server comes back, it needs to catch up on Raft state, which generates additional replication traffic.

The critical severity reflects the fact that an unexplained restart is potentially the beginning of a larger problem. A server that restarted once due to OOM will restart again when memory pressure returns. A server that crashed due to a corrupt JetStream store may crash loop until the corruption is resolved. Catching the first restart and investigating it is how you prevent the next one.

Common causes

  • Planned maintenance or upgrade. The most benign cause — an operator intentionally restarted the server to apply a configuration change that requires a restart, or to upgrade to a new NATS server version. In well-run environments, planned restarts use lame duck mode to drain connections gracefully before stopping.

  • OOM kill by the operating system. The server consumed more memory than the OS allows (cgroup limit in containers, system memory on bare metal). The OS kernel OOM killer terminates the process immediately — no graceful shutdown, no client notification. If the server is managed by systemd or Kubernetes, it restarts automatically, and the only evidence is a changed start time.

  • Disk full causing JetStream panic. When JetStream can’t write to its storage directory (write-ahead logs, snapshots, stream data), it can crash the server process. The server may restart successfully if disk space was freed in the interim, but will crash again if the underlying storage issue isn’t resolved.

  • Bad configuration change. A configuration reload introduced an error that the server accepted but that causes a crash under certain conditions — for example, an invalid TLS certificate path that only fails when a new client connects, or a JetStream store path that doesn’t exist.

  • Infrastructure event. The host machine rebooted (kernel update, hardware failure, cloud provider maintenance), the container was rescheduled (Kubernetes pod eviction, node drain), or the VM was live-migrated.

  • Unhandled runtime error. A bug in the NATS server or an edge case in the deployment triggers a panic. These are rare in stable releases but can occur with pre-release versions or unusual configurations.

How to diagnose

Confirm the restart and check timing

Terminal window
# Show server start time and uptime
nats server info
# List all servers with uptime — recently restarted servers have short uptime
nats server list

Compare the start time against your maintenance schedule. If no maintenance was planned, this is an unexpected restart.

Check the monitoring endpoint for uptime

Terminal window
curl -s http://localhost:8222/varz | jq '{start: .start, uptime: .uptime, config_load_time: .config_load_time}'

The start field shows the exact timestamp. The config_load_time tells you if a config reload happened after the restart — useful for distinguishing “restart for config change” from “crash and auto-restart.”

Investigate server logs

The logs before the restart explain why it stopped. The logs after explain how it recovered:

Terminal window
# systemd-managed server
journalctl -u nats-server --since "2 hours ago" --no-pager
# Log file
tail -1000 /var/log/nats/nats-server.log

Look for:

  • Initiating Shutdown — graceful stop (planned)
  • Shutting down with signal info — operator-initiated
  • FATAL or panic — crash
  • OOM in system logs — killed by kernel
  • Lame Duck Mode — graceful drain before stop
Terminal window
# Check for OOM kill events (Linux)
dmesg | grep -i "oom.*nats"
journalctl -k | grep -i "oom"

Check if it was a host-level event

Terminal window
# System uptime — if this is also short, the host rebooted
uptime
# Last system boot time
who -b

If the host rebooted, the server restart is a consequence, not the root cause. Investigate the host reboot separately.

Verify server health post-restart

Terminal window
# Confirm the server is healthy after restart
curl -s http://localhost:8222/healthz
# Check JetStream caught up
nats server report jetstream
# Verify cluster routes are re-established
nats server list

How to fix it

Immediate: verify recovery

After detecting a restart, confirm the server is back to healthy operation:

Terminal window
# Confirm healthy
nats server check connection
# Verify cluster membership
nats server list
# Check JetStream Raft groups recovered
nats server report jetstream

If the server is not healthy after restart, it may be heading toward a crash loop. Address the underlying issue before the next crash.

Short-term: prevent unexpected restarts

If OOM killed, increase memory limits or reduce memory usage:

// Go client — monitor server stats to catch memory pressure early
// (note: the $SYS.REQ.> subjects are only visible to a system-account connection)
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
	// The PING response includes server info such as memory usage
	resp, err := nc.Request("$SYS.REQ.SERVER.PING", nil, time.Second)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("Server response: %s", string(resp.Data))
}

# Python — request server info via system subject
# (run this on a system-account connection; $SYS.REQ.> is not visible otherwise)
import asyncio
import nats

async def check_server():
    nc = await nats.connect("nats://localhost:4222")
    resp = await nc.request("$SYS.REQ.SERVER.PING", b"", timeout=1)
    print(f"Server info: {resp.data.decode()}")
    await nc.close()

asyncio.run(check_server())

If disk full, add storage capacity and set JetStream retention limits to prevent unbounded growth:

Terminal window
# Check current JetStream storage usage
nats server report jetstream
# Add retention limits to streams without them
nats stream edit <stream_name> --max-bytes 10GB --max-age 72h
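
If you manage streams programmatically, the same limits can be applied with the nats.go JetStream management API. This is a sketch under that assumption; the stream name "ORDERS" and the connection URL are placeholders, and the limits mirror the command above.

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	// Fetch the current config, then apply retention limits ("ORDERS" is a placeholder)
	info, err := js.StreamInfo("ORDERS")
	if err != nil {
		log.Fatal(err)
	}
	cfg := info.Config
	cfg.MaxBytes = 10 * 1024 * 1024 * 1024 // 10GB
	cfg.MaxAge = 72 * time.Hour
	if _, err := js.UpdateStream(&cfg); err != nil {
		log.Fatal(err)
	}
	log.Println("retention limits applied")
}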

If config-related, validate configs before applying:

Terminal window
# Test configuration before applying
nats-server --config /etc/nats/nats-server.conf -t

Long-term: graceful operations

Use lame duck mode for planned restarts. Lame duck mode tells the server to stop accepting new connections and wait for existing connections to migrate to other servers before shutting down:

Terminal window
# Initiate graceful drain before stopping
nats-server --signal ldm
# Wait for connections to drain, then stop
nats-server --signal quit
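
On the client side, the Go library can surface the lame duck notification so applications know a planned restart is coming. This is a sketch assuming a recent nats.go version that provides the LameDuckModeHandler option; the connection URL is a placeholder.

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222",
		// Fires when the connected server enters lame duck mode, before it stops
		// accepting work: a good moment to log or shift work to another server
		nats.LameDuckModeHandler(func(c *nats.Conn) {
			log.Printf("server %s entered lame duck mode", c.ConnectedUrl())
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
	// block so the handler can fire while the server drains in a demo run
	select {}
}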

Set up restart alerting. Insights detects restarts automatically by comparing start_time across epochs. For custom alerting, monitor the start field from /varz.
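
A sketch of such custom alerting, assuming the monitoring endpoint is reachable at localhost:8222 and a 30-second poll interval: record the start field and alert whenever it changes.

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// fetchStart returns the "start" field from the /varz monitoring endpoint
func fetchStart(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var v struct {
		Start string `json:"start"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		return "", err
	}
	return v.Start, nil
}

func main() {
	const varz = "http://localhost:8222/varz"
	last, err := fetchStart(varz)
	if err != nil {
		log.Fatal(err)
	}
	for range time.Tick(30 * time.Second) {
		cur, err := fetchStart(varz)
		if err != nil {
			log.Printf("varz unreachable: %v", err) // may itself be a restart in progress
			continue
		}
		if cur != last {
			log.Printf("ALERT: server restarted (start changed %s -> %s)", last, cur)
			last = cur
		}
	}
}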

Automate post-restart validation. After any restart (planned or not), automatically verify cluster health, Raft convergence, and JetStream stream health before routing production traffic back to the server.
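
A minimal sketch of that gate, assuming /healthz on localhost:8222 and an arbitrary five-minute deadline: keep polling until the server reports healthy, and only then put it back in rotation.

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	deadline := time.Now().Add(5 * time.Minute)
	for time.Now().Before(deadline) {
		// /healthz returns 200 only once the server considers itself healthy
		resp, err := http.Get("http://localhost:8222/healthz")
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			log.Println("server healthy: safe to route traffic back")
			return
		}
		if err == nil {
			resp.Body.Close()
		}
		time.Sleep(5 * time.Second)
	}
	log.Fatal("server did not become healthy within 5 minutes")
}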

Frequently asked questions

Is a server restart always a problem?

No. Planned restarts for upgrades, configuration changes, and maintenance are expected. The check fires because unplanned restarts look identical from a monitoring perspective — the start time changes either way. The purpose is to ensure every restart is accounted for. If it was planned, acknowledge it and move on. If it wasn’t, investigate immediately.

Do clients automatically reconnect after a server restart?

Most NATS client libraries have automatic reconnect enabled by default. The client detects the connection drop, buffers outgoing messages (up to a configured limit), and attempts to reconnect to the same or another server in its connection URL list. However, automatic reconnect doesn’t prevent message loss for core NATS subscribers — messages published while the client is disconnected are gone. JetStream consumers can resume from their last acknowledged position after reconnecting.

How do I restart a NATS server without losing messages?

Use lame duck mode (nats-server --signal ldm). The server stops accepting new connections and advertises to clients that they should reconnect to other servers. Once connections drain, stop the server with nats-server --signal quit. For JetStream streams with replication (R3), the remaining replicas continue serving reads and writes during the restart. For R1 streams, there is no redundancy — writes will fail during the restart window.

How long does JetStream recovery take after a restart?

It depends on the amount of JetStream state. A server with a few streams recovers in seconds. A server with thousands of streams and consumers, or large Raft state, can take minutes to fully catch up. During recovery, the health check (/healthz) may report errors as Raft groups re-establish quorum. Monitor nats server report jetstream to track when all streams return to their expected replica count and leadership is stable.

Proactive monitoring for NATS server restarts with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial