A server version mismatch means one or more NATS servers in your cluster are running a different software version than the majority. This is expected during a rolling upgrade and a problem when it’s unintentional — indicating an incomplete upgrade, a forgotten server, or a misconfigured deployment pipeline.
NATS servers in a cluster communicate via internal routes and coordinate through the Raft consensus protocol for JetStream operations. While NATS maintains strong backward compatibility between minor versions, running a mixed-version cluster introduces risks that grow with the version gap and the time the mismatch persists.
Feature availability becomes inconsistent. A new configuration option, protocol enhancement, or bug fix applied to upgraded servers doesn’t exist on the older ones. If a client or configuration depends on a feature introduced in the newer version, it works on some servers and fails on others — depending on which server the client connects to or which server is the Raft leader. These intermittent, topology-dependent failures are among the hardest to debug.
Bug fixes and security patches are the more urgent concern. If you upgraded three of five servers to address a CVE or a data-handling bug, the two remaining servers are still vulnerable. In a clustered environment, a vulnerability on any server is a vulnerability for the cluster — route connections between servers are trusted. Similarly, a bug that causes incorrect Raft behavior on the older version can affect the entire consensus group, even if the leader is running the patched version.
Rolling upgrades are a routine part of NATS operations, so a temporary mismatch is normal. The problem is when “temporary” becomes “permanent” — when the last server in the upgrade sequence gets forgotten, when a different deployment pipeline manages some nodes, or when a server was rebuilt from an older image. Synadia Insights flags version mismatches so they don’t silently persist.
**Incomplete rolling upgrade.** The most common cause. An operator upgraded most servers but didn't finish — interrupted by another task, a weekend, or an error on the remaining servers. The cluster runs in a mixed state indefinitely.

**Forgotten server in the cluster.** A server deployed months ago, perhaps in a different availability zone or managed by a different team, wasn't included in the upgrade plan. It continues running the old version unnoticed.

**Separate deployment pipelines.** Different servers are managed by different automation (Terraform, Ansible, Kubernetes operators) or different teams. One pipeline was updated, the other wasn't. Common in organizations where infrastructure grew organically.

**Staging server accidentally joined production.** A development or staging server with a different version was configured with production cluster routes. It joined the cluster and now reports a mismatched version.

**Rollback on some nodes.** An upgrade was attempted; some servers hit issues and were rolled back, while others stayed on the new version. The rollback was meant to be temporary but became permanent.
```shell
nats server list
```

This shows every server in the cluster with its version, cluster name, and connection count. Look for servers reporting a different version than the majority.
```shell
nats server info <server-name>
```

This shows the full version string, Go runtime version, git commit, and configuration details for a specific server.
```shell
curl -s http://localhost:8222/varz | jq '{server_name: .server_name, version: .version, go: .go, git_commit: .git_commit}'
```

The /varz endpoint on each server reports its version. This is useful for automated checking across all servers.
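As a sketch of that automated check, the script below polls each server's /varz endpoint and compares the reported versions. The host names are hypothetical, and the default monitoring port 8222 is assumed; adapt both to your topology:

```python
# Sketch: compare versions across servers via their HTTP monitoring ports.
# The host list is hypothetical; adjust hosts and port (8222 is the default).
import json
from urllib.request import urlopen

MONITOR_URLS = [
    "http://nats-1:8222/varz",
    "http://nats-2:8222/varz",
    "http://nats-3:8222/varz",
]

def collect_versions():
    versions = {}
    for url in MONITOR_URLS:
        with urlopen(url, timeout=2) as resp:
            varz = json.load(resp)
        versions[varz["server_name"]] = varz["version"]
    return versions

if __name__ == "__main__":
    versions = collect_versions()
    for name, ver in sorted(versions.items()):
        print(f"{name}\t{ver}")
    if len(set(versions.values())) > 1:
        print("WARNING: version mismatch detected")
```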
```shell
# Get versions from all servers in one pass
nats server list --json | jq -r '.[] | "\(.name)\t\(.ver)"' | sort -k2
```

Group by version to see the split — the majority version is the target, and any outliers need upgrading (or the majority needs rolling back if the outlier is correct).
```shell
# Check if servers have been recently restarted
nats server list --json | jq '.[] | {name: .name, version: .ver, uptime: .uptime}'
```

Servers with recent restarts (short uptime) that are on the newer version indicate an active rolling upgrade. Servers with long uptime on the old version indicate a stalled or forgotten upgrade.
**Determine the version gap.** Patch-level differences (2.10.22 vs 2.10.24) are low risk — they're typically bug fixes with full wire compatibility. Minor-release differences (2.9.x vs 2.10.x) carry higher risk due to potential protocol and feature changes.
```shell
# Check the version difference
nats server list --json | jq '[.[].ver] | unique'
```

If the gap is minor and the cluster is functioning normally, the urgency is low — but complete the upgrade at the next maintenance window.
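If you want that classification in automation, here is a minimal sketch. The helper names and sample version strings are illustrative, not part of the NATS tooling:

```python
# Sketch: classify the gap between two nats-server version strings.
def parse(ver: str):
    # Drop any pre-release suffix (e.g. "-RC.1"), then split into ints
    major, minor, patch = ver.split("-")[0].split(".")[:3]
    return int(major), int(minor), int(patch)

def gap_severity(a: str, b: str) -> str:
    (maj_a, min_a, _), (maj_b, min_b, _) = parse(a), parse(b)
    if maj_a != maj_b:
        return "major release gap: highest risk, upgrade incrementally"
    if min_a != min_b:
        return "minor release gap: higher risk, protocol/feature changes possible"
    return "patch gap: low risk, finish at the next maintenance window"

print(gap_severity("2.10.22", "2.10.24"))  # patch gap
print(gap_severity("2.9.25", "2.10.24"))   # minor release gap
```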
**Use lame duck mode for safe upgrades.** Lame duck mode gracefully drains client connections and migrates Raft leadership before the server shuts down:
```shell
# Signal the server to enter lame duck mode
nats-server --signal ldm=<pid>

# Wait for connections to drain (check with server list)
nats server list

# Stop the server
systemctl stop nats-server

# Upgrade the binary
# (package manager, container image pull, binary replacement)

# Start the new version
systemctl start nats-server

# Verify it rejoined with the correct version
nats server list
```

Upgrade one server at a time. Wait for each server to fully rejoin the cluster and for all its Raft groups to catch up before proceeding to the next:
```shell
# After starting the upgraded server, verify Raft health
nats server report jetstream
```

All Raft groups should show the upgraded server as current before moving to the next server.
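One way to enforce that gate in an upgrade script is to poll `nats server list --json` until the upgraded server reports the target version. A minimal sketch, assuming the `nats` CLI is installed with a system-account context, and using the same `.name`/`.ver` fields as the jq queries above:

```python
# Sketch: wait until a named server rejoins and reports the target version.
# Assumes the `nats` CLI is on the PATH with a system-account context.
import json
import subprocess
import time

def wait_for_version(server_name: str, target: str, timeout: int = 300) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(
            ["nats", "server", "list", "--json"],
            capture_output=True, text=True, check=True,
        ).stdout
        for entry in json.loads(out):
            if entry.get("name") == server_name and entry.get("ver") == target:
                return True
        time.sleep(5)
    return False

if __name__ == "__main__":
    # Hypothetical server name and target version
    if wait_for_version("nats-2", "2.10.24"):
        print("server rejoined on the target version; proceed to the next node")
    else:
        print("timed out; investigate before continuing the rolling upgrade")
```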
**Verify client reconnection.** After each server upgrade, confirm clients reconnected successfully:
```shell
nats server report connections
```

**Automate server upgrades.** Use configuration management (Ansible, Puppet) or container orchestration (Kubernetes with the NATS Helm chart) to ensure all servers are deployed from the same version source:
```go
// Go: programmatically check server versions (requires a system-account connection)
nc, _ := nats.Connect(url)
// A single Request returns only the first server's reply; to collect every
// server's version, use an inbox subscription as in the Python example below.
resp, _ := nc.Request("$SYS.REQ.SERVER.PING", nil, 2*time.Second)
// Parse resp.Data for version info ("server" -> "ver" in the JSON payload)
_ = resp
```

```python
# Python: monitor version consistency (requires system-account credentials)
import asyncio
import json

import nats
from nats.errors import TimeoutError as NatsTimeoutError

async def check_versions():
    nc = await nats.connect()
    # Collect one reply per server on a private inbox
    inbox = nc.new_inbox()
    sub = await nc.subscribe(inbox)
    await nc.publish("$SYS.REQ.SERVER.PING", b"", reply=inbox)
    versions = set()
    try:
        while True:
            msg = await sub.next_msg(timeout=2)
            info = json.loads(msg.data)
            versions.add(info.get("server", {}).get("ver", "unknown"))
    except NatsTimeoutError:
        pass  # no more replies within the timeout window
    if len(versions) > 1:
        print(f"WARNING: mixed versions detected: {versions}")
    await nc.close()

if __name__ == "__main__":
    asyncio.run(check_versions())
```

**Maintain a server inventory.** Track which servers exist, their expected version, and their deployment pipeline. This prevents the "forgotten server" scenario.
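A sketch of that inventory check; the `EXPECTED` mapping is hypothetical and would come from your deployment repo or CMDB:

```python
# Sketch: compare a maintained inventory against the live cluster.
import json
import subprocess

# Hypothetical inventory: server name -> expected version
EXPECTED = {
    "nats-1": "2.10.24",
    "nats-2": "2.10.24",
    "nats-3": "2.10.24",
}

out = subprocess.run(
    ["nats", "server", "list", "--json"],
    capture_output=True, text=True, check=True,
).stdout
live = {e.get("name"): e.get("ver") for e in json.loads(out)}

for name in EXPECTED.keys() - live.keys():
    print(f"MISSING: {name} is in the inventory but not in the cluster")
for name in live.keys() - EXPECTED.keys():
    print(f"UNKNOWN: {name} joined the cluster but is not in the inventory")
for name in EXPECTED.keys() & live.keys():
    if EXPECTED[name] != live[name]:
        print(f"MISMATCH: {name} runs {live[name]}, expected {EXPECTED[name]}")
```

The "UNKNOWN" case also catches the staging-server-joined-production scenario described above.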
**Set a maximum mismatch duration.** Establish a policy (e.g., "all servers must be on the same version within 24 hours of starting an upgrade") and alert if the window is exceeded.
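A minimal sketch of that policy check, meant to run on a schedule. Here `versions` is the set of versions gathered by any of the collection methods above, and the 24-hour window is the example policy:

```python
# Sketch: alert when a version mismatch persists beyond a policy window.
# Call check_mismatch_duration() on a schedule with the current version set.
import time

MAX_MISMATCH_SECONDS = 24 * 60 * 60  # example policy: finish within 24 hours
_first_seen = None                   # when the current mismatch was first observed

def check_mismatch_duration(versions):
    global _first_seen
    if len(versions) <= 1:
        _first_seen = None  # cluster is consistent again; reset the clock
        return
    if _first_seen is None:
        _first_seen = time.time()
    elif time.time() - _first_seen > MAX_MISMATCH_SECONDS:
        print(f"ALERT: mixed versions {versions} have persisted past the policy window")
```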
**Pin versions in deployment manifests.** Whether you use Docker images, systemd unit files, or package management, pin the `nats-server` version explicitly rather than using `latest` or unversioned references.
Synadia Insights detects version mismatches automatically every collection epoch, distinguishing between active rolling upgrades (recent restarts) and stale mismatches that need attention.
**How long can the cluster safely run with mixed versions?**

NATS is designed for rolling upgrades, and mixed-version clusters function correctly for the duration of the upgrade. There's no hard time limit, but best practice is to complete the upgrade within a single maintenance window — typically minutes to hours, not days. The risk isn't immediate failure; it's the accumulation of inconsistent behavior and the possibility of forgetting to finish.
**Can a mixed-version cluster corrupt data or break replication?**

Not under normal circumstances. NATS maintains wire protocol compatibility across minor versions, and Raft replication works correctly between different versions within the same major release. However, running very old versions alongside new ones (spanning multiple major releases) is untested territory and not recommended. Always upgrade incrementally.
**Should I upgrade all servers at once instead?**

No. Simultaneous upgrades cause a full cluster outage. Rolling upgrades with lame duck mode maintain cluster availability throughout — clients reconnect to remaining servers while each node is upgraded. The brief version mismatch during a rolling upgrade is far less risky than downtime.
**What should I do if one server fails to upgrade?**

Investigate the failure before proceeding. Common causes include incompatible configuration options (the new version may deprecate or rename settings), insufficient disk space for the new binary, or permission issues. Check the server logs after starting the new version. Don't leave the cluster in a mixed state while troubleshooting — either fix the issue or roll the problem server back to the old version.
**Does the `nats` CLI version need to match the server version?**

No. The `nats` CLI is designed to work with a range of server versions. You can use a newer CLI with older servers and vice versa. However, some CLI features may not be available if the server doesn't support the underlying API. Keep the CLI reasonably current for the best experience.