A server version mismatch means one or more NATS servers in your cluster are running a different software version than the majority. This is expected during a rolling upgrade and a problem when it’s unintentional — indicating an incomplete upgrade, a forgotten server, or a misconfigured deployment pipeline.
NATS servers in a cluster communicate via internal routes and coordinate through the Raft consensus protocol for JetStream operations. While NATS maintains strong backward compatibility between minor versions, running a mixed-version cluster introduces risks that grow with the version gap and the time the mismatch persists.
Feature availability becomes inconsistent. A new configuration option, protocol enhancement, or bug fix applied to upgraded servers doesn’t exist on the older ones. If a client or configuration depends on a feature introduced in the newer version, it works on some servers and fails on others — depending on which server the client connects to or which server is the Raft leader. These intermittent, topology-dependent failures are among the hardest to debug.
Bug fixes and security patches are the more urgent concern. If you upgraded three of five servers to address a CVE or a data-handling bug, the two remaining servers are still vulnerable. In a clustered environment, a vulnerability on any server is a vulnerability for the cluster — route connections between servers are trusted. Similarly, a bug that causes incorrect Raft behavior on the older version can affect the entire consensus group, even if the leader is running the patched version.
Rolling upgrades are a routine part of NATS operations, so a temporary mismatch is normal. The problem is when “temporary” becomes “permanent” — when the last server in the upgrade sequence gets forgotten, when a different deployment pipeline manages some nodes, or when a server was rebuilt from an older image. Synadia Insights flags version mismatches so they don’t silently persist.
**Incomplete rolling upgrade.** The most common cause. An operator upgraded most servers but didn't finish — interrupted by another task, a weekend, or an error on the remaining servers. The cluster runs in a mixed state indefinitely.

**Forgotten server in the cluster.** A server deployed months ago, perhaps in a different availability zone or managed by a different team, wasn't included in the upgrade plan. It continues running the old version unnoticed.

**Separate deployment pipelines.** Different servers are managed by different automation (Terraform, Ansible, Kubernetes operators) or different teams. One pipeline was updated, the other wasn't. Common in organizations where infrastructure grew organically.

**Staging server accidentally joined production.** A development or staging server with a different version was configured with production cluster routes. It joined the cluster and now reports a mismatched version.

**Rollback on some nodes.** An upgrade was attempted; some servers hit issues and were rolled back, while others stayed on the new version. The rollback was meant to be temporary but became permanent.
```shell
nats server list
```

This shows every server in the cluster with its version, cluster name, and connection count. Look for servers reporting a different version than the majority.
```shell
nats server info <server-name>
```

This shows the full version string, Go runtime version, git commit, and configuration details for a specific server.
```shell
curl -s http://localhost:8222/varz | jq '{server_name: .server_name, version: .version, go: .go, git_commit: .git_commit}'
```

The /varz endpoint on each server reports its version. This is useful for automated checking across all servers.
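As a sketch of that automated check, the script below polls each server's /varz endpoint and compares the reported versions. The host names are hypothetical, and the default monitoring port 8222 is assumed; adapt both to your topology:

```python
# Sketch: compare versions across servers via their HTTP monitoring ports.
# The host list is hypothetical; adjust hosts and port (8222 is the default).
import json
from urllib.request import urlopen

MONITOR_URLS = [
    "http://nats-1:8222/varz",
    "http://nats-2:8222/varz",
    "http://nats-3:8222/varz",
]

def collect_versions():
    versions = {}
    for url in MONITOR_URLS:
        with urlopen(url, timeout=2) as resp:
            varz = json.load(resp)
        versions[varz["server_name"]] = varz["version"]
    return versions

if __name__ == "__main__":
    versions = collect_versions()
    for name, ver in sorted(versions.items()):
        print(f"{name}\t{ver}")
    if len(set(versions.values())) > 1:
        print("WARNING: version mismatch detected")
```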
```shell
# Get versions from all servers in one pass
nats server list --json | jq -r '.[] | "\(.name)\t\(.ver)"' | sort -k2
```

Group by version to see the split — the majority version is the target, and any outliers need upgrading (or the majority needs rolling back if the outlier is correct).
```shell
# Check if servers have been recently restarted
nats server list --json | jq '.[] | {name: .name, version: .ver, uptime: .uptime}'
```

Servers with recent restarts (short uptime) that are on the newer version indicate an active rolling upgrade. Servers with long uptime on the old version indicate a stalled or forgotten upgrade.
**Determine the version gap.** Patch-level differences (2.10.22 vs 2.10.24) are low risk — they're typically bug fixes with full wire compatibility. Minor-release differences (2.9.x vs 2.10.x) carry higher risk due to potential protocol and feature changes.
```shell
# Check the version difference
nats server list --json | jq '[.[].ver] | unique'
```

If the gap is minor and the cluster is functioning normally, the urgency is low — but complete the upgrade at the next maintenance window.
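If you want that classification in automation, here is a minimal sketch. The helper names and sample version strings are illustrative, not part of the NATS tooling:

```python
# Sketch: classify the gap between two nats-server version strings.
def parse(ver: str):
    # Drop any pre-release suffix (e.g. "-RC.1"), then split into ints
    major, minor, patch = ver.split("-")[0].split(".")[:3]
    return int(major), int(minor), int(patch)

def gap_severity(a: str, b: str) -> str:
    (maj_a, min_a, _), (maj_b, min_b, _) = parse(a), parse(b)
    if maj_a != maj_b:
        return "major release gap: highest risk, upgrade incrementally"
    if min_a != min_b:
        return "minor release gap: higher risk, protocol/feature changes possible"
    return "patch gap: low risk, finish at the next maintenance window"

print(gap_severity("2.10.22", "2.10.24"))  # patch gap
print(gap_severity("2.9.25", "2.10.24"))   # minor release gap
```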
**Use lame duck mode for safe upgrades.** Lame duck mode gracefully drains client connections and migrates Raft leadership before the server shuts down:
```shell
# Signal the server to enter lame duck mode
nats-server --signal ldm=<pid>

# Wait for connections to drain (check with server list)
nats server list

# Stop the server
systemctl stop nats-server

# Upgrade the binary
# (package manager, container image pull, binary replacement)

# Start the new version
systemctl start nats-server

# Verify it rejoined with the correct version
nats server list
```

Upgrade one server at a time. Wait for each server to fully rejoin the cluster and for all its Raft groups to catch up before proceeding to the next:
```shell
# After starting the upgraded server, verify Raft health
nats server report jetstream
```

All Raft groups should show the upgraded server as current before moving to the next server.
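One way to enforce that gate in an upgrade script is to poll `nats server list --json` until the upgraded server reports the target version. A minimal sketch, assuming the `nats` CLI is installed with a system-account context, and using the same `.name`/`.ver` fields as the jq queries above:

```python
# Sketch: wait until a named server rejoins and reports the target version.
# Assumes the `nats` CLI is on the PATH with a system-account context.
import json
import subprocess
import time

def wait_for_version(server_name: str, target: str, timeout: int = 300) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(
            ["nats", "server", "list", "--json"],
            capture_output=True, text=True, check=True,
        ).stdout
        for entry in json.loads(out):
            if entry.get("name") == server_name and entry.get("ver") == target:
                return True
        time.sleep(5)
    return False

if __name__ == "__main__":
    # Hypothetical server name and target version
    if wait_for_version("nats-2", "2.10.24"):
        print("server rejoined on the target version; proceed to the next node")
    else:
        print("timed out; investigate before continuing the rolling upgrade")
```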
**Verify client reconnection.** After each server upgrade, confirm clients reconnected successfully:
```shell
nats server report connections
```

**Automate server upgrades.** Use configuration management (Ansible, Puppet) or container orchestration (Kubernetes with the NATS Helm chart) to ensure all servers are deployed from the same version source:
```go
// Go: programmatically check server versions (requires a system-account connection)
nc, _ := nats.Connect(url)
// A single Request returns only the first server's reply; to collect every
// server's version, use an inbox subscription as in the Python example below.
resp, _ := nc.Request("$SYS.REQ.SERVER.PING", nil, 2*time.Second)
// Parse resp.Data for version info ("server" -> "ver" in the JSON payload)
_ = resp
```

```python
# Python: monitor version consistency (requires system-account credentials)
import asyncio
import json

import nats
from nats.errors import TimeoutError as NatsTimeoutError

async def check_versions():
    nc = await nats.connect()
    # Collect one reply per server on a private inbox
    inbox = nc.new_inbox()
    sub = await nc.subscribe(inbox)
    await nc.publish("$SYS.REQ.SERVER.PING", b"", reply=inbox)
    versions = set()
    try:
        while True:
            msg = await sub.next_msg(timeout=2)
            info = json.loads(msg.data)
            versions.add(info.get("server", {}).get("ver", "unknown"))
    except NatsTimeoutError:
        pass  # no more replies within the timeout window
    if len(versions) > 1:
        print(f"WARNING: mixed versions detected: {versions}")
    await nc.close()

if __name__ == "__main__":
    asyncio.run(check_versions())
```

**Maintain a server inventory.** Track which servers exist, their expected version, and their deployment pipeline. This prevents the "forgotten server" scenario.
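A sketch of that inventory check; the `EXPECTED` mapping is hypothetical and would come from your deployment repo or CMDB:

```python
# Sketch: compare a maintained inventory against the live cluster.
import json
import subprocess

# Hypothetical inventory: server name -> expected version
EXPECTED = {
    "nats-1": "2.10.24",
    "nats-2": "2.10.24",
    "nats-3": "2.10.24",
}

out = subprocess.run(
    ["nats", "server", "list", "--json"],
    capture_output=True, text=True, check=True,
).stdout
live = {e.get("name"): e.get("ver") for e in json.loads(out)}

for name in EXPECTED.keys() - live.keys():
    print(f"MISSING: {name} is in the inventory but not in the cluster")
for name in live.keys() - EXPECTED.keys():
    print(f"UNKNOWN: {name} joined the cluster but is not in the inventory")
for name in EXPECTED.keys() & live.keys():
    if EXPECTED[name] != live[name]:
        print(f"MISMATCH: {name} runs {live[name]}, expected {EXPECTED[name]}")
```

The "UNKNOWN" case also catches the staging-server-joined-production scenario described above.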
**Set a maximum mismatch duration.** Establish a policy (e.g., "all servers must be on the same version within 24 hours of starting an upgrade") and alert if the window is exceeded.
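A minimal sketch of that policy check, meant to run on a schedule. Here `versions` is the set of versions gathered by any of the collection methods above, and the 24-hour window is the example policy:

```python
# Sketch: alert when a version mismatch persists beyond a policy window.
# Call check_mismatch_duration() on a schedule with the current version set.
import time

MAX_MISMATCH_SECONDS = 24 * 60 * 60  # example policy: finish within 24 hours
_first_seen = None                   # when the current mismatch was first observed

def check_mismatch_duration(versions):
    global _first_seen
    if len(versions) <= 1:
        _first_seen = None  # cluster is consistent again; reset the clock
        return
    if _first_seen is None:
        _first_seen = time.time()
    elif time.time() - _first_seen > MAX_MISMATCH_SECONDS:
        print(f"ALERT: mixed versions {versions} have persisted past the policy window")
```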
**Pin versions in deployment manifests.** Whether you use Docker images, systemd unit files, or package management, pin the `nats-server` version explicitly rather than using `latest` or unversioned references.
Synadia Insights detects version mismatches automatically every collection epoch, distinguishing between active rolling upgrades (recent restarts) and stale mismatches that need attention.
**How long can the cluster safely run with mixed versions?**

NATS is designed for rolling upgrades, and mixed-version clusters function correctly for the duration of the upgrade. There's no hard time limit, but best practice is to complete the upgrade within a single maintenance window — typically minutes to hours, not days. The risk isn't immediate failure; it's the accumulation of inconsistent behavior and the possibility of forgetting to finish.
**Can a mixed-version cluster corrupt data or break replication?**

Not under normal circumstances. NATS maintains wire protocol compatibility across minor versions, and Raft replication works correctly between different versions within the same major release. However, running very old versions alongside new ones (spanning multiple major releases) is untested territory and not recommended. Always upgrade incrementally.
**Should I upgrade all servers at once instead?**

No. Simultaneous upgrades cause a full cluster outage. Rolling upgrades with lame duck mode maintain cluster availability throughout — clients reconnect to remaining servers while each node is upgraded. The brief version mismatch during a rolling upgrade is far less risky than downtime.
**What should I do if one server fails to upgrade?**

Investigate the failure before proceeding. Common causes include incompatible configuration options (the new version may deprecate or rename settings), insufficient disk space for the new binary, or permission issues. Check the server logs after starting the new version. Don't leave the cluster in a mixed state while troubleshooting — either fix the issue or roll the problem server back to the old version.
**Does the `nats` CLI version need to match the server version?**

No. The `nats` CLI is designed to work with a range of server versions. You can use a newer CLI with older servers and vice versa. However, some CLI features may not be available if the server doesn't support the underlying API. Keep the CLI reasonably current for the best experience.