NATS Gateway Config Mismatch: What It Means and How to Fix It

Severity: Warning
Category: Consistency
Applies to: Cluster
Check ID: CLUSTER_008
Detection threshold: a server's gateway connection set differs from the cluster majority

A gateway config mismatch means one or more servers in a NATS cluster have a different set of gateway connections than the cluster majority. With asymmetric gateways, some servers can reach remote clusters that others can’t, which causes inconsistent message delivery and partial routing failures that are extremely difficult to debug. If reject_unknown_cluster is enabled, unlisted gateways are rejected outright. When it is disabled (the default), gateways can be discovered implicitly via gossip, but explicit configuration is still recommended for consistency.

Why this matters

Gateways are NATS’s mechanism for connecting independent clusters into a supercluster. Each server in a cluster maintains gateway connections to servers in remote clusters, forming inter-cluster communication links. When a client publishes a message on one cluster, gateways forward it to remote clusters where there’s matching subscription interest. This only works if every server in the cluster has the same view of which remote clusters exist.

When gateway configuration is inconsistent, message delivery becomes non-deterministic. A client publishing on server A (which has gateways to clusters X and Y) reaches subscribers in both remote clusters. The same client publishing the same message on server B (which only has a gateway to cluster X) only reaches subscribers in cluster X. From the client’s perspective, the behavior depends on which server they happen to be connected to — a condition that can change with every reconnection. This is the worst kind of production bug: intermittent, hard to reproduce, and invisible unless you’re specifically comparing message delivery across different source servers.
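To make the asymmetry concrete, the sketch below models the two servers from this example as plain shell functions. The server names, cluster names, and both helper functions are hypothetical and only mirror the scenario described here; no NATS server is contacted.

```shell
# Hypothetical model: two servers in one cluster with different gateway sets.

# Print the remote clusters a given local server has gateways to.
gateways_for() {
  case "$1" in
    server-a) echo "cluster-x cluster-y" ;;
    server-b) echo "cluster-x" ;;
    *)        echo "" ;;
  esac
}

# Print "delivered" if a message published on server $1 can be forwarded
# to remote cluster $2, otherwise "dropped".
delivery_result() {
  local server="$1" target="$2" gw
  for gw in $(gateways_for "$server"); do
    if [ "$gw" = "$target" ]; then
      echo "delivered"
      return 0
    fi
  done
  echo "dropped"
}
```

Calling delivery_result server-a cluster-y prints delivered, while delivery_result server-b cluster-y prints dropped: the same message has a different outcome depending only on which server the client is connected to.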

The problem compounds with JetStream. Cross-cluster stream mirroring and sourcing rely on gateways to move data between clusters. If some servers in the source cluster can’t reach the destination cluster (because they’re missing the gateway), mirror lag increases and may stall entirely when the message happens to be published on a server without the gateway connection. The stall is intermittent — it depends on which server receives the publish — making it look like a network issue rather than a configuration problem.

Common causes

  • Configuration management drift. The most common cause. One server’s configuration was updated (to add a new remote cluster or change a gateway port), but the change wasn’t propagated to all servers. This happens frequently in manual deployments and in environments without automated config management.

  • New remote cluster added to only some servers. When a new cluster is added to the supercluster, each existing cluster needs its gateway configuration updated to include the new cluster. If the update is applied to only some servers in a cluster (partial rollout, missed server), those servers can reach the new cluster but their peers can’t.

  • Server deployed from an outdated template. A server replacement or scale-up event provisions a new server from a configuration template that doesn’t include the latest gateway entries. The new server joins the cluster with a stale gateway list.

  • Gateway block missing entirely from one server. One server in the cluster has no gateway configuration block at all, while its peers do. This can happen when a server was originally deployed as a standalone or cluster-only node and later the cluster was connected to a supercluster without updating all members.

  • Port or TLS mismatch on one server. The gateway name and remote list match, but one server uses a different listen port or different TLS settings for gateways. The remote cluster can’t connect back to this server, creating an asymmetric gateway where some servers in the cluster have bidirectional gateway connections and the mismatched server has a unidirectional or failed connection.

How to diagnose

Compare gateway connections across all servers

nats server list

Look at the GWs (gateways) column. All servers in the same cluster should show the same number of gateway connections. If one server shows fewer (or zero), it’s missing gateway configuration.

Inspect gateway details per server

curl -s http://<server-host>:8222/gatewayz | jq '{name, outbound_gateways: [.outbound_gateways | keys[]], inbound_gateways: [.inbound_gateways | keys[]]}'

Run this against every server in the cluster and compare the output. The outbound_gateways list should be identical across all servers — each should list the same set of remote cluster names. Differences identify the mismatch.
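Comparing the output by eye gets tedious beyond a few servers. The following is a minimal sketch of automating the comparison, assuming the same /gatewayz endpoint on port 8222 used above. Both function names are hypothetical. find_gateway_outliers treats the most common set as the reference; if two sets are tied, which one it picks is arbitrary, so a tie should be read as "no majority".

```shell
# Collect one "host gateway-set" line per host. Hosts are passed as arguments;
# an empty or unreadable gateway set is written as "-".
collect_gateway_sets() {
  local host gws
  for host in "$@"; do
    gws=$(curl -s "http://$host:8222/gatewayz" |
      jq -r '[(.outbound_gateways // {}) | keys[]] | sort | join(",")')
    echo "$host ${gws:--}"
  done
}

# Read "host gateway-set" lines on stdin and print every host whose set
# differs from the most common set.
find_gateway_outliers() {
  awk '
    { host[NR] = $1; gws[NR] = $2; count[$2]++ }
    END {
      best = ""; max = 0
      for (s in count) if (count[s] > max) { max = count[s]; best = s }
      for (i = 1; i <= NR; i++)
        if (gws[i] != best) print host[i] ": " gws[i] " (majority: " best ")"
    }'
}
```

A typical invocation would be collect_gateway_sets s1 s2 s3 | find_gateway_outliers, which prints nothing when all servers agree.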

Check for servers with no gateway config

curl -s http://<server-host>:8222/varz | jq '.gateway'

If this returns null, an empty object, or an object without a name, the server has no gateway configuration at all.

Compare full gateway configuration

For a detailed comparison, check each server’s gateway settings:

# Run against each server in the cluster
for host in s1 s2 s3; do
  echo "=== $host ==="
  curl -s "http://$host:8222/gatewayz" | jq '{
    name: .name,
    outbound: [.outbound_gateways | to_entries[] | {name: .key, connected: .value.connection.name}],
    # "add" yields null for an empty list, so fall back to 0
    inbound_count: ([.inbound_gateways | to_entries[] | .value | length] | add // 0)
  }'
done

This shows the gateway name, outbound connections (which remote clusters this server connects to), and inbound connection count for each server. Differences between servers reveal the mismatch.

Check server logs for gateway errors

journalctl -u nats-server --since "1 hour ago" | grep -i "gateway"

Look for errors like Gateway configuration mismatch, connection failures to remote clusters, or TLS errors on gateway ports.

How to fix it

Immediate: update the mismatched server’s config

Identify the server with the different gateway configuration and update it to match the cluster majority:

gateway {
  name: "cluster-east"
  listen: "0.0.0.0:7222"

  gateways = [
    {name: "cluster-west", urls: ["nats://west-s1:7222", "nats://west-s2:7222", "nats://west-s3:7222"]}
    {name: "cluster-eu", urls: ["nats://eu-s1:7222", "nats://eu-s2:7222", "nats://eu-s3:7222"]}
  ]
}

Key requirements:

  • name must be identical on all servers in the cluster
  • The gateways list must include all remote clusters
  • The listen port must be consistent and reachable by remote clusters
  • TLS configuration must match on both sides of every gateway connection

If reject_unknown_cluster is enabled on any cluster, every gateway must be explicitly listed — implicit discovery via gossip is blocked.

After updating the configuration, reload the server:

nats-server --signal reload=<pid>

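Gateway configuration changes are expected to take effect on reload, so a restart should not be required. Check the server log to confirm that the reload was accepted, then verify in /gatewayz that the server has established connections to any newly configured remote clusters.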

Short-term: validate all servers match

After fixing the known mismatch, verify every server in the cluster has identical gateway configuration:

# Quick validation: compare gateway names across servers
for host in s1 s2 s3; do
  echo -n "$host gateways: "
  curl -s "http://$host:8222/gatewayz" | jq -r '[.outbound_gateways | keys[]] | sort | join(", ")'
done

All servers should output the same sorted list of remote cluster names.

Also verify that remote clusters can reach all servers in this cluster:

# From a remote cluster server, check inbound gateways
curl -s http://<remote-server>:8222/gatewayz | jq '.inbound_gateways["cluster-east"] | length'

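Each server opens a single outbound connection per remote cluster, so these inbound connections can be spread across the remote cluster’s servers. Summed over all servers in the remote cluster, the count should match the number of servers in the local cluster. If it’s lower, some local servers have no established gateway connection to the remote cluster (possibly due to a port or TLS mismatch).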
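The count check can be scripted. The sketch below assumes that each local server holds one outbound connection per remote cluster, so inbound connections may be spread over several remote servers; it therefore sums the counts across the remote hosts it is given. The function names, host arguments, and the 8222 monitoring port are placeholders rather than anything prescribed by NATS.

```shell
# Sum inbound gateway connections from one cluster across remote servers.
# Arguments: <local-cluster-name> <remote-host>...
total_inbound_from() {
  local cluster="$1" host n total=0
  shift
  for host in "$@"; do
    n=$(curl -s "http://$host:8222/gatewayz" |
      jq --arg c "$cluster" '(.inbound_gateways[$c] // []) | length')
    total=$((total + ${n:-0}))   # treat an unreachable host as 0
  done
  echo "$total"
}

# Compare the observed total with the number of servers in the local cluster.
# Arguments: <expected-server-count> <observed-total>
check_inbound_count() {
  local expected="$1" actual="$2"
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $actual of $expected servers connected"
  else
    echo "MISMATCH: $actual of $expected servers connected"
    return 1
  fi
}
```

For a three-server local cluster, check_inbound_count 3 "$(total_inbound_from cluster-east west-s1 west-s2 west-s3)" returns non-zero when at least one local server has no connection to the remote cluster.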

Long-term: prevent configuration drift

Use a single source of truth for gateway config. Define the gateway block once in your configuration management system and deploy it identically to all servers:

# Ansible example
- name: Deploy NATS config
  template:
    src: nats-server.conf.j2
    dest: /etc/nats/nats-server.conf
  notify: reload nats-server
  vars:
    gateway_name: "cluster-east"
    remote_gateways:
      - name: "cluster-west"
        urls: ["nats://west-s1:7222", "nats://west-s2:7222"]
      - name: "cluster-eu"
        urls: ["nats://eu-s1:7222", "nats://eu-s2:7222"]
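Where no configuration management system is in place, the same single-source idea can be approximated with a small script. This is only a sketch: render_gateway_block and its "name=url1,url2" argument format are invented for illustration, and the output follows the gateway block layout shown earlier in this document.

```shell
# Render a gateway block from one shared definition.
# Arguments: <local-cluster-name> <listen-address> <remote-spec>...
# Each remote-spec is "name=url1,url2,..." (a format made up for this sketch).
render_gateway_block() {
  local name="$1" listen="$2" spec remote urls
  shift 2
  printf 'gateway {\n  name: "%s"\n  listen: "%s"\n  gateways = [\n' "$name" "$listen"
  for spec in "$@"; do
    remote="${spec%%=*}"
    # Turn url1,url2 into url1", "url2 so it fits inside ["..."].
    urls=$(printf '%s' "${spec#*=}" | sed 's/,/", "/g')
    printf '    {name: "%s", urls: ["%s"]}\n' "$remote" "$urls"
  done
  printf '  ]\n}\n'
}
```

Because the function is deterministic, rendering the block once and copying the result to every server in the cluster gives each server a byte-identical gateway section.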

Include gateway validation in deployment pipelines. Before completing a deployment, verify that all servers report the same gateway connections:

# Post-deployment validation
expected_gateways="cluster-eu, cluster-west"
for host in s1 s2 s3; do
  actual=$(curl -s "http://$host:8222/gatewayz" | jq -r '[.outbound_gateways | keys[]] | sort | join(", ")')
  if [ "$expected_gateways" != "$actual" ]; then
    echo "MISMATCH on $host: expected '$expected_gateways', got '$actual'"
    exit 1
  fi
done

Document the supercluster topology. Maintain a topology document or diagram that lists all clusters and their gateway relationships. When a new cluster is added, the document serves as the checklist of which existing clusters need their gateway configuration updated.
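The checklist can be derived mechanically from the list of clusters. The sketch below assumes a full mesh in which every cluster lists every other cluster as a remote gateway; gateway_checklist is a hypothetical helper, not part of any NATS tooling.

```shell
# For each cluster given as an argument, print the remote clusters that every
# one of its servers should list in its gateways array (all clusters but itself).
gateway_checklist() {
  local cluster remote remotes
  for cluster in "$@"; do
    remotes=""
    for remote in "$@"; do
      if [ "$remote" != "$cluster" ]; then
        remotes="$remotes $remote"
      fi
    done
    echo "$cluster:$remotes"
  done
}
```

Running gateway_checklist cluster-east cluster-west cluster-eu prints one line per cluster, for example cluster-east: cluster-west cluster-eu. Adding a new cluster name to the argument list shows which existing clusters need their configuration updated.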

Frequently asked questions

Does a config reload apply gateway changes, or is a restart required?

A config reload is sufficient for gateway changes. When you send nats-server --signal reload, the server reads the updated configuration and establishes connections to any new remote gateways without disrupting existing connections or client traffic. No restart needed.

Can gateway gossip compensate for a missing gateway config?

Partially. NATS gateways support discovery — when one server in a cluster connects to a remote cluster, it learns about other servers in that remote cluster and shares this information with its cluster peers. However, this only works for discovering additional servers within an already-configured remote cluster. If a server is missing the remote cluster entirely from its gateway config, gossip won’t help — the server has to be explicitly configured to connect to at least one server in the remote cluster.

How does a gateway mismatch differ from a gateway disconnection?

A gateway disconnection (CLUSTER_007) means a previously working gateway connection dropped — the remote cluster became unreachable due to a network or server issue. A gateway config mismatch (CLUSTER_008) means the server was never configured to connect to a remote cluster that its peers are connected to. The disconnection is a runtime failure; the mismatch is a configuration error. Both result in asymmetric routing, but the mismatch won’t self-heal when the network recovers.

What happens to messages that should cross a missing gateway?

They’re silently dropped on the server without the gateway. NATS gateways use interest-based forwarding — if the server has no gateway to the remote cluster, it doesn’t know about subscriptions in that cluster, so it has no reason to forward. The publisher receives no error (core NATS publishes are fire-and-forget). For JetStream, the stream on the local cluster stores the message normally, but mirrors or sources in the unreachable remote cluster don’t receive it until the gateway is established.

Should all servers in a cluster have identical gateway URLs?

The gateway name and the list of remote cluster names must be identical. The specific URLs within each remote gateway entry can vary — NATS uses these as seed URLs to discover the full set of remote servers. In practice, using identical URLs everywhere is simplest and avoids the kind of inconsistency this check detects. If a seed URL is unreachable, the server tries other URLs in the list, so including multiple URLs per remote gateway provides resilience.

Proactive monitoring for NATS gateway config mismatch with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.
