A gateway disconnection means a NATS server lost a gateway connection since the previous epoch — a remote cluster that was connected is now unreachable from the local server. In a supercluster topology, this severs cross-cluster message routing and breaks any workloads that depend on inter-cluster communication. Gateways auto-reconnect with randomized jitter, so transient network issues typically resolve automatically.
Gateways are the backbone of NATS superclusters. Each gateway connection links two clusters, allowing messages, subscriptions, and JetStream operations to flow between them. When a gateway drops, the affected server can no longer route messages to or from the remote cluster. If the disconnection affects all servers in the local cluster, the remote cluster is completely isolated — every cross-cluster publish, request-reply, and JetStream mirror or source stops working.
The blast radius depends on your architecture. If clients in one cluster subscribe to subjects published in another, those subscriptions go dark. Request-reply patterns that span clusters will time out. JetStream streams configured as mirrors or sources of remote streams stop receiving data, and the lag grows every second the gateway is down. For organizations using superclusters for geographic distribution or disaster recovery, a gateway disconnection can mean an entire region loses access to shared data.
Gateway disconnections are also asymmetric. A server in cluster A might lose its gateway to cluster B while other servers in cluster A maintain theirs. In this case, the affected server routes messages through its cluster peers — but at the cost of extra hops and increased latency. If multiple servers lose their gateways simultaneously, the degradation compounds. And because gateway connections carry interest propagation state, a reconnection triggers a full resynchronization of subscription interest, which can cause a brief storm of subscription traffic.
Network partition between clusters. The most common cause. A firewall change, routing table update, or WAN link failure severs connectivity between the gateway ports of the two clusters. Gateways listen on a dedicated port set in the gateway configuration (7522 in the examples here), and that port must be reachable in both directions between clusters.
Firewall or security group change. A rule update blocked the gateway port without anyone realizing it. This is especially common in cloud environments where security groups are managed separately from application configuration, or during infrastructure-as-code rollouts that inadvertently tighten rules.
TLS certificate expiration or mismatch. Gateways enforce TLS when configured. If a certificate expires, is rotated without updating all clusters, or the CA chain doesn’t match between clusters, the TLS handshake fails and the gateway connection drops.
DNS resolution failure. If gateways are configured with hostnames rather than IPs, a DNS outage or stale DNS cache prevents the server from resolving the remote gateway address. The connection drops and reconnection attempts fail until DNS recovers.
Remote cluster is entirely down. If every server in the remote cluster has crashed, restarted, or been drained, there’s nothing for the gateway to connect to. The disconnection is a symptom of a larger outage in the remote cluster.
Gateway configuration removed or changed. A config reload or server restart with an updated configuration that omits a gateway block will drop that gateway connection. This can happen during config management automation that generates server configs from templates.
Check the gateway status on the affected server:

```shell
nats server report gateways
```

This reports the gateway connections held by each server. A missing remote cluster in the output confirms the disconnection.
For detailed gateway state, query the monitoring endpoint directly:

```shell
curl -s http://localhost:8222/gatewayz | jq .
```

The response lists outbound and inbound gateway connections. Look for the remote cluster name: if it’s absent from both outbound_gateways and inbound_gateways, the connection is fully down.
Query multiple servers to determine scope:

```shell
nats server list
```

Then check gateways on each server:

```shell
curl -s http://<server-ip>:8222/gatewayz | jq '.outbound_gateways | keys'
```

If all servers in the cluster are missing the remote gateway, it’s a cluster-wide issue (network or remote cluster down). If only one server is affected, it’s likely a local network or configuration issue.
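The scope check above can be scripted once there are more than a few servers. The following is a minimal sketch, assuming every server exposes its monitoring endpoint over plain HTTP and that `/gatewayz` returns the `outbound_gateways` map queried above. The function names `fetch_outbound` and `classify_scope` are hypothetical helpers, not part of any NATS tooling.

```python
import json
from urllib.request import urlopen


def fetch_outbound(monitor_url: str, timeout: float = 3.0) -> set[str]:
    """Return the remote cluster names listed under outbound_gateways."""
    with urlopen(f"{monitor_url}/gatewayz", timeout=timeout) as resp:
        data = json.load(resp)
    # A missing or null map is treated as "no outbound gateways".
    return set(data.get("outbound_gateways") or {})


def classify_scope(outbound_by_server: dict[str, set[str]], remote: str) -> str:
    """Classify how widely the gateway to `remote` is missing."""
    missing = sorted(s for s, gws in outbound_by_server.items() if remote not in gws)
    if not missing:
        return "connected"
    if len(missing) == len(outbound_by_server):
        return "cluster-wide"
    return "partial: " + ", ".join(missing)
```

A typical call would be `classify_scope({url: fetch_outbound(url) for url in urls}, "cluster-east")`, where `urls` holds the monitoring URLs of the local cluster's servers.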
Server logs record gateway disconnection reasons:
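
```text
[WRN] Gateway connection to "cluster-east" lost
[ERR] Error connecting to gateway "cluster-east": dial tcp 10.0.2.10:7522: connect: connection refused
```

The text after the gateway name usually identifies the cause. In the example above, "connection refused" means the TCP connection to the remote gateway port was actively rejected, for example because no server is listening there.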
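When many log lines need triage, the error text can be mapped to the causes listed earlier. This is a rough sketch: the substrings are common Go network and TLS error messages, and treating them as what the server log will contain is an assumption that may not hold for every server version. The `classify_gateway_error` helper is hypothetical.

```python
# Substrings are common Go net/TLS error texts (an assumption about what the
# server log contains), mapped to the causes discussed in this article.
# Order matters: more specific patterns come before broader ones.
PATTERNS = [
    ("connection refused", "remote gateway port closed or remote server down"),
    ("i/o timeout", "network partition or firewall dropping packets"),
    ("no such host", "DNS resolution failure"),
    ("certificate has expired", "expired TLS certificate"),
    ("x509:", "TLS certificate or CA mismatch"),
    ("tls:", "TLS handshake failure"),
]


def classify_gateway_error(log_line: str) -> str:
    """Return a likely cause for a gateway error line, or 'unknown'."""
    lowered = log_line.lower()
    for needle, cause in PATTERNS:
        if needle in lowered:
            return cause
    return "unknown"
```

The result is only a starting hint. The connectivity and certificate checks still need to confirm it.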
Test the gateway port directly from the affected server:

```shell
# Test TCP connectivity to the remote gateway
nc -zv <remote-gateway-host> 7522

# If DNS-based, verify resolution
dig <remote-gateway-host>
```

If network connectivity looks fine, verify the remote cluster is healthy:

```shell
# Connect to the remote cluster directly
nats server list --server nats://<remote-cluster-host>:4222
```

If the remote cluster is unresponsive, the gateway disconnection is a secondary symptom. Focus on restoring the remote cluster first.
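The DNS and TCP checks above can be combined into one probe that reports which stage failed. This is a minimal sketch using only the Python standard library. The `probe_gateway` name is hypothetical, and the default port 7522 is simply the one used in this article's examples.

```python
import socket


def probe_gateway(host: str, port: int = 7522, timeout: float = 3.0) -> str:
    """Resolve `host`, then attempt a TCP connection to the gateway port."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return f"dns-failure: {exc}"
    last_error = "no addresses"
    # Try every resolved address; one success is enough.
    for family, socktype, proto, _, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return "reachable"
        except OSError as exc:
            last_error = str(exc)
    return f"unreachable: {last_error}"
```

A `reachable` result only shows that the port accepts TCP connections. It does not prove that the TLS handshake or the gateway protocol exchange will succeed.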
Gateway connections auto-reconnect with randomized jitter, so transient network issues resolve automatically. If the disconnection persists, investigate the following causes in order:
Check TLS certificate validity. This is a common cause — especially stale OCSP responses when OCSP stapling is enabled. Verify certificates haven’t expired and OCSP responders are reachable.
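The expiry part of this check can be scripted. The sketch below assumes the certificate's end date has been obtained with `openssl x509 -enddate -noout -in <cert-file>`, which prints a line of the form `notAfter=Jun  1 12:00:00 2030 GMT`. The `days_until_expiry` helper is hypothetical, and it does not cover OCSP.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(enddate_line: str, now: Optional[float] = None) -> float:
    """Days left before the notAfter time printed by openssl.

    `enddate_line` looks like 'notAfter=Jun  1 12:00:00 2030 GMT'.
    A negative result means the certificate has already expired.
    """
    not_after = enddate_line.strip().removeprefix("notAfter=")
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400
```

Run it against the certificate used by each cluster's gateway block, since a certificate that is valid in one cluster says nothing about the others.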
Verify firewall rules between clusters. If a firewall change caused the disconnection, revert the rule or add an allow rule for the gateway port between clusters. Confirm with a TCP connectivity test:

```shell
nc -zv <remote-gateway-host> 7522
```

Restart the gateway connection. If the network path is restored but the gateway hasn’t reconnected automatically, send a config reload signal:

```shell
nats server config reload <server-id> --server <affected-server>
```

Or send a SIGHUP to the NATS server process:

```shell
kill -HUP $(pidof nats-server)
```

NATS servers automatically attempt gateway reconnection, but a reload can accelerate the process.
Confirm gateway names are consistent across all clusters. Mismatched gateway names between clusters cause connection failures that look like network issues but are actually configuration problems.
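A name mismatch is easy to miss by eye, so it can help to compare the names mechanically. The sketch below assumes you have already collected, for each cluster, its own gateway name and the remote names its configuration references. How that data is gathered is left open, and `find_name_mismatches` is a hypothetical helper.

```python
def find_name_mismatches(clusters: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each cluster name to remote names that no cluster declares.

    `clusters` maps a cluster's own gateway name to the remote gateway
    names listed in its configuration.
    """
    declared = set(clusters)
    problems = {}
    for name, remotes in clusters.items():
        unknown = sorted(r for r in remotes if r not in declared)
        if unknown:
            problems[name] = unknown
    return problems
```

For example, if one cluster declares itself as `cluster_east` while its peer references `cluster-east`, the peer appears in the result with `cluster-east` flagged as unknown.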
If the remote cluster is down, restore it first. Gateway connections cannot be established if there’s nothing to connect to. Bring at least one server in the remote cluster back online.
Ensure consistent gateway configuration across all servers. Every server in a cluster must have identical gateway blocks. A mismatch causes asymmetric connectivity:

```
gateway {
  name: "cluster-west"
  listen: "0.0.0.0:7522"
  gateways: [
    { name: "cluster-east", urls: ["nats://east-1:7522", "nats://east-2:7522", "nats://east-3:7522"] }
  ]
}
```

List multiple URLs per remote cluster so the gateway can connect to any available server.
Set up TLS certificate rotation monitoring. If gateways use TLS, monitor certificate expiry and rotate well before expiration. Gateway connectivity itself can also be checked programmatically:

```go
// Go: programmatic check of gateway connectivity
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Use system-account credentials; $SYS subjects are not visible otherwise.
	nc, err := nats.Connect("nats://localhost:4222", nats.Name("gateway-monitor"))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Servers periodically publish stats, including per-gateway entries, here.
	if _, err := nc.Subscribe("$SYS.SERVER.*.STATSZ", func(msg *nats.Msg) {
		// Parse the payload and alert when an expected gateway is missing.
		log.Printf("Server stats: %s", string(msg.Data))
	}); err != nil {
		log.Fatal(err)
	}
	select {} // keep the process alive to receive advisories
}
```

```python
# Python: monitor gateway status via /gatewayz
import asyncio
import aiohttp

async def check_gateways(server_url: str, expected_clusters: list[str]) -> set[str]:
    """Return the expected remote clusters that have no outbound gateway."""
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{server_url}/gatewayz") as resp:
            resp.raise_for_status()
            data = await resp.json()
    connected = set(data.get("outbound_gateways", {}).keys())
    missing = set(expected_clusters) - connected
    if missing:
        print(f"ALERT: Missing gateways: {missing}")
    return missing

if __name__ == "__main__":
    asyncio.run(check_gateways("http://localhost:8222", ["cluster-east"]))
```

Use multiple gateway URLs per remote cluster. Don’t point gateways at a single host or load balancer. List all servers in the remote cluster so the gateway can connect to any surviving member:

```
gateways: [
  {
    name: "cluster-east"
    urls: [
      "nats://east-1.example.com:7522"
      "nats://east-2.example.com:7522"
      "nats://east-3.example.com:7522"
    ]
  }
]
```

Monitor gateway health proactively. Don’t wait for application-level failures to notice a gateway drop. Alert on the /gatewayz endpoint and the gateways field in /varz.
Synadia Insights evaluates gateway connectivity automatically every collection epoch and alerts immediately when a previously connected gateway disappears.
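The idea of comparing one collection epoch with the next can be illustrated in a few lines. This sketch is not how Insights is implemented. It only shows the comparison, assuming you keep a snapshot of each server's outbound gateway names per epoch. The `lost_gateways` helper is hypothetical.

```python
def lost_gateways(previous: dict[str, set[str]],
                  current: dict[str, set[str]]) -> dict[str, set[str]]:
    """Per server, gateways present in the previous epoch but absent now.

    A server missing from `current` is treated as having lost every
    gateway it held in `previous`.
    """
    lost = {}
    for server, before in previous.items():
        gone = before - current.get(server, set())
        if gone:
            lost[server] = gone
    return lost
```

Because only transitions are reported, a remote cluster that was never connected does not raise an alert, which matches the definition of a disconnection used throughout this article.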
Implement redundant network paths. For production superclusters spanning regions, use redundant WAN links or VPN tunnels. A single network path between clusters is a single point of failure for all cross-cluster traffic.
Whether messages are lost during a gateway disconnection depends on the messaging pattern. For core NATS pub/sub, messages published to subjects with subscribers in the disconnected cluster will not be delivered to those subscribers, so they are effectively lost. For JetStream streams, data is persisted, but mirrors and sources stop receiving updates until the gateway reconnects. Request-reply patterns will time out rather than lose data, but the calling service will see errors.
NATS servers attempt to reconnect a lost gateway immediately, and retries back off between attempts with randomized jitter. If the underlying network issue is resolved, the gateway typically reconnects within seconds. After reconnection, however, the servers must resynchronize subscription interest state, which can take longer in clusters with many active subscriptions.
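To make the retry behavior concrete, here is a generic illustration of capped exponential backoff with full jitter. It is not the NATS server's actual reconnect schedule. The base delay, cap, and `backoff_delays` helper are assumptions chosen only to show the shape of such a schedule.

```python
import random
from typing import Optional


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> list:
    """Delays for successive retries: doubling ceiling, capped, full jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly between zero and the ceiling.
        delays.append(rng.uniform(0, ceiling))
    return delays
```

The jitter spreads reconnect attempts from many servers over time, so a recovering remote cluster is not hit by all of them at once.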
A supercluster can contain more than two clusters. NATS superclusters support a full mesh of gateway connections between any number of clusters, with each cluster connecting to every other cluster via gateways. The gateway configuration on each server should list all remote clusters. NATS handles the interest propagation across the entire mesh automatically.
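With several clusters, the per-cluster remote lists are tedious to write by hand. The sketch below derives them from a single map of cluster names to gateway URLs. The helper names and the input shape are hypothetical, and rendering the result into server configuration files is left to your templating tool.

```python
def mesh_remotes(urls_by_cluster: dict[str, list[str]]) -> dict[str, list[dict]]:
    """For each cluster, the remote gateway entries it should list."""
    return {
        local: [
            {"name": remote, "urls": urls}
            for remote, urls in urls_by_cluster.items()
            if remote != local
        ]
        for local in urls_by_cluster
    }


def mesh_link_count(n_clusters: int) -> int:
    """Number of cluster pairs in a full mesh of n clusters."""
    return n_clusters * (n_clusters - 1) // 2
```

Three clusters form three cluster pairs and four form six, so the number of gateway relationships to monitor grows quickly as clusters are added.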
Route Count Low (CLUSTER_005) indicates a server has lost connections to peers within the same cluster — the intra-cluster mesh is broken. Gateway Disconnection (CLUSTER_007) means the connection to a different cluster is lost. Routes carry intra-cluster traffic; gateways carry inter-cluster traffic. Both are critical, but they affect different failure domains.
Use the monitoring endpoint to verify gateway status without sending application traffic:

```shell
curl -s http://localhost:8222/gatewayz | jq '.outbound_gateways | to_entries[] | {cluster: .key, configured: .value.configured, connected: (.value.connection != null)}'
```

This queries the gateway state without generating any message traffic. For active testing, publish a test message on a subject that routes across gateways and verify receipt on the other side.