NATS Gateway Disconnection: What It Means and How to Fix It

Severity: Critical
Category: Health
Applies to: Cluster
Check ID: CLUSTER_007
Detection threshold: Gateway present at previous epoch but missing at current

A gateway disconnection means a NATS server lost a gateway connection since the previous epoch — a remote cluster that was connected is now unreachable from the local server. In a supercluster topology, this severs cross-cluster message routing and breaks any workloads that depend on inter-cluster communication. Gateways auto-reconnect with randomized jitter, so transient network issues typically resolve automatically.
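The detection rule can be sketched as a set difference between two collection epochs. This is a minimal illustration rather than the check's actual implementation; the function name and the use of plain sets of remote cluster names are assumptions.

```python
def disconnected_gateways(previous: set[str], current: set[str]) -> set[str]:
    """Return remote clusters seen at the previous epoch but absent now."""
    return previous - current


# Hypothetical epochs for one server: cluster-east was connected, now it is not.
previous_epoch = {"cluster-east", "cluster-eu"}
current_epoch = {"cluster-eu"}
print(disconnected_gateways(previous_epoch, current_epoch))  # {'cluster-east'}
```

A gateway that first appears at the current epoch is not flagged, because only gateways that were present earlier and are now missing count as disconnections.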

Why this matters

Gateways are the backbone of NATS superclusters. Each gateway connection links two clusters, allowing messages, subscriptions, and JetStream operations to flow between them. When a gateway drops, the affected server can no longer route messages to or from the remote cluster. If the disconnection affects all servers in the local cluster, the remote cluster is completely isolated — every cross-cluster publish, request-reply, and JetStream mirror or source stops working.

The blast radius depends on your architecture. If clients in one cluster subscribe to subjects published in another, those subscriptions go dark. Request-reply patterns that span clusters will time out. JetStream streams configured as mirrors or sources of remote streams stop receiving data, and the lag grows every second the gateway is down. For organizations using superclusters for geographic distribution or disaster recovery, a gateway disconnection can mean an entire region loses access to shared data.

Gateway disconnections can also be asymmetric. A server in cluster A might lose its gateway to cluster B while other servers in cluster A maintain theirs. In this case, the affected server routes messages through its cluster peers, at the cost of extra hops and increased latency. If multiple servers lose their gateways simultaneously, the degradation compounds. And because gateway connections carry interest propagation state, a reconnection triggers a full resynchronization of subscription interest, which can cause a brief storm of subscription traffic.

Common causes

  • Network partition between clusters. The most common cause. A firewall change, routing table update, or WAN link failure severs connectivity between the gateway ports of the two clusters. Gateways run on a dedicated port (typically 7522), and that port must be reachable bidirectionally between clusters.

  • Firewall or security group change. A rule update blocked the gateway port without anyone realizing it. This is especially common in cloud environments where security groups are managed separately from application configuration, or during infrastructure-as-code rollouts that inadvertently tighten rules.

  • TLS certificate expiration or mismatch. Gateways enforce TLS when configured. If a certificate expires, is rotated without updating all clusters, or the CA chain doesn’t match between clusters, the TLS handshake fails and the gateway connection drops.

  • DNS resolution failure. If gateways are configured with hostnames rather than IPs, a DNS outage or stale DNS cache prevents the server from resolving the remote gateway address. The connection drops and reconnection attempts fail until DNS recovers.

  • Remote cluster is entirely down. If every server in the remote cluster has crashed, restarted, or been drained, there’s nothing for the gateway to connect to. The disconnection is a symptom of a larger outage in the remote cluster.

  • Gateway configuration removed or changed. A config reload or server restart with an updated configuration that omits a gateway block will drop that gateway connection. This can happen during config management automation that generates server configs from templates.

How to diagnose

Confirm the gateway is missing

Check the gateway status on the affected server:

    nats server report gateways

This reports the gateway connections held by each server. A remote cluster missing from the output confirms the disconnection.

For detailed gateway state, query the monitoring endpoint directly:

    curl -s http://localhost:8222/gatewayz | jq .

The response lists outbound and inbound gateway connections. Look for the remote cluster name — if it’s absent from both outbound_gateways and inbound_gateways, the connection is fully down.

Check if the disconnection is server-specific or cluster-wide

Query multiple servers to determine scope:

    nats server list

Then check gateways on each server:

    curl -s http://<server-ip>:8222/gatewayz | jq '.outbound_gateways | keys'

If all servers in the cluster are missing the remote gateway, it’s a cluster-wide issue (network or remote cluster down). If only one server is affected, it’s likely a local network or configuration issue.
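The scope decision can be expressed as a small helper once the outbound gateway names have been collected from each server. This is a sketch under assumptions: the function name, the server names, and the input shape (server name mapped to the set of remote clusters it reports) are all hypothetical.

```python
def disconnection_scope(outbound_by_server: dict[str, set[str]], remote: str) -> str:
    """Classify a missing gateway as 'none', 'server-specific', or 'cluster-wide'."""
    affected = [server for server, gateways in outbound_by_server.items()
                if remote not in gateways]
    if not affected:
        return "none"
    if len(affected) == len(outbound_by_server):
        return "cluster-wide"      # every local server lost the remote cluster
    return "server-specific"       # only some servers are affected


# Hypothetical result of querying /gatewayz on three servers in cluster-west.
observed = {
    "west-1": {"cluster-east"},
    "west-2": set(),
    "west-3": {"cluster-east"},
}
print(disconnection_scope(observed, "cluster-east"))  # server-specific
```

A "cluster-wide" result points toward the network path or the remote cluster, while "server-specific" points toward the local server's network or configuration.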

Check server logs for gateway errors

Server logs record gateway disconnection reasons:

    [WRN] Gateway connection to "cluster-east" lost
    [ERR] Error connecting to gateway "cluster-east": dial tcp 10.0.2.10:7522: connect: connection refused

Key patterns to look for:

  • connection refused — remote gateway port is not listening (server down or port blocked)
  • i/o timeout — network path is blocked or too slow
  • TLS handshake error — certificate issue
  • no such host — DNS resolution failed
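When scanning large log files, the patterns above can be matched mechanically. The sketch below is one way to do that, assuming plain substring matching is enough; the function name is hypothetical and the cause descriptions are taken from the list above.

```python
from typing import Optional

# Substring to look for (lowercase) and the likely cause it indicates.
LOG_PATTERNS = [
    ("connection refused", "remote gateway port is not listening (server down or port blocked)"),
    ("i/o timeout", "network path is blocked or too slow"),
    ("tls handshake error", "certificate issue"),
    ("no such host", "DNS resolution failed"),
]


def classify_gateway_error(line: str) -> Optional[str]:
    """Return the likely cause for a gateway log line, or None if nothing matches."""
    lowered = line.lower()
    for needle, cause in LOG_PATTERNS:
        if needle in lowered:
            return cause
    return None


sample = ('[ERR] Error connecting to gateway "cluster-east": '
          'dial tcp 10.0.2.10:7522: connect: connection refused')
print(classify_gateway_error(sample))
```

Matching is case-insensitive because the exact capitalization of TLS errors can differ between server versions.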

Verify network connectivity

Test the gateway port directly from the affected server:

    # Test TCP connectivity to the remote gateway
    nc -zv <remote-gateway-host> 7522
    # If DNS-based, verify resolution
    dig <remote-gateway-host>
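Where nc or dig are not installed, the same two checks can be approximated with the Python standard library. This is a rough sketch: the function name and result labels are hypothetical, and the default port of 7522 simply mirrors the examples in this article.

```python
import socket


def check_gateway_endpoint(host: str, port: int = 7522, timeout: float = 3.0) -> str:
    """Return 'dns-failure', 'refused', 'unreachable', or 'reachable'."""
    try:
        # Resolve first so DNS problems are reported separately from TCP problems.
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"

    result = "unreachable"
    for family, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "reachable"
        except ConnectionRefusedError:
            result = "refused"      # port closed or server down
        except OSError:
            result = "unreachable"  # timeout, no route, or filtered port
    return result
```

Calling check_gateway_endpoint with the remote gateway hostname from the affected server gives a result that maps onto the log patterns: "refused" corresponds to connection refused, "unreachable" to i/o timeout, and "dns-failure" to no such host. A "reachable" result only proves the TCP port is open, so a TLS problem can still prevent the gateway from connecting.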

Check the remote cluster

If network connectivity looks fine, verify the remote cluster is healthy:

    # Connect to the remote cluster directly
    nats server list --server nats://<remote-cluster-host>:4222

If the remote cluster is unresponsive, the gateway disconnection is a secondary symptom. Focus on restoring the remote cluster first.

How to fix it

Immediate: restore connectivity

Gateway connections auto-reconnect with randomized jitter, so transient network issues resolve automatically. If the disconnection persists, investigate the following causes in order:

Check TLS certificate validity. This is a common cause — especially stale OCSP responses when OCSP stapling is enabled. Verify certificates haven’t expired and OCSP responders are reachable.
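One way to check expiry is to read the certificate's end date (for example with openssl x509 -enddate -noout -in on the certificate file, which prints a notAfter= line) and compute the remaining days. The helper below is a sketch; its name is hypothetical, and it assumes you pass the date text that follows notAfter=.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter time (negative if already expired).

    `not_after` uses the format OpenSSL prints, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


# This example date is long past, so the result is negative (expired).
remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT")
print("expired" if remaining < 0 else f"{remaining:.0f} days left")  # expired
```

Run the same check against the certificates of every cluster, since a certificate that expired on either side of the gateway breaks the handshake.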

Verify firewall rules between clusters. If a firewall change caused the disconnection, revert the rule or add an allow rule for the gateway port between clusters. Confirm with a TCP connectivity test:

    nc -zv <remote-gateway-host> 7522

Restart the gateway connection. If the network path is restored but the gateway hasn’t reconnected automatically, send a config reload signal:

    nats server config reload <server-id> --server <affected-server>

Or send a SIGHUP to the NATS server process:

    kill -HUP $(pidof nats-server)

NATS servers automatically attempt gateway reconnection, but a reload can accelerate the process.

Confirm gateway names are consistent across all clusters. Mismatched gateway names between clusters cause connection failures that look like network issues but are actually configuration problems.
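A name mismatch is easy to miss by eye, so a cross-check over the configured names can help. The sketch below assumes the local gateway name and the remote names have already been extracted from each cluster's configuration into plain dictionaries; the function name, labels, and data shape are hypothetical.

```python
def find_name_mismatches(configs: dict[str, dict]) -> list[str]:
    """Report remote gateway names that no cluster declares as its own name.

    `configs` maps a label to {"name": local gateway name, "remotes": [remote names]}.
    """
    declared = {cfg["name"] for cfg in configs.values()}
    problems = []
    for label, cfg in configs.items():
        for remote in cfg["remotes"]:
            if remote not in declared:
                problems.append(f"{label}: remote '{remote}' matches no cluster's gateway name")
    return problems


# Hypothetical configs: the east cluster uses an underscore in its name.
configs = {
    "west": {"name": "cluster-west", "remotes": ["cluster-east"]},
    "east": {"name": "cluster_east", "remotes": ["cluster-west"]},
}
for problem in find_name_mismatches(configs):
    print(problem)  # west: remote 'cluster-east' matches no cluster's gateway name
```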

If the remote cluster is down, restore it first. Gateway connections cannot be established if there’s nothing to connect to. Bring at least one server in the remote cluster back online.

Short-term: harden gateway connectivity

Ensure consistent gateway configuration across all servers. Every server in a cluster must have identical gateway blocks. A mismatch causes asymmetric connectivity:

    gateway {
      name: "cluster-west"
      listen: "0.0.0.0:7522"
      gateways: [
        { name: "cluster-east", urls: ["nats://east-1:7522", "nats://east-2:7522", "nats://east-3:7522"] }
      ]
    }

List multiple URLs per remote cluster so the gateway can connect to any available server.

Set up TLS certificate rotation monitoring. If gateways use TLS, monitor certificate expiry and rotate well before expiration. Pair this with a programmatic check of gateway connectivity, for example:

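    // Go: programmatic check of gateway connectivity.
    // Assumption: the connection authenticates as a system-account user,
    // which requests on $SYS subjects require.
    package main

    import (
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect("nats://localhost:4222", nats.Name("gateway-monitor"))
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        // Request gateway state over the system account; the first server to reply is used.
        msg, err := nc.Request("$SYS.REQ.SERVER.PING.GATEWAYZ", nil, 2*time.Second)
        if err != nil {
            log.Fatalf("gatewayz request failed: %v", err)
        }
        // Parse the reply and alert when an expected remote cluster is missing.
        log.Printf("Gateway state: %s", string(msg.Data))
    }

    # Python: monitor gateway status via /gatewayz
    import asyncio
    import aiohttp

    async def check_gateways(server_url: str, expected_clusters: list[str]) -> set[str]:
        """Return the expected remote clusters that have no outbound gateway."""
        async with aiohttp.ClientSession() as session:
            async with session.get(f"{server_url}/gatewayz") as resp:
                resp.raise_for_status()
                data = await resp.json()
        # outbound_gateways can be null or absent when no gateway is connected
        connected = set((data.get("outbound_gateways") or {}).keys())
        missing = set(expected_clusters) - connected
        if missing:
            print(f"ALERT: Missing gateways: {missing}")
        return missing

    if __name__ == "__main__":
        asyncio.run(check_gateways("http://localhost:8222", ["cluster-east"]))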

Long-term: build resilience into the supercluster topology

Use multiple gateway URLs per remote cluster. Don’t point gateways at a single host or load balancer. List all servers in the remote cluster so the gateway can connect to any surviving member:

    gateways: [
      {
        name: "cluster-east"
        urls: [
          "nats://east-1.example.com:7522"
          "nats://east-2.example.com:7522"
          "nats://east-3.example.com:7522"
        ]
      }
    ]

Monitor gateway health proactively. Don’t wait for application-level failures to notice a gateway drop. Alert on the /gatewayz endpoint and the gateways field in /varz.

Synadia Insights evaluates gateway connectivity automatically every collection epoch and alerts immediately when a previously connected gateway disappears.

Implement redundant network paths. For production superclusters spanning regions, use redundant WAN links or VPN tunnels. A single network path between clusters is a single point of failure for all cross-cluster traffic.

Frequently asked questions

Does a gateway disconnection cause message loss?

It depends on the messaging pattern. For core NATS pub/sub, messages published to subjects with subscribers in the disconnected cluster will not be delivered to those subscribers — they are effectively lost. For JetStream streams, data is persisted, but mirrors and sources stop receiving updates until the gateway reconnects. Request-reply patterns will time out rather than lose data, but the calling service will see errors.

How quickly does NATS reconnect a dropped gateway?

NATS servers start trying to reconnect as soon as a gateway is lost, and space their attempts with randomized jitter so that servers do not all retry at the same moment. If the underlying network issue is resolved, the gateway typically reconnects within seconds. However, after reconnection, the servers must resynchronize subscription interest state, which can take longer in clusters with many active subscriptions.

Can I have gateways between more than two clusters?

Yes. NATS superclusters support a full mesh of gateway connections between any number of clusters. Each cluster connects to every other cluster via gateways. The gateway configuration on each server should list all remote clusters. NATS handles the interest propagation across the entire mesh automatically.
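For a full mesh, the remote clusters each cluster should list follow directly from the set of cluster names. The helper below is a small sketch of that bookkeeping; the function name and the cluster names are hypothetical.

```python
def expected_remotes(clusters: list[str]) -> dict[str, list[str]]:
    """For each cluster, list every other cluster it should have a gateway to."""
    return {local: [other for other in clusters if other != local]
            for local in clusters}


mesh = expected_remotes(["cluster-west", "cluster-east", "cluster-eu"])
print(mesh["cluster-west"])  # ['cluster-east', 'cluster-eu']
```

The per-cluster lists can double as the expected set when alerting on gateways missing from /gatewayz.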

What’s the difference between Gateway Disconnection and Route Count Low?

Route Count Low (CLUSTER_005) indicates a server has lost connections to peers within the same cluster — the intra-cluster mesh is broken. Gateway Disconnection (CLUSTER_007) means the connection to a different cluster is lost. Routes carry intra-cluster traffic; gateways carry inter-cluster traffic. Both are critical, but they affect different failure domains.

How do I test gateway connectivity without affecting production?

Use the monitoring endpoint to verify gateway status without sending application traffic:

    curl -s http://localhost:8222/gatewayz | jq '(.outbound_gateways // {}) | to_entries[] | {cluster: .key, configured: .value.configured, uptime: .value.connection.uptime}'

This queries the gateway state without generating any load. For active testing, publish a test message on a subject that routes across gateways and verify receipt on the other side.
