NATS High Gateway RTT: What It Means and How to Fix It

Severity: Warning
Category: Performance
Applies to: Server
Check ID: SERVER_018
Detection threshold: gateway connection RTT exceeds configured threshold

A NATS gateway connection between two clusters has a round-trip time (RTT) above the configured threshold. Cross-cluster message delivery is slower than expected, which can degrade request-reply latency, increase JetStream replication lag, and cause timeouts for workloads that span clusters.

Why this matters

Gateways are NATS’s mechanism for connecting independent clusters into a supercluster. Unlike routes (which connect servers within a single cluster), gateways carry traffic between geographically or logically separated clusters. Every message that crosses a cluster boundary — whether through subject interest propagation, queue group load balancing, or JetStream cross-cluster access — traverses a gateway connection. Gateway RTT directly determines the latency floor for all cross-cluster operations.

For request-reply patterns that span clusters, the gateway is crossed twice: the request crosses it once, and the reply crosses it again. Because RTT already measures the full round trip, a gateway with 100ms RTT adds roughly 100ms to every cross-cluster request-reply exchange, on top of the client-to-server latency at each end. For services with tight SLA requirements, this can push response times beyond acceptable thresholds. Clients may start timing out, retrying, and amplifying load on an already slow path.

JetStream replication across clusters is particularly sensitive to gateway latency. When a stream has replicas in different clusters, every published message must be acknowledged by remote replicas via Raft consensus. High gateway RTT slows Raft round-trips, increasing replication lag and reducing the effective write throughput of cross-cluster replicated streams. In extreme cases, sustained high RTT can cause Raft leader elections to time out, triggering unnecessary leader changes that further disrupt operations.

Common causes

  • Geographic distance between clusters. The speed of light imposes a hard floor on RTT. US East to US West is approximately 60-80ms. US to Europe is 80-120ms. US to Asia-Pacific is 150-250ms. If your clusters are in distant regions, some gateway RTT is inherent and expected.

  • Suboptimal network routing. Traffic between clusters may traverse unnecessary hops, transit through congested peering points, or route through indirect paths. Cloud provider inter-region networking often uses the public internet by default, which adds latency and variability compared to dedicated interconnects.

  • Network congestion on the gateway path. Shared network infrastructure between clusters — transit links, VPN tunnels, peering exchanges — can become congested during peak traffic. RTT spikes correlate with bandwidth saturation on intermediate links.

  • VPN or overlay network overhead. If gateway traffic crosses a VPN, WireGuard tunnel, or cloud overlay network (e.g., transit gateways, peering connections), the encapsulation and encryption add latency. Each hop in the overlay adds processing time.

  • Gateway connection routing through an indirect server. In NATS superclusters, gateway connections may not always connect the two closest servers. If the gateway is established between servers that are not the optimal pair, latency is higher than necessary.

  • High gateway traffic volume. Large message volumes on gateway connections can saturate the connection’s throughput capacity, causing queuing delays that manifest as increased RTT. This is related to but distinct from network congestion — the NATS connection itself becomes the bottleneck.

How to diagnose

Check gateway RTT across the cluster

Terminal window
nats server report gateways

This reports all gateway connections with their RTT values. Identify which cluster-to-cluster connections have elevated RTT and which specific server pairs are affected.

Measure RTT from the client perspective

Terminal window
# Check RTT to your connected server
nats rtt
# Check RTT across all servers
nats server list

If client RTT to the local cluster is normal but cross-cluster operations are slow, the gateway path is the bottleneck.

Check gateway pending bytes

Terminal window
nats server report gateways --json | jq '.[] | {name: .name, rtt: .rtt, pending: .pending_bytes}'

High pending bytes combined with high RTT indicates the gateway connection is saturated — messages are queuing faster than they can be transmitted across the link.
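
The rule above can be sketched as a small filter. This is a minimal illustration, not part of the NATS tooling: the function name, the thresholds, the `rtt_ms` field, and the sample records are all hypothetical, and the real CLI JSON may use different field names and units.

```python
def find_saturated_gateways(gateways, rtt_threshold_ms, pending_threshold_bytes):
    """Return names of gateways where BOTH RTT and pending bytes are high."""
    return [
        gw["name"]
        for gw in gateways
        if gw["rtt_ms"] >= rtt_threshold_ms
        and gw["pending_bytes"] >= pending_threshold_bytes
    ]


# Hypothetical sample records, not real CLI output.
sample = [
    {"name": "us-west", "rtt_ms": 72.0, "pending_bytes": 0},
    {"name": "eu-central", "rtt_ms": 240.0, "pending_bytes": 8_388_608},
    {"name": "ap-south", "rtt_ms": 260.0, "pending_bytes": 0},
]

print(find_saturated_gateways(sample, rtt_threshold_ms=150.0,
                              pending_threshold_bytes=1_048_576))
```

In this sample, `ap-south` has high RTT but nothing queued, so it is not flagged: high RTT alone points at the network path rather than at a saturated connection.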

Compare with network-level latency

Terminal window
# From the server host, measure raw network RTT to the remote cluster
ping <remote_cluster_server_ip>
# Or use a more accurate TCP-level measurement
nats rtt --server nats://<remote_cluster_server>:4222

If the raw network RTT matches the gateway RTT, the latency is network-inherent. If gateway RTT is significantly higher than network RTT, something in the NATS layer (TLS overhead, message volume, pending buffers) is adding latency.
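
One way to make that comparison repeatable is a simple ratio test. The sketch below is an assumption-laden illustration: the function name and the 1.5x tolerance are hypothetical choices, not values defined by NATS or Insights, and should be tuned to your environment.

```python
def classify_gateway_latency(gateway_rtt_ms, network_rtt_ms, tolerance=1.5):
    """Compare gateway RTT against raw network RTT.

    tolerance is a hypothetical cutoff: gateway RTT within
    tolerance * network RTT is treated as network-inherent.
    """
    if network_rtt_ms <= 0:
        raise ValueError("network_rtt_ms must be positive")
    ratio = gateway_rtt_ms / network_rtt_ms
    if ratio <= tolerance:
        return "network-inherent"
    return "nats-layer overhead"


print(classify_gateway_latency(gateway_rtt_ms=82.0, network_rtt_ms=78.0))
print(classify_gateway_latency(gateway_rtt_ms=210.0, network_rtt_ms=78.0))
```

Take several measurements of each value before comparing, since a single ping or RTT sample can be skewed by a transient spike.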

Check for gateway traffic volume

Terminal window
nats server report gateways --json | jq '.[] | {name: .name, in_msgs: .in_msgs, out_msgs: .out_msgs, in_bytes: .in_bytes, out_bytes: .out_bytes}'

High message or byte rates on the gateway connection suggest traffic volume may be contributing to RTT increases.
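
If the reported counters are cumulative totals (an assumption worth verifying against your server version), a rate can be derived by sampling twice and dividing the difference by the interval. The helper name and the sample values below are hypothetical.

```python
FIELDS = ("in_msgs", "out_msgs", "in_bytes", "out_bytes")


def gateway_rates(earlier, later, interval_seconds):
    """Per-second rates from two counter samples taken interval_seconds apart.

    Assumes each field is a cumulative counter that only increases.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        field + "_per_sec": (later[field] - earlier[field]) / interval_seconds
        for field in FIELDS
    }


# Hypothetical samples taken 10 seconds apart.
t0 = {"in_msgs": 1_000, "out_msgs": 2_000, "in_bytes": 50_000, "out_bytes": 90_000}
t1 = {"in_msgs": 4_000, "out_msgs": 2_500, "in_bytes": 350_000, "out_bytes": 140_000}

print(gateway_rates(t0, t1, interval_seconds=10))
```

Comparing these rates during normal and elevated RTT periods helps show whether RTT increases track traffic volume.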

How to fix it

Immediate: determine if the RTT is expected

Check geographic distance. If your clusters are in different regions, calculate the expected RTT based on physical distance. There’s a hard floor you cannot go below:

Route                 Expected RTT
Same region           1-5ms
US East ↔ US West     60-80ms
US ↔ Europe           80-120ms
US ↔ Asia-Pacific     150-250ms
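
For links not listed here, a theoretical lower bound can be estimated from distance alone. The sketch below assumes light in optical fiber covers roughly 200 km per millisecond (about two thirds of its speed in a vacuum); the function name and the 4,000 km figure are illustrative, not measurements.

```python
# Approximate speed of light in optical fiber: about 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def min_rtt_ms(distance_km):
    """Theoretical minimum round-trip time over fiber for a given path length."""
    return 2 * distance_km / FIBER_KM_PER_MS


# Illustrative 4,000 km path, roughly a US coast-to-coast distance.
print(min_rtt_ms(4000))  # 40.0
```

Real fiber routes are longer than the straight-line distance and pass through routers and switches, which is why observed values such as those listed here sit above this theoretical bound.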

If your measured RTT is close to the expected floor for the geographic distance, the gateway is performing normally. Adjust the check threshold to avoid false alerts:

Terminal window
# Update threshold in Insights configuration if the RTT is expected
# for your deployment topology
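
One way to choose a threshold is to derive it from a baseline of RTT samples collected while the link is healthy. This is a minimal sketch under stated assumptions: the function name, the median-plus-headroom approach, and the 25% headroom are hypothetical choices, not an Insights feature.

```python
import statistics


def suggest_threshold_ms(samples_ms, headroom=1.25):
    """Suggest an alert threshold: median healthy RTT plus a headroom factor."""
    if not samples_ms:
        raise ValueError("need at least one RTT sample")
    return statistics.median(samples_ms) * headroom


# Hypothetical healthy-period samples for one gateway link, in milliseconds.
baseline = [80, 82, 79, 85, 81]
print(suggest_threshold_ms(baseline))  # 101.25
```

The median is used rather than the mean so that a few transient spikes in the baseline do not inflate the threshold.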

If RTT is higher than expected, proceed with network and configuration investigation.

Short-term: optimize the network path

Use dedicated interconnects. Cloud providers offer dedicated inter-region connectivity (AWS Transit Gateway, GCP Cloud Interconnect, Azure ExpressRoute) that avoids public internet routing. These typically reduce RTT by 20-40% and significantly reduce jitter.

Optimize VPN/tunnel configuration. If using a VPN between clusters, ensure encryption is hardware-accelerated and the tunnel endpoint is on the same host or in the same availability zone as the NATS server. WireGuard generally adds less overhead than IPsec.

Reduce gateway traffic volume. Minimize unnecessary cross-cluster message flow by localizing subject interest:

// Go - connect to the local cluster to avoid gateway hops
nc, err := nats.Connect("nats://local-cluster:4222",
    nats.Name("order-processor"),
)
if err != nil {
    log.Fatal(err)
}

// Subscribe to subjects served locally (handler is a nats.MsgHandler defined elsewhere)
sub, err := nc.Subscribe("orders.region-east.>", handler)
if err != nil {
    log.Fatal(err)
}

# Python - use cluster-aware connection
import asyncio
import nats

async def main():
    # Connect to the nearest cluster
    nc = await nats.connect("nats://local-cluster:4222")

    # Structure subjects to keep traffic local (handler is an async callback defined elsewhere)
    await nc.subscribe("orders.region-east.>", cb=handler)

asyncio.run(main())

Long-term: architectural changes

Place streams closer to consumers. If consumers in Cluster B are reading streams hosted in Cluster A, every fetch crosses the gateway. Create a local mirror or source in Cluster B:

Terminal window
# Create a mirror of the remote stream in the local cluster.
# External-API mirroring (cross-domain) is configured via the JSON form
# of --mirror or via interactive prompts; there is no `--mirror-api-prefix`
# flag on `nats stream add`.
nats stream add ORDERS-MIRROR \
--mirror nats:ORDERS@hub \
--storage file \
--replicas 3

This keeps reads local and only the replication traffic crosses the gateway.

Use subject-based partitioning to reduce cross-cluster traffic. Structure your subject namespace so that most publish-subscribe interest is satisfied within a single cluster:

# nats-server.conf - gateway configuration with optimized accounts
gateway {
  name: "us-east"
  port: 7222
  gateways: [
    {name: "us-west", urls: ["nats://us-west-1:7222", "nats://us-west-2:7222"]}
    {name: "eu-central", urls: ["nats://eu-1:7222", "nats://eu-2:7222"]}
  ]
}

Consider hub-and-spoke topology for high-latency links. Rather than connecting every site directly to every other site, route through a central hub cluster with good connectivity to all regions. Because NATS gateways form a full mesh between clusters, a hub-and-spoke layout is usually built by attaching spoke sites to the hub as leafnodes. This can reduce the number of high-latency links and simplify traffic management, though it adds a hop for some paths.

Monitor gateway traffic ratios. The High Gateway Traffic Ratio optimization check (OPT_PLACE_003) identifies when excessive traffic crosses gateways. Use it alongside this check to determine whether the gateway RTT is a network problem or a placement problem.

Frequently asked questions

Is high gateway RTT always a problem?

No. Gateways between geographically distant clusters will inherently have high RTT due to the speed of light. A 100ms RTT between US and Europe is normal and expected. This check is most useful for detecting unexpected increases — RTT that exceeds what the physical distance and network topology should produce. Tune the threshold to match your expected baseline for each gateway link.

How does high gateway RTT affect JetStream cross-cluster replication?

Every Raft consensus round-trip for a cross-cluster replicated stream must traverse the gateway. A 100ms gateway RTT means each Raft round takes at least 100ms, because the proposal travels one way and the acknowledgment travels back. This limits the maximum write throughput for cross-cluster streams and increases replication lag. For write-heavy streams, consider placing all replicas within a single cluster and using mirrors for cross-cluster read access.

What is the difference between gateway RTT and route RTT?

Route RTT (SERVER_010) measures latency between servers within the same cluster. Gateway RTT (this check) measures latency between servers in different clusters. Routes are typically on the same network segment with sub-millisecond latency. Gateways cross network boundaries and have higher, more variable latency. The same diagnostic approach applies to both, but the expected baseline values are very different.

Can I have multiple gateway connections between the same two clusters?

Yes. NATS automatically establishes gateway connections between cluster members. The number of gateway connections scales with cluster size — each server can have gateway connections to servers in remote clusters. NATS routes messages through the optimal gateway path automatically. You don’t need to manage individual gateway connections.

Should I use gateways or leafnodes for a high-latency link?

It depends on the use case. Leafnodes connect a single server (or small cluster) to a hub cluster and are simpler to configure. Gateways connect full clusters bidirectionally. For edge locations with one or two servers and a high-latency link to the core, leafnodes are often more appropriate. For connecting two production clusters that need full bidirectional messaging, gateways are the right choice. Leafnodes also have their own RTT check: High Leaf RTT.
