NATS High Route RTT: What It Means and How to Fix It

Severity: Warning
Category: Performance
Applies to: Server
Check ID: SERVER_010
Detection threshold: route connection RTT exceeds configured threshold

High Route RTT means the round-trip time on one or more route connections between NATS cluster peers exceeds the configured threshold. Routes are the intra-cluster connections that replicate messages, propagate subscription interest, and carry Raft consensus traffic — elevated latency on these connections directly impacts cluster performance.

Why this matters

NATS clusters form a full mesh of route connections between every server in the cluster. Every message published on one server that has subscribers on another server traverses a route. Every Raft heartbeat, vote, and append-entries request between JetStream stream leaders and followers travels over routes. Route latency is the floor for all inter-server operations.

When route RTT increases, Raft consensus slows proportionally. Raft heartbeats are sent every second by default, with election timeouts of 4–9 seconds. If route RTT approaches or exceeds 1 second, heartbeats arrive late, followers suspect the leader has failed, and unnecessary elections begin. Frequent leader elections (META_003) cause JetStream API operations to stall during each election cycle — typically 4–9 seconds of unavailability per election.

The impact extends beyond Raft. Core NATS message delivery between servers incurs at least one route RTT of additional latency. For request-reply patterns where the requester and responder are on different servers, the round trip crosses the route twice — once for the request, once for the reply. A 50ms route RTT adds 100ms to every cross-server request-reply, which may violate application SLAs. At high message volumes, elevated route RTT causes pending data to accumulate on route connections, eventually triggering Route Pending Pressure (OPT_SYS_005).
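
You can observe this cost directly with the NATS CLI. A minimal sketch, assuming a multi-server cluster and an illustrative subject name: run a responder connected to one server, send requests through another, and compare the reported round-trip time against the route RTT.

Terminal window
# Responder connected to server2 (subject name is illustrative)
nats reply demo.echo "pong" --server nats://server2:4222
# In another terminal: requests via server1 cross the route twice
nats request demo.echo "ping" --server nats://server1:4222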

Common causes

  • Geographic distance between cluster peers. Cluster servers placed in different availability zones or regions introduce unavoidable network latency. A cluster spanning US-East and US-West adds ~60–80ms of route RTT due to speed-of-light constraints. Clusters should be regional; cross-region connectivity belongs on gateway connections.

  • Network congestion on the cluster port. Shared network infrastructure (switches, firewalls, load balancers) between cluster peers is saturated. Route traffic competes with other workloads for bandwidth, increasing queuing delay. This is common in cloud environments with burstable network bandwidth.

  • Firewall or deep packet inspection. Stateful firewalls or intrusion detection systems inspecting traffic on the cluster port add per-packet latency. TLS-encrypted route traffic may trigger additional inspection overhead. Some cloud security groups add measurable latency compared to VPC-internal routing.

  • Server resource contention. A server under heavy CPU load or experiencing garbage collection pauses may be slow to process incoming route data. The route RTT measurement includes the remote server’s processing time — if the remote peer is CPU-starved, RTT increases even with a healthy network.

  • DNS resolution latency. If route URLs use hostnames, DNS resolution at connection establishment can add latency. While this primarily affects route reconnection rather than steady-state RTT, slow DNS can cause route RTT spikes during cluster topology changes.

  • Virtualization or container overhead. Running NATS servers in VMs with overcommitted hosts or containers with CPU throttling introduces unpredictable latency. Container network overlays (Calico, Weave, Cilium) add per-packet encapsulation overhead compared to host networking. A quick throttling check follows this list.
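
For the container case, you can confirm CPU throttling by reading cgroup statistics from inside the container. A minimal sketch; paths vary by cgroup version and runtime, so both common layouts are shown:

Terminal window
# cgroup v2: nonzero nr_throttled / throttled_usec means the container is being throttled
cat /sys/fs/cgroup/cpu.stat
# cgroup v1 layout
cat /sys/fs/cgroup/cpu/cpu.stat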

How to diagnose

Measure route RTT from the NATS CLI

Terminal window
nats server list

The rtt column in the output shows latency from your vantage point to each server rather than between servers; use it to spot a server whose network path stands out, then drill into per-route RTT with the steps below.

Check RTT on individual servers

Terminal window
nats rtt --server nats://server1:4222
nats rtt --server nats://server2:4222

Compare RTT from the same client to each server. If one server shows significantly higher RTT, the problem may be localized to that server’s network path.
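
A simple loop makes the comparison repeatable; the server names are illustrative:

Terminal window
for s in server1 server2 server3; do
  echo "== $s =="
  nats rtt --server "nats://$s:4222"
done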

Inspect route details via the monitoring endpoint

Terminal window
curl http://localhost:8222/routez | jq '.routes[] | {rid, remote_id, ip, rtt}'

This shows per-route RTT measurements from the server’s perspective, including the remote server’s IP address for network-level investigation.
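
To distinguish a transient spike from sustained latency, sample the endpoint repeatedly:

Terminal window
# Sample per-route RTT every 5 seconds
watch -n 5 'curl -s http://localhost:8222/routez | jq ".routes[] | {remote_id, rtt}"'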

Test raw network latency

Terminal window
# From server1, ping server2 on the cluster port
ping -c 20 server2
# For more precise measurements, use TCP-level tools
hping3 -S -p 6222 -c 10 server2

Compare raw network latency with NATS route RTT. If network latency is low but route RTT is high, the problem is server-side processing, not the network.

Check for CPU contention on the remote peer

Terminal window
nats server report cpu

If the server with high route RTT is also showing elevated CPU, the processing delay is likely the root cause.
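
To confirm on the host itself, sample per-process CPU for the server (a sketch assuming a single process named nats-server and the sysstat tools installed):

Terminal window
# Five one-second samples of CPU usage for the nats-server process
pidstat -p $(pgrep -x nats-server) 1 5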

How to fix it

Immediate: identify and eliminate network bottlenecks

Check for firewall inspection overhead. If traffic between cluster peers passes through a stateful firewall or IDS, test bypassing it temporarily. Route traffic between trusted servers in the same network should not need deep packet inspection:

Terminal window
# Test latency with and without the firewall path
# Direct connectivity test (if possible)
nats rtt --server nats://server2-direct:4222
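
A TCP traceroute to the cluster port can expose firewall or middlebox hops on the route path (typically requires root, and some networks block it):

Terminal window
# Trace the TCP path to the cluster port, hop by hop
traceroute -T -p 6222 server2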

Verify network bandwidth isn’t saturated. Check interface utilization on the servers:

Terminal window
# Monitor network throughput on the route interface
sar -n DEV 1 5
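
Also confirm the negotiated link speed and check for errors or drops on the interface carrying route traffic; the interface name here is illustrative:

Terminal window
ethtool eth0 | grep -i speed
ip -s link show eth0   # RX/TX errors and drops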

Short-term: optimize network path and server resources

Co-locate cluster peers in the same region. If cluster servers span regions, migrate to a single-region deployment. Use gateways for cross-region connectivity instead of routes:

# server.conf for cross-region: use gateways, not routes
gateway {
  name: "us-east"
  listen: 0.0.0.0:7222
  gateways: [
    {name: "us-west", urls: ["nats://west-gw:7222"]}
  ]
}
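
The mirrored configuration on the us-west side names its own cluster and points back at the east gateway; hostnames are illustrative:

# server.conf on the us-west side
gateway {
  name: "us-west"
  listen: 0.0.0.0:7222
  gateways: [
    {name: "us-east", urls: ["nats://east-gw:7222"]}
  ]
}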

Use host networking for containers. If route RTT is elevated due to container network overlay overhead, switch NATS containers to host networking:

# Kubernetes: use hostNetwork to bypass overlay
spec:
  hostNetwork: true
  containers:
    - name: nats
      ports:
        - containerPort: 4222
          hostPort: 4222
        - containerPort: 6222
          hostPort: 6222
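
One caveat: with hostNetwork: true, the pod uses the node's DNS configuration by default. If the NATS containers resolve peers through cluster DNS, also set dnsPolicy:

# Keep cluster DNS resolution while using host networking
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet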

Reduce CPU contention. If the high-RTT server is CPU-bound, address the CPU issue first (see SERVER_003). Route processing competes with message routing and Raft operations for CPU time. Increasing server resources or reducing workload on that server improves route RTT as a side effect.
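
In Kubernetes, a sketch of reserving dedicated CPU for the NATS container; the values are illustrative, and equal requests and limits place the pod in the Guaranteed QoS class:

resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi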

Long-term: design for low-latency clustering

Dedicated network for cluster traffic. Use a separate network interface or VLAN for route traffic, isolated from client traffic and external workloads. This eliminates bandwidth contention:

cluster {
  listen: 10.0.1.1:6222  # dedicated cluster network
  routes: [
    nats://10.0.1.2:6222
    nats://10.0.1.3:6222
  ]
}
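
After the change, verify that route connections actually use the dedicated network:

Terminal window
# Established route connections should show 10.0.1.x addresses
ss -tnp | grep 6222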

Monitor route RTT continuously. Set up Prometheus alerting on route RTT before it reaches the threshold.
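
A sketch of such an alert rule, assuming your exporter publishes a per-route RTT metric derived from /routez; the metric name below is hypothetical, so substitute whatever your exporter actually exposes:

groups:
  - name: nats-routes
    rules:
      - alert: NATSRouteRTTHigh
        # hypothetical metric name; check your exporter's /routez metrics
        expr: nats_route_rtt_seconds > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NATS route RTT above 10ms"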

Synadia Insights evaluates route RTT every collection epoch and fires this check when any route connection exceeds the threshold, with per-route attribution so you can identify exactly which server pair is affected.

Frequently asked questions

What route RTT is considered normal for a NATS cluster?

Within a single datacenter or availability zone, route RTT should be under 2ms. Within the same cloud region across availability zones, 1–5ms is typical. Anything above 10ms within a region warrants investigation. Above 50ms, Raft consensus begins to degrade noticeably, and above 100ms, leader elections become frequent.

How is route RTT different from gateway RTT?

Routes are intra-cluster connections between servers in the same cluster. They carry message replication, subscription interest, and Raft consensus traffic. Gateway connections are inter-cluster — they connect separate NATS clusters across regions or organizational boundaries. Higher RTT on gateways is expected and acceptable because they carry a different traffic profile. High route RTT is more concerning because it directly impacts Raft consensus within the cluster.

Can high route RTT cause data loss?

Not directly. Raft consensus ensures that committed data is durable across replicas regardless of latency. However, high route RTT can cause leader elections, and during an election, JetStream writes are temporarily unavailable (not lost). For core NATS (non-JetStream), high route RTT increases the probability of slow consumer events on cross-server subscriptions, which can cause message drops.

Should I tune Raft timeouts to tolerate higher route RTT?

No. Raft timeouts in NATS are not user-configurable, and for good reason — they’re calibrated for the expected intra-cluster latency profile. If your route RTT is high enough to trigger Raft elections, the correct fix is to reduce route RTT (co-locate servers, fix network issues), not to make Raft more tolerant of bad conditions.

How does route RTT affect JetStream publish latency?

For R1 streams, route RTT has no impact on publish latency — writes are local. For R3 streams, a publish is acknowledged after the leader and at least one follower have persisted the message. The publish latency is at minimum the time for one route round trip (leader to follower and back). With 5ms route RTT, R3 stream publishes add at least 5ms over R1 publishes.

Proactive monitoring for NATS High Route RTT with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial