High leaf RTT means the round-trip time between a leafnode server and its hub cluster exceeds the configured threshold (default: 100ms). Leafnodes bridge messages between edge locations and the central cluster. When the link is slow, every message crossing the leafnode boundary inherits that latency — request-reply calls time out, JetStream publish acknowledgments lag, and clients connected to the leaf experience degraded performance for any operation that touches the hub.
Leafnodes are designed for extending NATS to remote locations — branch offices, edge computing sites, IoT gateways, regional deployments. Some latency is expected and acceptable. But when RTT exceeds the threshold, the latency moves from “acceptable overhead” to “operational impact.”
The most immediate impact is on request-reply patterns. A service request from a leaf-connected client to a hub-connected responder requires at least two leafnode traversals: request out, reply back. At 100ms RTT, that’s a minimum 200ms added to every request — before the responder even processes it. If the client’s timeout is set to 1 second (a common default), a single RTT spike to 300ms consumes 600ms of the timeout budget just on network transit. Under load, these requests start timing out, and the application sees intermittent failures that correlate with nothing visible at the application layer.
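To make the budget concrete, here is a minimal Go sketch of a request from a leaf-connected client; the URL and subject are illustrative placeholders, not names from this check:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect through the leafnode (placeholder URL).
	nc, err := nats.Connect("nats://leaf-server:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// With ~100ms leaf RTT, roughly 200ms of this 1s budget is
	// consumed by network transit alone (request out, reply back),
	// leaving the hub-side responder well under 800ms to do its work.
	msg, err := nc.Request("svc.lookup", []byte("query"), 1*time.Second)
	if err != nil {
		log.Printf("request failed (check leaf RTT): %v", err)
		return
	}
	log.Printf("reply: %s", msg.Data)
}
```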
JetStream operations through leafnodes are even more sensitive. A publish with acknowledgment (js.Publish) requires the message to travel from the leaf to the hub, be committed by the stream’s Raft group, and the ack to travel back to the leaf. At high RTT, publish throughput drops proportionally because each publish-ack cycle includes the leafnode round trip. Batch publishing helps (see the sketch below), but the fundamental constraint is the speed-of-light delay plus any network overhead on the path.
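One way to amortize the round trip is to keep many publishes in flight and wait for the acknowledgments once, rather than paying one RTT per message. A minimal Go sketch using the client’s async publish API; the stream, subject, and payload are assumptions for illustration:

```go
// Assumes nc is a connected *nats.Conn and a stream captures "events.order".
js, err := nc.JetStream(nats.PublishAsyncMaxPending(256))
if err != nil {
	log.Fatal(err)
}

// Queue publishes without waiting for each ack individually;
// up to 256 acks may be outstanding at once.
for i := 0; i < 1000; i++ {
	if _, err := js.PublishAsync("events.order", []byte("payload")); err != nil {
		log.Fatal(err)
	}
}

// Wait once for all outstanding acks instead of once per message.
select {
case <-js.PublishAsyncComplete():
	// All acks received.
case <-time.After(10 * time.Second):
	log.Fatal("timed out waiting for publish acks")
}
```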
Geographic distance. The leafnode is physically far from the hub cluster. A leafnode in Singapore connecting to a hub in US-East inherently has 200ms+ RTT. This is physics, not a bug — but the check flags it so you can make architectural decisions accordingly.
Network congestion on the leafnode link. The network path between the leaf and hub is saturated. This is common when the leafnode shares an internet connection with other traffic, or when the link is a low-bandwidth WAN connection. RTT increases as packets queue behind other traffic.
VPN or tunnel overhead. Many leafnode deployments use VPN tunnels (WireGuard, IPsec, OpenVPN) for security. Each layer of encapsulation adds processing time, and VPN concentrators can become bottlenecks under load. Double-encapsulation (e.g., NATS TLS inside a VPN tunnel) compounds the overhead.
ISP routing inefficiency. The network path between leaf and hub may not be the shortest route. ISP peering arrangements can route traffic through distant exchange points, adding tens of milliseconds to what should be a short path. This is especially common on consumer-grade internet connections.
Overloaded leafnode server. If the leafnode server itself is under CPU or memory pressure, it responds slowly to NATS PING/PONG measurements, inflating the reported RTT. The network may be fine, but the server can’t process the ping fast enough.
Query the hub server’s leafnode connections:
```
curl -s "http://localhost:8222/leafz" | jq '.leafs[] | {name: .name, rtt: .rtt, account: .account}'
```

This shows the measured RTT for each leafnode connection. Compare against the threshold (default 100ms).
On the leafnode itself:
```
nats rtt
```

This measures the round-trip time from the leafnode’s perspective. Compare this with the hub-side measurement. If they differ significantly, there may be asymmetric routing or measurement issues.
Use standard network tools to isolate whether the latency is in the network or the NATS server:
```
# Basic ping
ping -c 20 <hub_server_ip>

# Traceroute to identify where latency accumulates
traceroute <hub_server_ip>

# MTR for continuous path analysis
mtr --report <hub_server_ip>
```

If ICMP ping shows similar latency to NATS RTT, the issue is network-level. If NATS RTT is significantly higher than ICMP ping, the NATS server is adding processing delay (check CPU/load on both sides).
A stable RTT (even if high) is less problematic than a highly variable one. Sustained high RTT is predictable; spiky RTT causes intermittent timeouts that are harder to diagnose:
```
# Monitor RTT over time
watch -n 5 'curl -s "http://localhost:8222/leafz" | jq ".leafs[] | {name: .name, rtt: .rtt}"'
```

If the leafnode is under resource pressure, RTT measurements will be inflated:

```
nats server info
```

Check CPU usage and connection count. A leafnode handling many client connections while running on limited hardware may not have capacity to respond to pings promptly.
Adjust client timeouts to account for leafnode RTT. Clients connected to a leafnode should have longer timeouts than clients connected directly to the hub:
```go
// Go — adjust timeouts for leaf-connected clients
nc, err := nats.Connect(leafURL,
    nats.Timeout(5*time.Second),       // Connection timeout
    nats.PingInterval(30*time.Second), // Less aggressive ping
    nats.MaxPingsOutstanding(3),       // More tolerance for missed pings
)
if err != nil {
    log.Fatal(err)
}

// For JetStream, increase publish ack timeout
js, _ := nc.JetStream(nats.PublishAsyncMaxPending(256))
_, err = js.Publish("events.order", data, nats.AckWait(5*time.Second))
```

```python
# Python — leaf-aware timeouts
import nats

nc = await nats.connect(
    "nats://leaf-server:4222",
    connect_timeout=5,
    ping_interval=30,
    max_outstanding_pings=3,
)

js = nc.jetstream()
ack = await js.publish("events.order", data, timeout=5.0)
```

Enable leafnode compression. If not already enabled, compression reduces the data volume on the leafnode link, which can reduce RTT by decreasing queuing delay on congested links:
```
# Leafnode configuration
leafnodes {
  remotes [{
    url: "nats://hub:7422"
    compression: s2_auto
  }]
}
```

Check VPN configuration. If using a VPN tunnel, ensure the MTU is set correctly to avoid fragmentation. Fragmented packets require reassembly, adding latency:

```
# Find the optimal MTU
ping -M do -s 1400 <hub_server_ip>
# Reduce size until pings succeed without fragmentation
```

Use a dedicated network link for the leafnode if it shares bandwidth with other traffic. Quality-of-service (QoS) rules can prioritize NATS traffic on the link, as in the sketch below.
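For example, on a Linux router you might place leafnode traffic in a high-priority band. A minimal sketch, assuming the default leafnode port 7422 and an interface named eth0 (both placeholders to adjust for your environment):

```
# Sketch: prioritize outbound leafnode traffic with Linux tc.
# Assumes interface eth0 and the default leafnode port 7422.
tc qdisc add dev eth0 root handle 1: prio
tc filter add dev eth0 parent 1: protocol ip u32 \
  match ip dport 7422 0xffff flowid 1:1
```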
Use gateways instead of leafnodes for cross-region connectivity. If two locations both have full NATS clusters, gateways provide cluster-to-cluster connectivity with better throughput characteristics than leafnodes. Leafnodes are ideal for edge locations with a single server; gateways are better for region-to-region.
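For reference, a gateway connection is configured on each cluster rather than on a single edge server. A minimal sketch, with cluster names, host, and port as assumptions:

```
# On each server of the "us-east" cluster (placeholder names and URLs)
gateway {
  name: "us-east"
  port: 7222
  gateways: [
    {name: "eu-west", url: "nats://eu-west-gw.example.com:7222"}
  ]
}
```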
Deploy local JetStream streams at the leaf. Instead of routing all JetStream operations through the hub, create streams on the leafnode for data that’s produced and consumed locally. Use NATS subject mapping or mirrors/sources to replicate only the subset of data that needs to reach the hub:
```
# Create a local stream on the leafnode for edge data
nats stream add local-events \
  --subjects "events.local.>" \
  --storage file \
  --retention limits \
  --max-age 24h
```

Implement store-and-forward patterns. For data flows that must cross the leafnode boundary, publish to a local stream and use a source or mirror to replicate to the hub asynchronously. This decouples the client’s publish latency from the leafnode RTT:

```
# On the hub, create a stream that sources from the leaf's stream.
# No --subjects: messages arrive only via the source, which avoids
# capturing the same messages twice across the leafnode link.
nats stream add hub-events \
  --source local-events
```

What counts as a good RTT depends on geography and network topology. Same-datacenter leafnodes should be under 5ms. Same-region (e.g., US-East to US-East) should be under 20ms. Cross-continent (US to Europe) is typically 80-120ms. The default threshold of 100ms is set to flag connections where latency is high enough to impact request-reply patterns and JetStream operations. Adjust the threshold based on your expected deployment topology.
High RTT can contribute to slow consumers, but only indirectly. It means the leafnode’s TCP connection to the hub drains more slowly. If the hub is sending high-volume traffic to subjects that cross the leafnode boundary, the hub-side buffer for the leafnode connection fills faster. In extreme cases, the leafnode connection itself can be flagged as a slow consumer and disconnected. This is rare, but possible at very high message rates combined with high RTT.
Leafnodes are designed for extending a single logical NATS system to remote locations, typically where you have one server at the edge. They’re lightweight and simple to configure. Gateways connect separate NATS clusters with independent identity, providing full cluster-to-cluster routing. If the remote location has a full cluster (3+ servers), gateways are typically better. If it’s a single server at an edge location, leafnodes are the right choice. High RTT affects both, but gateways handle it more gracefully because they have their own local cluster for client operations.
The threshold is configurable in Synadia Insights. If your architecture expects cross-region leafnodes with inherent latency, increase the threshold so the check doesn’t fire for expected conditions. If all your leafnodes are same-region and you want tighter monitoring, decrease the threshold. The goal is to alert on unexpected latency, not on expected geographic distance.