
NATS High Client RTT: What It Means and How to Fix It

Severity: Warning
Category: Performance
Applies to: Connection
Check ID: CONN_001
Detection threshold: client connection round-trip time exceeds configured maximum (default: 100ms)

High client RTT means the round-trip time between a NATS client connection and the server exceeds the configured threshold (default: 100ms). RTT is measured via NATS PING/PONG at the protocol level, so it captures the true end-to-end latency including network transit, TLS overhead, and any proxies in the path. High RTT directly increases the latency of every request-reply call, JetStream publish acknowledgment, and synchronous NATS operation the client performs.

Why this matters

Every synchronous NATS operation includes at least one round trip. A Request() call sends a message and waits for a reply — that’s one RTT minimum on each side (client to server, server to responder, responder to server, server back to client). At 100ms client RTT, a simple request-reply between two clients on the same server takes at least 200ms just in network transit, before either side processes anything. At 200ms RTT, you’re at 400ms. Client timeouts that seemed generous at 1 second suddenly leave very little room for actual processing.

JetStream operations are particularly affected. Every js.Publish() with acknowledgment waits for the server to commit the message to the stream’s Raft group and return an ack. The publish throughput ceiling for a single client is roughly 1 / RTT publishes per second for synchronous operations. At 10ms RTT, that’s 100 publishes/second. At 100ms, it drops to 10. Async publishing mitigates this for fire-and-forget patterns, but any workflow that needs publish confirmation is bound by RTT.
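That ceiling is simple arithmetic, and it is worth sanity-checking against your own numbers. A minimal sketch (the function name and figures are illustrative, not part of any NATS library):

```python
def sync_publish_ceiling(rtt_ms: float) -> float:
    """Upper bound on synchronous publishes/second for one client:
    each publish blocks for a full round trip waiting on its ack."""
    return 1000.0 / rtt_ms

print(sync_publish_ceiling(10))   # 10ms RTT  -> 100.0 publishes/s
print(sync_publish_ceiling(100))  # 100ms RTT -> 10.0 publishes/s
```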

High client RTT also increases slow consumer risk. The server writes messages to the client’s TCP connection. When RTT is high, the TCP window fills more slowly because acknowledgments take longer to return. At high message rates, the server-side buffer for the connection fills faster than the client can drain it — and once the buffer exceeds the pending limit, the server disconnects the client as a slow consumer (SERVER_004). This makes high client RTT a leading indicator of slow consumer events: fixing RTT preemptively prevents the more disruptive disconnection.
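To get a feel for how quickly this plays out, here is a rough back-of-the-envelope model. It assumes a 64 MiB pending limit (the nats-server default in recent versions) and steady produce and drain rates; the helper and the numbers are illustrative only:

```python
def seconds_to_slow_consumer(pending_limit_bytes: int,
                             produce_rate_bps: float,
                             drain_rate_bps: float) -> float:
    """Time until the server-side buffer for one connection crosses the
    pending limit, given a steady surplus of produced over drained bytes."""
    surplus = produce_rate_bps - drain_rate_bps
    if surplus <= 0:
        return float("inf")  # client keeps up; the buffer never fills
    return pending_limit_bytes / surplus

# 64 MiB limit, server pushing 20 MB/s while a high-RTT client drains 10 MB/s:
print(seconds_to_slow_consumer(64 * 1024 * 1024, 20e6, 10e6))  # ~6.7 seconds
```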

Common causes

  • Client in a different region than the server. The most common and most straightforward cause. A client in Europe connecting to a server in US-East has 80-120ms RTT due to physical distance. This is not a bug — it’s geography — but it affects performance, and the operator should decide whether to accept it or move the client closer.

  • Network congestion between client and server. The network path is congested, adding queuing delay on top of propagation delay. Common in shared network environments, during peak traffic hours, or when the client shares a link with bandwidth-heavy workloads (backups, large file transfers).

  • Proxy or load balancer in the connection path. HTTP proxies, TCP load balancers, or service meshes (Istio, Linkerd) between the client and server add processing time to every packet. Each hop adds latency, and some proxies buffer data, adding additional delay. NATS connections are long-lived TCP connections — a proxy optimized for short-lived HTTP requests may perform poorly.

  • TLS handshake and encryption overhead. TLS adds latency at connection establishment (handshake) and ongoing processing (encrypt/decrypt). On resource-constrained devices — IoT gateways, embedded systems, older hardware — TLS processing can measurably increase RTT. The PING/PONG measurement includes this processing time.

  • Client host under resource pressure. If the client machine is CPU-saturated, memory-swapping, or experiencing high I/O wait, it responds slowly to server PINGs. The measured RTT reflects client-side processing delay, not just network latency. This is especially common in containerized environments where CPU limits throttle the process.

  • DNS resolution on reconnect. After a disconnect, the client resolves the server hostname before reconnecting. Slow DNS adds seconds (not milliseconds) to reconnection, and during that window the client appears to have infinite RTT. Once connected, DNS is not a factor in ongoing RTT — but it can cause elevated average RTT in monitoring if reconnects are frequent.

How to diagnose

Measure your own client RTT

From a client machine, measure the RTT to the server:

nats rtt

This performs a NATS-level PING/PONG measurement, showing the true application-layer RTT including TLS and any intermediate hops.

Find high-RTT clients across the cluster

On the server side, identify which clients have the highest RTT:

nats server report connections --sort rtt

This lists client connections sorted by RTT, worst first. Focus on connections above the threshold (default 100ms). Note the client name, account, and server — this tells you which clients are affected and where they’re connected.
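If you scrape these server-side reports programmatically, triage is a one-liner. A hypothetical helper (the data shape and client names are invented for illustration):

```python
def over_threshold(connections, threshold_ms=100.0):
    """Return (client_name, rtt_ms) pairs above the threshold, worst first.
    `connections` is a list of (client_name, rtt_ms) tuples gathered from
    server connection reports."""
    hot = [c for c in connections if c[1] > threshold_ms]
    return sorted(hot, key=lambda c: c[1], reverse=True)

conns = [("order-api", 12.0), ("eu-batch", 180.0),
         ("edge-gw", 95.0), ("ml-ingest", 240.0)]
print(over_threshold(conns))  # -> [('ml-ingest', 240.0), ('eu-batch', 180.0)]
```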

Compare NATS RTT with network RTT

Determine whether the latency is network-level or NATS-level:

# Network-level ping from the client host
ping -c 20 <nats_server_ip>
# Compare with NATS-level RTT
nats rtt

If ICMP ping is significantly lower than NATS RTT, the difference is TLS overhead, proxy overhead, or client-side processing delay. If they’re similar, the issue is purely network latency.

Check for proxies in the connection path

If clients connect through a load balancer or proxy:

# Check the client's connection server info
nats server info
# Compare the server IP the client sees vs. the actual server IP
# If they differ, a proxy is in the path

Check whether high RTT is constant or intermittent:

# Continuous monitoring
watch -n 5 'nats server report connections --sort rtt | head -20'

Constant high RTT suggests geographic distance or a persistent network issue. Intermittent spikes suggest congestion, resource pressure, or GC pauses on the client.
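One way to make that distinction mechanical is to compare the best, typical, and worst samples over a monitoring window. A rough heuristic sketch; the cutoffs are illustrative, not part of the check:

```python
from statistics import mean

def classify_rtt(samples_ms, threshold_ms=100.0, jitter_ratio=2.0):
    """Crude classification of an RTT sample series:
    'constant-high' when even the best sample exceeds the threshold,
    'intermittent' when the worst spike dwarfs the typical sample,
    'healthy' otherwise."""
    if min(samples_ms) > threshold_ms:
        return "constant-high"
    if max(samples_ms) > jitter_ratio * mean(samples_ms):
        return "intermittent"
    return "healthy"

print(classify_rtt([120, 130, 125, 140]))  # constant-high: likely geography
print(classify_rtt([10, 12, 11, 200, 9]))  # intermittent: congestion or GC
print(classify_rtt([10, 12, 11, 13]))      # healthy
```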

How to fix it

Immediate: connect clients to the nearest server

Use the cluster’s nearest server. NATS clusters, leafnodes, and gateways let you deploy servers close to clients. Connect clients to the nearest server rather than routing everything through a central cluster:

// Go — connect to the nearest server with fallback
nc, err := nats.Connect(
    "nats://local-server:4222,nats://remote-server:4222",
    nats.Name("order-processor"),
    nats.RetryOnFailedConnect(true),
    nats.MaxReconnects(-1),
    nats.ReconnectWait(2*time.Second),
)
if err != nil {
    log.Fatalf("nats connect: %v", err)
}

# Python — multi-server URL with nearest first
import nats

nc = await nats.connect(
    servers=["nats://local-server:4222", "nats://remote-server:4222"],
    name="order-processor",
    max_reconnect_attempts=-1,
    reconnect_time_wait=2,
)

Short-term: remove unnecessary network hops

Bypass proxies where possible. If clients connect through a TCP proxy or load balancer that isn’t required (e.g., NATS client libraries already handle load balancing via the cluster URL list), connect directly to the NATS servers. The client library discovers all cluster members via the INFO protocol and distributes connections automatically.

Check container resource limits. If high RTT correlates with CPU throttling in Kubernetes:

# Check if the pod is being CPU-throttled
kubectl top pod <pod_name>
kubectl describe pod <pod_name> | grep -A 2 "Limits"

If CPU limits are too restrictive, the client process can’t respond to PINGs promptly. Increase CPU limits or requests to give the process headroom.

Optimize TLS configuration. If TLS overhead is a factor, ensure you’re using TLS 1.3 (faster handshake) and modern cipher suites. On resource-constrained devices, consider whether mutual TLS (mTLS) is required or if server-only TLS is sufficient for your security model.

Long-term: architect for latency

Deploy leafnodes at edge locations. Instead of connecting remote clients directly to a central cluster, deploy a leafnode server at each edge location. Clients connect to the local leafnode with sub-millisecond RTT. The leafnode handles the high-RTT connection to the hub, and latency-sensitive operations (local pub/sub, request-reply between co-located services) stay fast:

# Leafnode configuration at edge location
server_name: "edge-chicago"
port: 4222

leafnodes {
  remotes [{
    url: "nats://hub-cluster:7422"
    credentials: "/etc/nats/edge.creds"
  }]
}

Use async publishing for JetStream. For workloads where publish-ack round-trip time is the bottleneck, switch to async publishing:

// Go — async JetStream publish to avoid RTT-per-message bottleneck
js, _ := nc.JetStream(nats.PublishAsyncMaxPending(256))
for _, event := range events {
    _, err := js.PublishAsync("events.order", event)
    if err != nil {
        log.Printf("Publish error: %v", err)
    }
}
select {
case <-js.PublishAsyncComplete():
    log.Println("All publishes confirmed")
case <-time.After(10 * time.Second):
    log.Println("Timeout waiting for publish confirmations")
}

Implement request-reply timeouts proportional to RTT. Clients with high RTT need longer timeouts. Rather than using a global default, set timeouts based on expected RTT:

// Adjust request timeout based on measured conditions
timeout := 2 * time.Second // Default for local connections
if isRemote {
    timeout = 5 * time.Second // Remote clients need more headroom
}
reply, err := nc.Request("service.check", data, timeout)
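The same idea can be driven by measured RTT rather than a static local/remote flag, for example from the connection's last RTT sample (nats.go exposes one via `nc.RTT()`). A sketch of such a scaling policy in Python; the function name and all constants are illustrative:

```python
def request_timeout_ms(rtt_ms: float,
                       processing_budget_ms: float = 1500.0,
                       rtt_multiplier: float = 4.0,
                       floor_ms: float = 2000.0) -> float:
    """Timeout scaled from measured RTT: a few round trips of headroom
    plus the expected processing budget, never below a sane floor."""
    return max(floor_ms, rtt_multiplier * rtt_ms + processing_budget_ms)

print(request_timeout_ms(5.0))    # local client: floor wins -> 2000.0 ms
print(request_timeout_ms(200.0))  # remote client: 4*200 + 1500 -> 2300.0 ms
```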

Frequently asked questions

What’s the difference between High Client RTT (CONN_001) and High Server RTT (SERVER_010)?

SERVER_010 measures RTT on route and gateway connections between NATS servers — the infrastructure plane. CONN_001 measures RTT on client connections — the application plane. They have different default thresholds: SERVER_010 defaults to 50ms because server-to-server connections are typically within or between data centers, while CONN_001 defaults to 100ms because client connections have more variability (different locations, networks, devices). Both indicate latency problems, but they affect different things: server RTT affects cluster replication and message routing, client RTT affects application-perceived performance.

Does high RTT affect core NATS (non-JetStream) subscribers?

For pure pub/sub without request-reply, the impact is minimal. Messages flow from publisher to server to subscriber asynchronously — RTT doesn’t add per-message latency because there’s no acknowledgment cycle. The subscriber receives messages after a one-way delay (roughly half the RTT), but there’s no round-trip wait. However, high RTT does increase slow consumer risk at high message rates because the TCP feedback loop is slower, and the server’s outbound buffer fills before the client can acknowledge received data at the TCP level.

How does client RTT interact with JetStream ack_wait?

JetStream consumers have an ack_wait timeout — if the consumer doesn’t acknowledge a message within this window, it’s redelivered. The ack must travel from the consumer to the server, so the effective processing time available is ack_wait - RTT. At 100ms RTT, a 30-second ack_wait still leaves 29.9 seconds for processing. But if ack_wait is set aggressively low (e.g., 1 second) and RTT is 200ms, only 800ms remains for processing, leading to unnecessary redeliveries and high redelivery rates (OPT_SYS_002).
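The budget arithmetic is worth making explicit, since it is what ties this check to redelivery behavior. A tiny illustration (the helper name is mine, not part of the JetStream API):

```python
def effective_processing_budget(ack_wait_s: float, rtt_s: float) -> float:
    """Time actually available to process a message: the ack must still
    travel back to the server inside the ack_wait window."""
    return ack_wait_s - rtt_s

print(effective_processing_budget(30.0, 0.1))  # 29.9s: plenty of headroom
print(effective_processing_budget(1.0, 0.2))   # 0.8s: high redelivery risk
```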

Should I alert on every high-RTT client?

Not necessarily. Some high-RTT clients are expected — mobile apps, IoT devices, remote office workers. The check exists to surface connections where high RTT is unexpected or where it’s degrading performance. If you have legitimate remote clients, consider adjusting the threshold per-account or filtering by client name pattern. Focus alerting on clients that should be low-latency (same-datacenter services, critical microservices) but aren’t.

Proactive monitoring for NATS High Client RTT with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial