High client RTT means the round-trip time between a NATS client connection and the server exceeds the configured threshold (default: 100ms). RTT is measured via NATS PING/PONG at the protocol level, so it captures the true end-to-end latency including network transit, TLS overhead, and any proxies in the path. High RTT directly increases the latency of every request-reply call, JetStream publish acknowledgment, and synchronous NATS operation the client performs.
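The same measurement is available to application code. As a minimal sketch (the connection name, polling interval, and the 100ms warning threshold are assumptions mirroring the check's default), a Go client can sample its own RTT with the connection's RTT() method, which performs a protocol-level PING/PONG:

```go
// Sketch: periodically sample client RTT with the Go client's Conn.RTT(),
// which sends a PING and waits for the PONG on the existing connection.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL, nats.Name("rtt-probe"))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	for range time.Tick(30 * time.Second) {
		rtt, err := nc.RTT() // includes network transit, TLS, and any proxies
		if err != nil {
			log.Printf("RTT measurement failed: %v", err)
			continue
		}
		if rtt > 100*time.Millisecond { // assumed threshold, matches the check default
			log.Printf("high client RTT: %v", rtt)
		}
	}
}
```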
Every synchronous NATS operation includes at least one round trip. A Request() call sends a message and waits for a reply — that’s one RTT minimum on each side (client to server, server to responder, responder to server, server back to client). At 100ms client RTT, a simple request-reply between two clients on the same server takes at least 200ms just in network transit, before either side processes anything. At 200ms RTT, you’re at 400ms. Client timeouts that seemed generous at 1 second suddenly leave very little room for actual processing.
JetStream operations are particularly affected. Every js.Publish() with acknowledgment waits for the server to commit the message to the stream’s Raft group and return an ack. The publish throughput ceiling for a single client is roughly 1 / RTT publishes per second for synchronous operations. At 10ms RTT, that’s 100 publishes/second. At 100ms, it drops to 10. Async publishing mitigates this for fire-and-forget patterns, but any workflow that needs publish confirmation is bound by RTT.
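To see that ceiling in practice, here is a rough sketch (the subject, payload, and batch size are assumptions, and nc is an established connection) that times a batch of synchronous publishes; the reported rate converges on roughly 1 / RTT:

```go
// Sketch: measure synchronous JetStream publish throughput.
// Each js.Publish blocks until the server returns an ack, so the loop is RTT-bound.
js, _ := nc.JetStream()

const n = 100
start := time.Now()
for i := 0; i < n; i++ {
	if _, err := js.Publish("events.order", []byte("payload")); err != nil {
		log.Printf("publish failed: %v", err)
	}
}
elapsed := time.Since(start)
log.Printf("%d synchronous publishes in %v (~%.0f msgs/sec)",
	n, elapsed, float64(n)/elapsed.Seconds())
```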
High client RTT also increases slow consumer risk. The server writes messages to the client’s TCP connection. When RTT is high, the TCP window fills more slowly because acknowledgments take longer to return. At high message rates, the server-side buffer for the connection fills faster than the client can drain it — and once the buffer exceeds the pending limit, the server disconnects the client as a slow consumer (SERVER_004). This makes high client RTT a leading indicator of slow consumer events: fixing RTT preemptively prevents the more disruptive disconnection.
Client in a different region than the server. The most common and most straightforward cause. A client in Europe connecting to a server in US-East has 80-120ms RTT due to physical distance. This is not a bug — it’s geography — but it affects performance, and the operator should decide whether to accept it or move the client closer.
Network congestion between client and server. The network path is congested, adding queuing delay on top of propagation delay. Common in shared network environments, during peak traffic hours, or when the client shares a link with bandwidth-heavy workloads (backups, large file transfers).
Proxy or load balancer in the connection path. HTTP proxies, TCP load balancers, or service meshes (Istio, Linkerd) between the client and server add processing time to every packet. Each hop adds latency, and some proxies buffer data, adding additional delay. NATS connections are long-lived TCP connections — a proxy optimized for short-lived HTTP requests may perform poorly.
TLS handshake and encryption overhead. TLS adds latency at connection establishment (handshake) and ongoing processing (encrypt/decrypt). On resource-constrained devices — IoT gateways, embedded systems, older hardware — TLS processing can measurably increase RTT. The PING/PONG measurement includes this processing time.
Client host under resource pressure. If the client machine is CPU-saturated, memory-swapping, or experiencing high I/O wait, it responds slowly to server PINGs. The measured RTT reflects client-side processing delay, not just network latency. This is especially common in containerized environments where CPU limits throttle the process.
DNS resolution on reconnect. After a disconnect, the client resolves the server hostname before reconnecting. Slow DNS adds seconds (not milliseconds) to reconnection, and during that window the client appears to have infinite RTT. Once connected, DNS is not a factor in ongoing RTT — but it can cause elevated average RTT in monitoring if reconnects are frequent.
From a client machine, measure the RTT to the server:
```
nats rtt
```
This performs a NATS-level PING/PONG measurement, showing the true application-layer RTT including TLS and any intermediate hops.
On the server side, identify which clients have the highest RTT:
```
nats server list
```
This lists all connections sorted by RTT. Focus on connections above the threshold (default 100ms). Note the client name, account, and server — this tells you which clients are affected and where they’re connected.
Determine whether the latency is network-level or NATS-level:
```
# Network-level ping from the client host
ping -c 20 <nats_server_ip>

# Compare with NATS-level RTT
nats rtt
```
If ICMP ping is significantly lower than NATS RTT, the difference is TLS overhead, proxy overhead, or client-side processing delay. If they’re similar, the issue is purely network latency.
If clients connect through a load balancer or proxy:
```
# Check the client's connection server info
nats server info

# Compare the server IP the client sees vs. the actual server IP
# If they differ, a proxy is in the path
```
Check whether high RTT is constant or intermittent:
```
# Continuous monitoring
watch -n 5 'nats server list | head -20'
```
Constant high RTT suggests geographic distance or a persistent network issue. Intermittent spikes suggest congestion, resource pressure, or GC pauses on the client.
Use the cluster’s nearest server. NATS clusters, leafnodes, and gateways let you deploy servers close to clients. Connect clients to the nearest server rather than routing everything through a central cluster:
```go
// Go — connect to the nearest server with fallback
nc, err := nats.Connect(
	"nats://local-server:4222,nats://remote-server:4222",
	nats.Name("order-processor"),
	nats.RetryOnFailedConnect(true),
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2*time.Second),
)
```

```python
# Python — multi-server URL with nearest first
import nats

nc = await nats.connect(
    servers=["nats://local-server:4222", "nats://remote-server:4222"],
    name="order-processor",
    max_reconnect_attempts=-1,
    reconnect_time_wait=2,
)
```
Bypass proxies where possible. If clients connect through a TCP proxy or load balancer that isn’t required (e.g., NATS client libraries already handle load balancing via the cluster URL list), connect directly to the NATS servers. The client library discovers all cluster members via the INFO protocol and distributes connections automatically.
Check container resource limits. If high RTT correlates with CPU throttling in Kubernetes:
```
# Check if the pod is being CPU-throttled
kubectl top pod <pod_name>
kubectl describe pod <pod_name> | grep -A 2 "Limits"
```
If CPU limits are too restrictive, the client process can’t respond to PINGs promptly. Increase CPU limits or requests to give the process headroom.
Optimize TLS configuration. If TLS overhead is a factor, ensure you’re using TLS 1.3 (faster handshake) and modern cipher suites. On resource-constrained devices, consider whether mutual TLS (mTLS) is required or if server-only TLS is sufficient for your security model.
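On the client side, a minimal sketch in Go (the server URL and CA bundle path are placeholders) that pins the minimum TLS version to 1.3 using the client's Secure option:

```go
// Sketch: require TLS 1.3 on the client connection.
// Requires "crypto/tls"; the URL and CA path below are assumptions.
tlsConfig := &tls.Config{
	MinVersion: tls.VersionTLS13,
}
nc, err := nats.Connect(
	"tls://nats.example.com:4222",
	nats.Secure(tlsConfig),
	nats.RootCAs("/etc/nats/ca.pem"), // CA bundle path is a placeholder
)
if err != nil {
	log.Fatal(err)
}
defer nc.Drain()
```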
Deploy leafnodes at edge locations. Instead of connecting remote clients directly to a central cluster, deploy a leafnode server at each edge location. Clients connect to the local leafnode with sub-millisecond RTT. The leafnode handles the high-RTT connection to the hub, and latency-sensitive operations (local pub/sub, request-reply between co-located services) stay fast:
```
# Leafnode configuration at edge location
server_name: "edge-chicago"
port: 4222

leafnodes {
  remotes [{
    url: "nats://hub-cluster:7422"
    credentials: "/etc/nats/edge.creds"
  }]
}
```
Use async publishing for JetStream. For workloads where publish-ack round-trip time is the bottleneck, switch to async publishing:
```go
// Go — async JetStream publish to avoid RTT-per-message bottleneck
js, _ := nc.JetStream(nats.PublishAsyncMaxPending(256))
for _, event := range events {
	_, err := js.PublishAsync("events.order", event)
	if err != nil {
		log.Printf("Publish error: %v", err)
	}
}
select {
case <-js.PublishAsyncComplete():
	log.Println("All publishes confirmed")
case <-time.After(10 * time.Second):
	log.Println("Timeout waiting for publish confirmations")
}
```
Implement request-reply timeouts proportional to RTT. Clients with high RTT need longer timeouts. Rather than using a global default, set timeouts based on expected RTT:
```go
// Adjust request timeout based on measured conditions
timeout := 2 * time.Second // Default for local connections
if isRemote {
	timeout = 5 * time.Second // Remote clients need more headroom
}
reply, err := nc.Request("service.check", data, timeout)
```
SERVER_010 measures RTT on route and gateway connections between NATS servers — the infrastructure plane. CONN_001 measures RTT on client connections — the application plane. They have different default thresholds: SERVER_010 defaults to 50ms because server-to-server connections are typically within or between data centers, while CONN_001 defaults to 100ms because client connections have more variability (different locations, networks, devices). Both indicate latency problems, but they affect different things: server RTT affects cluster replication and message routing, client RTT affects application-perceived performance.
For pure pub/sub without request-reply, the impact is minimal. Messages flow from publisher to server to subscriber asynchronously — RTT doesn’t add per-message latency because there’s no acknowledgment cycle. The subscriber receives messages after a one-way delay (roughly half the RTT), but there’s no round-trip wait. However, high RTT does increase slow consumer risk at high message rates because the TCP feedback loop is slower, and the server’s outbound buffer fills before the client can acknowledge received data at the TCP level.
JetStream consumers have an ack_wait timeout — if the consumer doesn’t acknowledge a message within this window, it’s redelivered. The ack must travel from the consumer to the server, so the effective processing time available is ack_wait - RTT. At 100ms RTT, a 30-second ack_wait still leaves 29.9 seconds for processing. But if ack_wait is set aggressively low (e.g., 1 second) and RTT is 200ms, only 800ms remains for processing, leading to unnecessary redeliveries and high redelivery rates (OPT_SYS_002).
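As a hedged sketch (the subject, durable name, measured RTT figure, and processing budget are all assumptions, and nc is an established connection), a consumer's ack_wait can be sized with explicit headroom for RTT rather than left at an aggressive default:

```go
// Sketch: size AckWait so processing time plus the ack's travel time
// fits comfortably inside the window.
js, _ := nc.JetStream()

expectedRTT := 200 * time.Millisecond // measured client RTT (assumption)
processingBudget := 5 * time.Second   // worst-case handler time (assumption)

sub, err := js.PullSubscribe(
	"events.order",
	"order-workers",
	nats.AckWait(processingBudget+10*expectedRTT), // generous headroom over RTT
)
if err != nil {
	log.Fatal(err)
}

msgs, _ := sub.Fetch(10, nats.MaxWait(2*time.Second))
for _, m := range msgs {
	// process the message, then ack within the window
	m.Ack()
}
```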
Not necessarily. Some high-RTT clients are expected — mobile apps, IoT devices, remote office workers. The check exists to surface connections where high RTT is unexpected or where it’s degrading performance. If you have legitimate remote clients, consider adjusting the threshold per-account or filtering by client name pattern. Focus alerting on clients that should be low-latency (same-datacenter services, critical microservices) but aren’t.