NATS Route Pending Pressure: What It Means and How to Fix It

Severity: Warning
Category: Performance
Applies to: System Improvement
Check ID: OPT_SYS_005
Detection threshold: route connection pending bytes > 1 MiB

Route Pending Pressure means a route connection between two NATS cluster peers has accumulated more than 1 MiB of data waiting to be written to the network. The sending server is producing inter-server traffic faster than the network or the receiving server can consume it — the route is becoming a bottleneck for intra-cluster communication.

Why this matters

Route connections carry all intra-cluster traffic: message forwarding for subscriptions on remote servers, Raft consensus operations for JetStream, and subscription interest propagation. When pending data accumulates on a route, it means the entire communication channel between two servers is backing up.

The consequences cascade through the cluster. Messages destined for subscribers on the remote server experience increasing latency as they wait in the pending buffer. Raft heartbeats and append-entries queued behind message data arrive late, potentially triggering unnecessary leader elections. If the pending buffer continues to grow, the server may eventually mark the route as a slow consumer and disconnect it — losing inter-server connectivity temporarily until the route reconnects.

Route pending pressure is often an early indicator of a deeper problem. It can signal network bandwidth saturation between servers, CPU contention preventing the remote server from reading fast enough, or a workload pattern that generates more inter-server traffic than the infrastructure can handle. Addressing route pending pressure early prevents escalation to route disconnections, leader elections, and cascading cluster instability.

Common causes

  • Network bandwidth saturation between peers. The aggregate message rate across all subjects that route between two servers exceeds the available network bandwidth. This is common in cloud environments with burstable network performance — baseline bandwidth may be sufficient, but burst capacity is exhausted during traffic peaks.

  • High-fanout subjects with cross-server subscribers. A subject like events.> has subscribers on multiple servers. Every message published to that subject is forwarded to each server with subscribers. High-volume subjects with wide cluster-level fan-out multiply the traffic on routes proportionally.

  • Large message payloads. Subjects carrying large messages (>100 KB) fill the route pending buffer faster than subjects carrying small messages at the same message rate. A subject with 1,000 msg/s at 1 MB each generates ~1 GB/s of route traffic per destination server.

  • Remote server CPU contention. The receiving server is under heavy CPU load and cannot read from the route connection fast enough. The sending server’s outbound buffer fills because the remote peer’s read loop is delayed by CPU-bound work — message routing, Raft processing, or garbage collection.

  • Raft write amplification. In clusters with many R3 JetStream streams, Raft replication traffic multiplies message volume. Each publish to an R3 stream generates append-entries messages to two follower servers. With hundreds of streams, the aggregate Raft traffic on routes can be substantial.

  • Network latency or packet loss. Elevated RTT between servers reduces TCP throughput (bounded by bandwidth-delay product). Packet loss triggers TCP retransmissions and congestion window reduction, further lowering effective throughput. Even moderate packet loss (0.1%) can significantly impact route throughput under load.
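Two of the figures above can be sanity-checked with quick arithmetic: the route load a publisher generates, and the TCP throughput ceiling implied by the bandwidth-delay product. A back-of-envelope sketch (the window size and RTT are illustrative, not measured values):

```python
# Back-of-envelope route load vs. the TCP ceiling (illustrative numbers).

def route_traffic_bps(msg_rate, payload_bytes, remote_servers_with_interest):
    # One copy of each message is forwarded per remote server with subscribers.
    return msg_rate * payload_bytes * remote_servers_with_interest

def tcp_ceiling_bps(window_bytes, rtt_seconds):
    # Bandwidth-delay product bound: a single TCP connection cannot
    # sustain more than window / RTT.
    return window_bytes / rtt_seconds

# 1,000 msg/s at 1 MB each to one remote server: ~1 GB/s of route traffic.
load = route_traffic_bps(1_000, 1_000_000, 1)

# A 4 MiB window over a 10 ms RTT caps the route near 419 MB/s,
# well below the load above, so pending data accumulates.
ceiling = tcp_ceiling_bps(4 * 1024 * 1024, 0.010)

print(load, int(ceiling), load > ceiling)
```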

How to diagnose

Check route connection pending bytes

curl -s 'http://localhost:8222/routez' | jq '.routes | sort_by(.pending_size) | reverse | .[:20][] | {rid, remote_id, pending_size}'

Route connections are reported by the /routez endpoint (/connz lists only client connections). Any route whose pending_size exceeds 1 MiB (1,048,576 bytes) is flagged by this check.

Inspect per-route details via monitoring endpoint

curl -s 'http://localhost:8222/routez?subs=detail' | jq '.routes[] | {rid, remote_id, ip, port, pending_size, rtt, in_msgs, out_msgs, in_bytes, out_bytes}'

This shows per-route pending bytes alongside throughput metrics. Compare out_bytes with pending_size — if pending is a significant fraction of recent throughput, the route is congested.
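The same check can be scripted against the monitoring endpoint. A minimal sketch, assuming the per-route pending counter appears as pending_size in the /routez JSON (verify the field name against your server version); the sample payload below is invented for illustration:

```python
import json

MIB = 1024 * 1024  # OPT_SYS_005 threshold: 1 MiB of pending route data

def congested_routes(routez_json, threshold=MIB):
    # Return (remote_id, pending bytes) for every route over the threshold.
    routes = json.loads(routez_json).get("routes", [])
    return [(r.get("remote_id", "?"), r.get("pending_size", 0))
            for r in routes if r.get("pending_size", 0) > threshold]

# Invented sample shaped like a /routez response:
sample = json.dumps({"routes": [
    {"remote_id": "server2", "pending_size": 3_500_000},
    {"remote_id": "server3", "pending_size": 12_000},
]})

print(congested_routes(sample))  # only server2 is over 1 MiB
```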

Check network bandwidth between servers

# Test available bandwidth between cluster peers
iperf3 -c <remote_server_ip> -t 10

Compare the measured bandwidth with the route traffic volume. If route traffic approaches the available bandwidth, network saturation is the bottleneck.
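To make that comparison concrete, convert the route's byte rate into link utilization. A small sketch (the traffic and link figures are hypothetical):

```python
def route_utilization(route_bytes_per_sec, link_bits_per_sec):
    # Fraction of the measured link consumed by route traffic (bytes -> bits).
    return (route_bytes_per_sec * 8) / link_bits_per_sec

# ~110 MB/s of route traffic on a 1 Gbit/s link measured by iperf3:
u = route_utilization(110_000_000, 1_000_000_000)
print(round(u, 2))  # 0.88 -- sustained utilization near 1.0 means saturation
```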

Identify high-volume cross-server subjects

# Check per-subject message rates
nats server report accounts

Look for subjects with high message rates that have subscribers across multiple servers. These subjects generate the most route traffic.

Check the remote server’s CPU and processing capacity

nats server list

If the remote server (the one receiving route traffic) shows high CPU, it may not be reading from routes fast enough, causing backpressure on the sending side.

How to fix it

Immediate: reduce route traffic volume

Reduce fan-out on high-volume subjects. If a subject has subscribers on every server but only needs to be processed by one instance, use a queue group to limit message forwarding:

# Queue groups ensure each message is delivered to one subscriber,
# reducing inter-server traffic
nats sub "events.>" --queue event-processors

Move publishers and subscribers to the same server. If a high-volume subject’s publisher is on server A and all subscribers are on server B, every message crosses the route. Co-locating them eliminates route traffic for that subject entirely. Use client connection URLs to target specific servers.
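The effect of a queue group on route traffic can be modeled simply: plain subscriptions fan out one copy per remote server with matching subscribers, while a queue group delivers each message to a single member, so at most one remote copy is sent (none if a local member is chosen). A simplified sketch of that model:

```python
def route_copies(remote_servers_with_subs, queue_group=False):
    # Simplified model: plain subscriptions fan out to every interested
    # remote server; a queue group sends at most one remote copy.
    if queue_group:
        return min(1, remote_servers_with_subs)
    return remote_servers_with_subs

print(route_copies(4), route_copies(4, queue_group=True))  # 4 vs 1
```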

Short-term: increase network capacity and optimize

Upgrade network bandwidth between cluster peers. In cloud environments, switch to instance types with higher baseline network bandwidth. In on-premises deployments, bond multiple NICs or upgrade to faster interconnects:

# Verify current link speed
ethtool eth0 | grep Speed

Enable route compression. Route connections can use S2 compression to reduce bandwidth consumption. In most cases, S2 compression significantly reduces the bytes on the wire with minimal CPU overhead:

cluster {
  listen: 0.0.0.0:6222
  compression: s2_auto
  routes: [
    nats://server2:6222
    nats://server3:6222
  ]
}

Reduce JetStream replication where possible. Streams that don’t require high availability can use R1 instead of R3, eliminating Raft replication traffic for those streams:

nats stream edit <stream_name> --replicas 1
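The savings can be estimated with the same kind of arithmetic: each publish to an R-replica stream is forwarded to R-1 followers over routes. A rough sketch (the publish rate and message size are hypothetical):

```python
def replication_route_bps(publish_rate, msg_bytes, replicas):
    # Raft append-entries copies sent over routes: one per follower (R - 1).
    return publish_rate * msg_bytes * (replicas - 1)

r3 = replication_route_bps(5_000, 4_096, 3)  # ~41 MB/s of follower traffic
r1 = replication_route_bps(5_000, 4_096, 1)  # 0 -- no replication traffic
print(r3, r1)
```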

Long-term: design for sustainable inter-server throughput

Partition workloads by server affinity. Design subject hierarchies and stream placement to minimize cross-server traffic. Use placement tags to co-locate related streams and their consumers:

// Go: place the stream where its consumers connect
js, _ := nc.JetStream()
_, err := js.AddStream(&nats.StreamConfig{
    Name:     "REGIONAL_ORDERS",
    Subjects: []string{"orders.us-east.>"},
    Placement: &nats.Placement{
        Tags: []string{"us-east"},
    },
    Replicas: 3,
})

# Python: place the stream where its consumers connect
from nats.js.api import StreamConfig, Placement

await js.add_stream(StreamConfig(
    name="REGIONAL_ORDERS",
    subjects=["orders.us-east.>"],
    placement=Placement(tags=["us-east"]),
    num_replicas=3,
))

Use dedicated network for cluster traffic. Separate route traffic from client traffic using different network interfaces. This prevents client workload from competing with inter-server replication for bandwidth:

cluster {
  listen: 10.0.1.1:6222   # dedicated cluster network
  routes: [
    nats://10.0.1.2:6222
    nats://10.0.1.3:6222
  ]
}

listen: 10.0.0.1:4222     # client network

Monitor route pending continuously. Alert before the 1 MiB threshold to catch trends early.
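One way to alert ahead of the threshold is a two-level rule: warn at a fraction of 1 MiB, go critical at the check's own threshold. A minimal sketch (the 50% warning fraction is an arbitrary choice):

```python
MIB = 1024 * 1024  # the OPT_SYS_005 detection threshold

def pending_severity(pending_bytes, warn_fraction=0.5):
    # Warn before the check fires so the trend can be investigated early.
    if pending_bytes > MIB:
        return "critical"
    if pending_bytes > warn_fraction * MIB:
        return "warning"
    return "ok"

print(pending_severity(300_000), pending_severity(700_000), pending_severity(2 * MIB))
```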

Synadia Insights evaluates route pending pressure every collection epoch with per-route attribution, so you can identify exactly which server pair is experiencing backpressure.

Frequently asked questions

What happens if route pending pressure isn’t addressed?

If the pending buffer continues to grow, the server eventually disconnects the route as a slow consumer to protect its own resources. A route disconnection means temporary loss of inter-server connectivity — messages for subscribers on the remote server are not delivered, and Raft replication stalls. The route auto-reconnects, but the disconnection-reconnection cycle can destabilize Raft leader elections and cause JetStream write stalls.

How is route pending pressure different from client pending pressure?

Route pending pressure (OPT_SYS_005) is between two servers in the same cluster — it affects message forwarding and Raft replication. Client pending pressure (CONN_002) is between a server and a single client — it affects that client’s message delivery. Route pressure has a wider blast radius because it impacts all traffic between two servers, not just one client’s subscription.

Does route compression help with pending pressure?

Yes. S2 compression on route connections can reduce bandwidth usage by 50-80% for typical NATS message payloads (JSON, Protobuf). This directly reduces the rate at which the pending buffer fills. Use compression: s2_auto for adaptive compression that adjusts based on RTT, or s2_fast for a fixed low-overhead mode. Note that compression adds CPU overhead — if the bottleneck is CPU rather than bandwidth, compression may not help.

Can I increase the route pending buffer limit?

The route pending buffer size is managed internally by the NATS server and is not directly configurable. The server uses the same slow consumer mechanism for routes as for clients. The correct fix is to reduce the rate of pending data accumulation (less traffic, more bandwidth, faster remote processing), not to increase the buffer.

How do I determine which subjects are causing the most route traffic?

Use nats server report accounts to see per-account message rates, then investigate subjects within high-traffic accounts. The /routez endpoint with subs=detail shows subscription counts per route, helping identify which subjects drive the most cross-server forwarding. Subjects with many messages and subscribers on multiple servers contribute the most route traffic.

Proactive monitoring for NATS route pending pressure with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.
