NATS Gateway Pending Pressure: Causes, Diagnosis, and Fixes

Severity: Warning
Category: Performance
Applies to: Server
Check ID: OPT_SYS_014
Detection threshold: gateway connection pending bytes exceed 1 MiB

Gateway pending pressure occurs when the outbound write buffer on a NATS gateway connection exceeds 1 MiB. Gateways are the inter-cluster communication links in a NATS super-cluster — when pending data accumulates faster than it drains, it signals that the receiving cluster cannot keep up with the sending cluster’s message rate, or the network between them is a bottleneck. Left unchecked, gateway pending pressure leads to message delivery latency across clusters, memory growth on the sending server, and eventually gateway disconnections.

Why this matters

Gateways carry all inter-cluster traffic: subscriptions, messages, and protocol control data. Unlike client connections where the server can evict a slow consumer to protect itself, gateway connections are infrastructure — losing a gateway means losing connectivity to an entire cluster. The server will tolerate much more pending data before severing a gateway than it would for a regular client.

This tolerance comes at a cost. While the gateway buffer fills, the sending server allocates increasingly large amounts of memory to hold unsent data. In a super-cluster with multiple gateways, one congested link can consume hundreds of megabytes. The sending server’s write loop for that gateway also backs up, adding latency to every message destined for the remote cluster. If the congestion is mutual — both clusters sending to each other through saturated links — the latency compounds bidirectionally.

The operational impact is subtle but severe. Request-reply patterns that span clusters start timing out. Consumers in one cluster that subscribe to streams sourced from another cluster see growing delivery delays. Monitoring and control plane traffic between clusters slows down, making the system harder to observe exactly when you need visibility most. If the pending pressure persists long enough, the server may disconnect the gateway entirely, triggering a full reconnect cycle that temporarily partitions the super-cluster.

Common causes

  • Insufficient network bandwidth between clusters. The most straightforward cause. If the aggregate message rate across a gateway link exceeds the available bandwidth (accounting for protocol overhead, encryption, and other traffic), pending data accumulates. This is especially common when clusters are connected over WAN links, shared VPNs, or cloud inter-region peering with throughput caps.

  • Poor stream and consumer placement. When a stream lives in Cluster A but most of its consumers are in Cluster B, every message crosses the gateway. If the stream is high-throughput, this single placement decision can saturate the inter-cluster link. The same applies to mirrors and sources that replicate data across clusters — each replicated message consumes gateway bandwidth.

  • Wildcard subscriptions generating excessive fan-out. A subscriber in one cluster with a broad wildcard (> or events.>) forces the gateway to forward matching messages from every other cluster. The aggregate rate of matched messages can far exceed what any individual subject would produce.

  • Gateway interest mode not engaged. By default, NATS gateways start in “optimistic” mode, forwarding all messages to the remote cluster. The gateway switches to “interest” mode once it learns the remote cluster has no subscribers for a subject. If subscriptions churn rapidly or subjects are highly dynamic, the gateway may spend significant time in optimistic mode, forwarding messages that will be dropped on arrival.

  • Burst traffic from batch operations. Imports, backfills, or migration jobs that publish large volumes of data in short bursts can overwhelm gateway capacity even when steady-state traffic fits comfortably. The pending buffer absorbs the burst, but if the burst duration exceeds the buffer’s drain time, pressure builds.

  • TLS overhead on constrained links. Gateway connections are typically TLS-encrypted. On links where CPU is limited or bandwidth is marginal, the encryption/decryption overhead reduces effective throughput, contributing to pending buildup.

How to diagnose

Check gateway pending bytes

Query the server's HTTP monitoring endpoint to inspect connections and their pending state:

Terminal window
curl -s 'http://localhost:8222/connz?sort=pending_bytes&limit=20' | jq '.connections[]'

Look for gateway connections (indicated by the connection type) with elevated pending bytes. Any gateway connection consistently above 1 MiB warrants investigation.

For a more detailed view of gateway-specific metrics:

Terminal window
nats server report gateways

This shows per-gateway statistics including RTT, message rates, and byte rates for each inter-cluster link.

Inspect the monitoring endpoint

The server’s /gatewayz endpoint provides detailed gateway state:

Terminal window
curl -s http://localhost:8222/gatewayz | jq '.outbound_gateways | to_entries[] | {name: .key, pending: .value.connection.pending_bytes, rtt: .value.connection.rtt}'

Key fields:

  • pending_bytes — Current bytes waiting to be written to the remote cluster
  • rtt — Round-trip time to the remote gateway
  • in_msgs / in_bytes / out_msgs / out_bytes — Traffic volume through the gateway
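
For reference, a trimmed /gatewayz response has roughly this shape (values are illustrative; verify field names against your server version):

{
  "outbound_gateways": {
    "cluster-west": {
      "connection": {
        "pending_bytes": 2097152,
        "rtt": "87ms"
      }
    }
  }
}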

Correlate with network metrics

Terminal window
# Check gateway RTT from the NATS perspective
nats server report gateways

If RTT is high (above 20ms for same-region, above 100ms for cross-region), the network link is likely the bottleneck. Cross-reference with infrastructure monitoring for packet loss, bandwidth utilization, and interface errors on the hosts running the NATS servers.
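
It can also help to measure the raw link directly from the gateway host; a minimal sketch, assuming hypothetical hostnames:

Terminal window
# Compare raw network RTT and packet loss against the RTT NATS reports
ping -c 20 nats-west-1.example.internal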

Identify traffic sources

Determine which subjects are driving the most inter-cluster traffic:

Terminal window
# Check stream placements and consumer locations
nats stream ls -a --json | jq -r '.[] | "\(.config.name) cluster=\(.cluster.name) replicas=\(.cluster.replicas | length)"'
# Check which accounts are generating gateway traffic
nats server report accounts

If a single stream or account dominates gateway traffic, targeted placement changes can relieve the pressure.

Programmatic monitoring

Go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type GatewayzResp struct {
	OutboundGateways map[string]struct {
		Connection struct {
			PendingBytes int64  `json:"pending_bytes"`
			RTT          string `json:"rtt"`
		} `json:"connection"`
	} `json:"outbound_gateways"`
}

func checkGatewayPending(monitorURL string, thresholdBytes int64) error {
	resp, err := http.Get(monitorURL + "/gatewayz")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var gw GatewayzResp
	if err := json.NewDecoder(resp.Body).Decode(&gw); err != nil {
		return err
	}

	for name, info := range gw.OutboundGateways {
		if info.Connection.PendingBytes > thresholdBytes {
			fmt.Printf("WARN: gateway %s pending=%d bytes rtt=%s\n",
				name, info.Connection.PendingBytes, info.Connection.RTT)
		}
	}
	return nil
}
Python
import httpx


async def check_gateway_pending(monitor_url: str, threshold_bytes: int = 1_048_576):
    # Use a managed client so the connection pool is closed properly.
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{monitor_url}/gatewayz")
        data = resp.json()
    alerts = []
    for name, gw in data.get("outbound_gateways", {}).items():
        pending = gw.get("connection", {}).get("pending_bytes", 0)
        if pending > threshold_bytes:
            alerts.append({
                "gateway": name,
                "pending_bytes": pending,
                "rtt": gw["connection"].get("rtt"),
            })
    return alerts

How to fix it

Relocate streams closer to their consumers. If most consumers of a stream are in a remote cluster, move the stream (or add a mirror) to that cluster. This converts inter-cluster gateway traffic into intra-cluster route traffic, which typically has much higher bandwidth available:

Terminal window
# Create a cross-domain mirror in the consumer's cluster.
# Cross-domain mirrors are configured via the `--mirror nats:STREAM@DOMAIN`
# syntax (or by feeding a JSON stream config to `--config`); there is no
# `--mirror-domain` flag on `nats stream add`.
nats stream add orders-mirror \
--mirror nats:orders@hub \
--storage file \
--replicas 3

Restrict wildcard subscriptions. If broad wildcards are driving excessive gateway fan-out, narrow the subscription scope or move the subscribing service to the cluster where the data originates.
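
As a sketch of what narrowing looks like in a subscriber (hypothetical subject hierarchy, using the nats.go client):

Go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Instead of a super-cluster-wide wildcard such as "events.>",
	// subscribe only to the slice of the hierarchy this service consumes.
	// The subject layout here is hypothetical.
	if _, err := nc.Subscribe("events.us-east.orders.*", func(m *nats.Msg) {
		log.Printf("received %s", m.Subject)
	}); err != nil {
		log.Fatal(err)
	}

	select {} // keep the subscriber running
}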

Throttle batch operations. If bulk imports or backfills are spiking gateway traffic, rate-limit the publisher or schedule batch work during off-peak hours.
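
A minimal throttling sketch in Go, assuming a hypothetical loadBackfillRecords helper and subject:

Go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// loadBackfillRecords stands in for whatever produces the batch payloads.
func loadBackfillRecords() [][]byte {
	return nil
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Cap the backfill at roughly 1,000 msgs/sec so the burst stays within
	// what the gateway link can drain. Tune the interval to your link.
	ticker := time.NewTicker(time.Millisecond)
	defer ticker.Stop()

	for _, payload := range loadBackfillRecords() {
		<-ticker.C
		if err := nc.Publish("imports.orders", payload); err != nil {
			log.Fatal(err)
		}
	}
}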

Short-term: improve network capacity

Upgrade inter-cluster bandwidth. If the gateway link is genuinely saturated, the fix is more bandwidth. In cloud environments, this may mean moving to dedicated interconnects, larger instance types with higher network performance, or placement groups that optimize cross-AZ throughput.

Enable or verify gateway compression. NATS supports S2 compression on gateway connections, which can significantly reduce bandwidth consumption for compressible payloads:

nats-server.conf
gateway {
  name: "cluster-east"
  port: 7222
  compression: s2_auto
  gateways: [
    { name: "cluster-west", urls: ["nats://west-1:7222"] }
  ]
}

For CPU-constrained hosts, s2_fast trades a little compression ratio for minimal CPU overhead, while s2_auto (shown above) selects a compression level based on the measured RTT to the peer.

Long-term: architect for locality

Design topic topologies that minimize cross-cluster traffic. Keep producers and consumers of high-throughput subjects in the same cluster. Use NATS subject mapping or account-level imports/exports to control which traffic crosses cluster boundaries.
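
As a sketch, account-level exports/imports make the shareable surface explicit: subjects that are never exported cannot generate cross-account traffic at all, which bounds what can reach a gateway. Account names, users, and subjects here are hypothetical:

nats-server.conf
accounts {
  ORDERS: {
    users: [ { user: orders, password: orders } ]
    exports: [ { stream: "orders.public.>" } ]
  }
  ANALYTICS: {
    users: [ { user: analytics, password: analytics } ]
    imports: [
      { stream: { account: ORDERS, subject: "orders.public.>" } }
    ]
  }
}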

Use JetStream sources instead of raw subscriptions for cross-cluster data. Sources give you explicit control over which streams replicate across clusters and can be paused or rate-limited. Raw subscriptions on gateways offer no such control — every matching message flows through.
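
A sketch using the same cross-domain syntax as the mirror example above (stream and domain names are assumptions):

Terminal window
# Aggregate only the orders stream from the hub domain into a local
# stream, rather than subscribing raw across the gateway.
nats stream add orders-local \
  --source nats:orders@hub \
  --storage file \
  --replicas 3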

Monitor gateway metrics continuously. Set up Prometheus alerts on gateway pending bytes to catch pressure before it becomes critical.
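
Short of a full Prometheus rule, even a cron-driven shell check against /gatewayz will catch sustained pressure; a minimal sketch:

Terminal window
# Print an alert when any outbound gateway exceeds 1 MiB pending
curl -s http://localhost:8222/gatewayz | \
  jq -e '[.outbound_gateways[].connection.pending_bytes] | max > 1048576' \
  > /dev/null && echo "ALERT: gateway pending above 1 MiB"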

Synadia Insights evaluates gateway pending pressure automatically across your entire super-cluster deployment, flagging links that are under sustained pressure before they degrade into disconnections.

Frequently asked questions

How is gateway pending pressure different from route pending pressure?

Route pending pressure (CLUSTER_005) occurs between servers within the same cluster. Gateway pending pressure occurs between servers in different clusters. Routes typically use high-bandwidth local network links, so pressure is less common. Gateways often cross WAN links with lower bandwidth and higher latency, making them more susceptible to pending buildup. The diagnostic and remediation approaches are similar, but gateway pressure usually requires network-level or architectural changes rather than simple configuration adjustments.
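
For a side-by-side look, the /routez endpoint exposes per-route pending data as well (field names per recent server versions; verify against your deployment):

Terminal window
# Intra-cluster route pending, for comparison with /gatewayz
curl -s http://localhost:8222/routez | jq '.routes[] | {rid, pending_size}'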

Will increasing the pending buffer limit prevent disconnections?

The server does not expose a configurable pending limit for gateways in the same way it does for client connections. Gateway connections are handled differently — the server maintains larger internal buffers and tolerates more pending data because gateway health is critical to super-cluster connectivity. If pending pressure reaches the point of disconnection, the root cause is sustained bandwidth exhaustion, not buffer sizing. Focus on reducing traffic volume or increasing network capacity.

Can I prioritize certain messages over a gateway?

NATS itself does not support message prioritization on gateway connections. All messages share the same TCP connection and pending buffer. However, you can achieve effective prioritization by placing latency-sensitive streams in the same cluster as their consumers (eliminating gateway traversal entirely) and reserving gateway bandwidth for traffic that genuinely needs to cross cluster boundaries.

Does gateway interest mode help with pending pressure?

Yes. When a gateway transitions from optimistic mode to interest-only mode for a subject, it stops forwarding messages that the remote cluster has no subscribers for. This can significantly reduce gateway traffic. However, interest mode transitions require the remote cluster to have no subscriptions on the subject — wildcard subscriptions prevent interest-mode optimization. Check gateway interest mode status with nats server report gateways and see the Gateway Interest Mode check (OPT_SYS_003) for optimization guidance.

Proactive monitoring for NATS gateway pending pressure with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial