NATS Route Count Low: What It Means and How to Fix It

Severity: Warning
Category: Health
Applies to: Cluster
Check ID: CLUSTER_005
Detection threshold: active routes fewer than expected for cluster size (expected: N-1 routes per server)

A low route count means a NATS server has fewer active cluster route connections than expected for the cluster size. In a fully meshed N-node cluster, each server should have exactly N-1 routes. A missing route means at least one cluster peer is unreachable, breaking the full mesh and potentially isolating clients, fragmenting subscription interest, and disrupting JetStream Raft consensus.
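The N-1 rule can be written down as a small check. This is a minimal sketch, not part of any NATS tooling: the function names `expected_routes` and `check_route_count` are made up for illustration, and the caller is assumed to already know the cluster size.

```shell
# Expected routes per server in a full mesh of N servers: N - 1.
expected_routes() {
  echo $(( $1 - 1 ))
}

# Hypothetical helper: compare one server's active route count with N - 1.
# $1: active routes reported by the server, $2: cluster size N.
# Prints a status line and returns non-zero when the count is low.
check_route_count() {
  active=$1
  expected=$(expected_routes "$2")
  if [ "$active" -lt "$expected" ]; then
    echo "LOW: ${active}/${expected} routes"
    return 1
  fi
  echo "OK: ${active}/${expected} routes"
}
```

For example, `check_route_count 1 3` reports a low count for a server that sees only one of its two expected peers. The active count could come from the `num_routes` field of the server's /routez output.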

Why this matters

NATS clusters form a full mesh topology — every server maintains a direct route connection to every other server in the cluster. These routes carry subscription interest propagation, message forwarding between servers, and JetStream Raft replication traffic. When a route is missing, the affected servers cannot communicate directly.

The impact depends on which route is missing and what traffic flows through it. At minimum, clients connected to one server cannot reach subscribers connected to the disconnected peer — messages published on one side don’t reach subscriptions on the other. Subscription interest propagation stops, so new subscriptions created on one server aren’t visible to the disconnected peer. For core NATS, this means silent message loss: publishers succeed (NATS is fire-and-forget), but the messages never reach subscribers on the unreachable server.

For JetStream, the consequences are more severe. Raft groups require a majority of members to communicate for leader election and log replication. In a 3-node cluster, a three-replica group keeps quorum with one server cut off (two of three), but it has no fault tolerance left, and the isolated server cannot elect a leader or commit writes on its own. If that server was the leader, the group pauses until the majority side elects a new one; if a second peer becomes unreachable, the group loses quorum, and its streams may lose their leader or stall entirely. The meta cluster, which manages all JetStream asset metadata, is also a Raft group and is equally affected. Without meta quorum, no new streams or consumers can be created cluster-wide.
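The majority requirement is plain arithmetic: a group of M members needs floor(M/2) + 1 of them able to communicate. The sketch below only illustrates that rule; `raft_quorum` and `has_quorum` are hypothetical helper names, not NATS commands.

```shell
# Majority needed by a Raft group with $1 members: floor(M / 2) + 1.
raft_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds when $1 reachable members still form a majority of $2 members.
has_quorum() {
  [ "$1" -ge "$(raft_quorum "$2")" ]
}
```

With these definitions, `has_quorum 2 3` succeeds and `has_quorum 1 3` fails: a three-member group survives one unreachable member but not two.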

A low route count is often the first signal of a network partition, a crashed server, or a misconfigured firewall rule. Detecting it quickly — before operators notice the downstream symptoms of message loss or JetStream stalls — is critical for maintaining cluster health.

Common causes

  • Network partition between cluster peers. A switch failure, VLAN misconfiguration, or routing change isolates one or more servers from their peers. Route connections drop, but both sides remain running and serving their local clients — unaware that the cluster is fragmented.

  • Firewall blocking the route port. The default route port is 6222. A firewall rule change, security group update, or iptables modification that blocks this port prevents route connections from establishing. This commonly happens after infrastructure changes that don’t account for NATS’s separate ports for client (4222), route (6222), and gateway (7222) traffic.

  • Server crash or process failure. If a cluster peer crashes, its route connections drop. The remaining servers see a reduced route count until the crashed server restarts. This is often accompanied by SERVER_008 (Server Restarted) or SERVER_001 (Server Health) alerts.

  • DNS resolution failure. If route URLs use hostnames and DNS resolution fails for one peer, the server cannot establish or re-establish the route. This is common in Kubernetes environments where pod DNS depends on CoreDNS availability.

  • Missing or incorrect route URL in configuration. A server’s configuration is missing the route URL for one or more peers. This can happen when a new server is added to the cluster but not all existing servers are updated to include the new peer’s route URL in their config.

  • TLS handshake failure. If routes are configured with TLS and certificates have expired, been rotated inconsistently, or have mismatched CA chains, the TLS handshake fails and the route cannot be established. The server logs will show TLS errors, but the visible symptom is a missing route.

How to diagnose

Check route counts across all servers

    nats server list

Look at the Routes column. In an N-node cluster, every server should show N-1 routes. Any server showing fewer routes has a connectivity problem.

Identify which peer is missing

Query the route details from the server with the low count:

    curl -s http://<server-host>:8222/routez | jq '.routes[] | {remote_id, ip, port}'

Compare the list of connected route peers against the expected cluster membership. The missing entry identifies which peer is unreachable.
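The comparison can be scripted. The following is a rough sketch under two assumptions: peers are identified by the `ip` field shown in the query above, and you supply the expected peer addresses yourself (excluding the server being queried). `missing_peers` is a hypothetical helper name.

```shell
# Hypothetical helper: print every expected peer that is absent from the
# connected list. Connected peers arrive on stdin, one per line;
# expected peers are passed as arguments.
missing_peers() {
  connected=$(cat)
  for peer in "$@"; do
    # -x matches whole lines and -F treats the peer as a fixed string,
    # so 10.0.1.1 does not match 10.0.1.10.
    printf '%s\n' "$connected" | grep -qxF -- "$peer" || echo "$peer"
  done
}
```

A possible invocation, reusing the query above with jq's raw output mode: `curl -s http://<server-host>:8222/routez | jq -r '.routes[].ip' | missing_peers 10.0.1.11 10.0.1.12`. Any address printed is a peer with no active route.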

Check server logs for route errors

Server logs will show route connection failures with details:

    # Search for route-related errors
    journalctl -u nats-server --since "1 hour ago" | grep -i "route"

Common log patterns:

  • Error trying to connect to route — active connection attempt failing
  • Route connection closed — an established route dropped
  • TLS handshake error — certificate issue on the route connection
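To see which of these patterns dominates, the log lines can be tallied. This sketch assumes the log text contains the three phrases exactly as listed above; the function name and the output labels are made up for illustration.

```shell
# Hypothetical helper: read server log lines on stdin and count the three
# route-related patterns listed above.
summarize_route_errors() {
  awk '
    /Error trying to connect to route/ { connect++ }
    /Route connection closed/          { closed++ }
    /TLS handshake error/              { tls++ }
    END {
      printf "connect_failures=%d\n", connect
      printf "routes_closed=%d\n", closed
      printf "tls_errors=%d\n", tls
    }'
}
```

It could be fed from the same source as the search above, for example `journalctl -u nats-server --since "1 hour ago" | summarize_route_errors`.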

Test network connectivity on the cluster port

    # Test route port connectivity from one server to another
    nc -zv <peer-host> 6222
    # Check if the route port is listening on the target server
    ss -tlnp | grep 6222
    # Check firewall rules for the cluster port
    iptables -L -n | grep 6222
    # Verify DNS resolution for route hostnames
    dig <peer-hostname>

If the port check fails, the issue is network-level: a firewall, routing, DNS resolution, or the peer not listening on the route port.
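When there are several peers, the port probe can be wrapped in a loop. This is a sketch under stated assumptions: `probe_port` and `check_route_port` are hypothetical names, `ROUTE_PORT` is an illustrative variable that defaults to 6222, and the installed `nc` is assumed to support `-z` and `-w`.

```shell
# Probe one TCP port; succeeds when a connection can be opened.
# $1: host, $2: port. Uses a 3-second timeout.
probe_port() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Hypothetical helper: report route-port reachability for each peer passed
# as an argument. Returns non-zero if any peer is unreachable.
check_route_port() {
  port=${ROUTE_PORT:-6222}
  failed=0
  for peer in "$@"; do
    if probe_port "$peer" "$port"; then
      echo "reachable: ${peer}:${port}"
    else
      echo "UNREACHABLE: ${peer}:${port}"
      failed=1
    fi
  done
  return "$failed"
}
```

Running `check_route_port <peer-host-1> <peer-host-2>` from each server in turn shows which direction of which route is blocked.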

Check route RTT

If routes are connected but unstable, check the latency:

    curl -s http://<server-host>:8222/routez | jq '.routes[] | {remote_id, rtt}'

High RTT on a route connection can cause timeouts and intermittent disconnections.
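The `rtt` value is a duration string rather than a number, so comparing it against a threshold takes a conversion step. The sketch below assumes single-unit durations such as `850µs`, `1.2ms`, or `2s`; compound values such as `1m2s` are not handled. The helper names and any threshold you pass are illustrative, not NATS defaults.

```shell
# Hypothetical helper: convert a single-unit duration string to milliseconds.
# Fails on units it does not recognize.
rtt_to_ms() {
  case $1 in
    *ms) awk -v v="${1%ms}" 'BEGIN { printf "%.3f\n", v }' ;;
    *µs) awk -v v="${1%µs}" 'BEGIN { printf "%.3f\n", v / 1000 }' ;;
    *us) awk -v v="${1%us}" 'BEGIN { printf "%.3f\n", v / 1000 }' ;;
    *ns) awk -v v="${1%ns}" 'BEGIN { printf "%.3f\n", v / 1000000 }' ;;
    *s)  awk -v v="${1%s}"  'BEGIN { printf "%.3f\n", v * 1000 }' ;;
    *)   return 1 ;;
  esac
}

# Succeeds when the RTT in $1 is above the limit in $2 (milliseconds).
rtt_exceeds() {
  awk -v rtt="$(rtt_to_ms "$1")" -v limit="$2" 'BEGIN { exit !(rtt > limit) }'
}
```

For instance, `rtt_exceeds 120ms 50` succeeds, which could be used to flag a route for closer inspection.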

How to fix it

Immediate: restore connectivity

If a server has crashed, restart it:

    systemctl restart nats-server

If the issue is a firewall rule, open the route port:

    # Example: allow route port in iptables
    iptables -A INPUT -p tcp --dport 6222 -j ACCEPT
    # Or in cloud security groups, ensure port 6222 is open
    # between all cluster member IPs

If DNS is failing, verify resolution and consider using IP addresses as a temporary workaround:

    cluster {
      name: "C1"
      routes = [
        "nats-route://10.0.1.10:6222"
        "nats-route://10.0.1.11:6222"
        "nats-route://10.0.1.12:6222"
      ]
    }

Short-term: fix configuration gaps

Ensure all servers use the same cluster name and a complete routes list. Strictly, a server needs only one reachable peer's route URL to join, because gossip discovers the rest, but listing every peer is recommended so that a server can rejoin even when some peers are down:

    cluster {
      name: "C1"
      listen: "0.0.0.0:6222"
      routes = [
        "nats-route://s1.example.com:6222"
        "nats-route://s2.example.com:6222"
        "nats-route://s3.example.com:6222"
      ]
    }

It’s safe (and recommended) for a server to include its own address in the routes list — it will simply skip connecting to itself.

If TLS certificates have expired, rotate them and reload:

    # After updating certificates
    nats-server --signal reload=<pid>

TLS configuration changes on routes take effect on reload without a restart — new route connections will use the updated certificates.

Long-term: automate and monitor

Use configuration management for route URLs. In dynamic environments (Kubernetes, auto-scaling groups), use DNS-based route discovery or configuration management to ensure route URLs stay current:

    # Kubernetes: NATS Helm chart handles route discovery automatically
    # For manual deployments, use a shared config template
    cluster {
      name: "C1"
      routes = [
        {% for server in nats_servers %}
        "nats-route://{{ server.hostname }}:6222"
        {% endfor %}
      ]
    }

Monitor route counts continuously. Export the route count from /routez to your monitoring stack and alert when any server reports fewer than N-1 routes.
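One way to export the count is to emit it as metric lines that a collector can scrape. The sketch below writes Prometheus-style text. The metric names `nats_routes_active` and `nats_routes_expected` and the function name are invented for illustration, so adapt them to whatever your monitoring stack expects.

```shell
# Hypothetical helper: print route metrics in Prometheus text format.
# $1: server name, $2: active routes (for example num_routes from /routez),
# $3: cluster size N.
route_metrics() {
  printf 'nats_routes_active{server="%s"} %d\n' "$1" "$2"
  printf 'nats_routes_expected{server="%s"} %d\n' "$1" "$(( $3 - 1 ))"
}
```

An alert rule can then fire whenever the active value stays below the expected value for longer than a short grace period, which avoids paging on routine restarts.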

Set up certificate rotation automation. If routes use TLS, automate certificate renewal with cert-manager (Kubernetes) or certbot, and configure NATS to reload on certificate changes. Expired certificates are a preventable cause of route failures.

Test route connectivity in CI/CD. Before deploying configuration changes that affect networking (firewall rules, security groups, route URLs), validate that all cluster members can reach each other on the route port.

Frequently asked questions

How quickly does NATS detect a missing route?

NATS servers detect a dropped route connection almost immediately through TCP keepalives and the internal ping/pong mechanism. Once detected, the server begins attempting to re-establish the route. Reconnection attempts use exponential backoff, so a transiently unavailable peer will reconnect within seconds. If the peer is down or unreachable, the server continues retrying indefinitely.

Can the cluster still function with a missing route?

For core NATS, the cluster functions in a degraded state: servers that can still reach each other continue forwarding messages, but clients on the disconnected server are isolated. For JetStream, it depends on how many members of each Raft group can still communicate. In a 3-node cluster, a three-replica group keeps its majority (two of three) when one server is disconnected, but a second loss breaks quorum. In a 5-node cluster, a five-replica group can lose two servers and still have a majority.

Does adding a new server to the cluster require updating all existing configs?

If you list all route URLs explicitly, yes — every existing server needs the new server’s route URL added. However, NATS supports route gossip: once a new server connects to any existing server, the existing servers learn about the new peer and establish routes automatically. You only need one existing server’s URL in the new server’s config. That said, listing all URLs is the best practice for resilience — it ensures the new server can join even if its initial contact server is down.

What’s the difference between a missing route and a gateway disconnection?

Routes connect servers within the same cluster (intra-cluster). Gateways connect servers in different clusters (inter-cluster). A missing route (CLUSTER_005) means a cluster peer is unreachable. A gateway disconnection (CLUSTER_007) means an entire remote cluster is unreachable. Both are connectivity issues but at different scopes, and they use different ports (route: 6222, gateway: 7222 by default).

Can I have a NATS cluster without full mesh routes?

No. NATS requires a full mesh between all cluster members — every server must have a route to every other server. There is no partial mesh or hub-and-spoke topology for cluster routes. If you need to connect servers across regions without full mesh, use gateways (for cluster-to-cluster) or leafnodes (for leaf-to-hub).
