
NATS Underutilized Server: What It Means and How to Fix It

Severity: Info
Category: Health
Applies to: Idle Resources
Check ID: OPT_IDLE_001
Detection threshold: Max CPU < 5% AND max connections < 10 across time range (minimum 5 samples)

An underutilized server is a NATS cluster member that consistently runs below 5% CPU and serves fewer than 10 client connections across the observed time range, with at least 5 data samples confirming the pattern. The server is participating in the cluster — maintaining route connections, exchanging gossip, hosting Raft groups — but doing almost no useful work for clients.

Why this matters

Every server in a NATS cluster has a baseline cost: infrastructure spend (compute, storage, network), operational overhead (monitoring, patching, certificate rotation), and cluster protocol costs (route connections, subscription propagation, Raft heartbeats). An underutilized server pays all of these costs while contributing almost nothing to actual message throughput. In cloud environments, this is direct waste — you’re paying for a node that processes near-zero client traffic.

The operational cost is less obvious but real. Every server in the cluster is another node to monitor, upgrade, and maintain. During rolling upgrades, it’s another step in the process. During incident response, it’s another node to check. If the cluster has 5 servers and one is idle, 20% of your cluster maintenance effort is spent on a node that handles < 1% of the traffic.

Underutilized servers can also mask real capacity problems. If you’re monitoring aggregate cluster capacity and one server appears available, planning models may overestimate headroom. When a traffic spike hits, the idle server can’t absorb load if clients aren’t configured to use it — the spare capacity is theoretical, not practical. The cluster’s effective capacity is lower than it appears.

Common causes

  • Clients don’t include this server in their connection URLs. The most common cause. If application configurations list nats://s1:4222,nats://s2:4222 but omit s3, no client will ever connect to s3 directly. The server participates in routing but never serves clients.

  • DNS discovery doesn’t resolve to this server. If clients use a DNS name to discover NATS servers and the DNS record doesn’t include this server’s IP, clients never connect to it. This happens when servers are added to the cluster but the DNS record isn’t updated.

  • Recently added server that hasn’t been discovered. NATS clients learn about new cluster members through gossip (the server sends its known peers during the INFO handshake). However, existing long-lived connections won’t discover new servers until they reconnect. If clients rarely reconnect, the new server stays empty.

  • Server placed in the wrong placement group. If streams and consumers use placement tags and this server’s tags don’t match any stream’s placement requirements, it won’t host JetStream workloads. It sits in the cluster but has no data to serve.

  • Leftover from a scale-up that was never scaled back. The cluster was expanded during a traffic spike or migration. Traffic subsided or the migration completed, but the extra server was never decommissioned. It remains in the cluster, idle.

How to diagnose

Check server connection and message counts

nats server list

Compare connection counts and message rates across all servers. An underutilized server shows near-zero connections and minimal in/out message counts compared to its peers.

Confirm low CPU over time

# Check current CPU per server (see the CPU column in the listing)
nats server list

The /varz monitoring endpoint provides CPU usage:

curl -s http://localhost:8222/varz | jq '{cpu: .cpu, connections: .connections, in_msgs: .in_msgs, out_msgs: .out_msgs}'

Query this on the suspect server. If CPU is consistently below 5% and connections are in single digits, the server is underutilized. Check across multiple time points to confirm it’s sustained, not just a momentary lull.
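The detection rule can be sketched as a small evaluation function over collected /varz samples. This is an illustrative sketch, not the check's actual implementation; the `cpu` and `connections` keys match the fields pulled out by the jq query above.

```python
def is_underutilized(samples, cpu_pct=5.0, max_conns=10, min_samples=5):
    """Return True when every sample stays below both thresholds.

    Mirrors the detection rule: max CPU < 5% AND max connections < 10
    across the time range, with at least 5 samples to confirm the
    pattern is sustained rather than a momentary lull.
    """
    if len(samples) < min_samples:
        return False  # not enough data to call it sustained
    return (max(s["cpu"] for s in samples) < cpu_pct
            and max(s["connections"] for s in samples) < max_conns)

# Example: five samples from an idle server vs. a busy peer
idle = [{"cpu": 0.4, "connections": 2}] * 5
busy = [{"cpu": 42.0, "connections": 180}] * 5
print(is_underutilized(idle))   # True
print(is_underutilized(busy))   # False
```

Feeding it fewer than five samples returns False by design, matching the check's minimum-sample requirement.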

Check if the server hosts any JetStream data

nats server report jetstream

Look at the Streams and Consumers columns for the suspect server. If both are zero, the server isn’t hosting any JetStream workload. If it hosts some streams, check whether they’re active (see OPT_IDLE_002) — it may be an underutilized server hosting inactive streams.

Verify the server is reachable by clients

# Test direct connectivity
nats rtt --server nats://<server_address>:4222

If the server is reachable but clients aren’t connecting to it, the problem is configuration, not network.

Check client configuration for missing server URLs

Review application deployment configs (Kubernetes manifests, environment variables, config files) to see if this server’s address is included in the NATS connection URL list.
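This review can be partly automated by diffing the configured connection URLs against the cluster's actual membership (as reported by nats server list). A minimal sketch, where the URL string and member names are illustrative placeholders:

```python
from urllib.parse import urlparse

def missing_servers(conn_url, cluster_members):
    """Return cluster members absent from a comma-separated NATS URL list."""
    configured = {urlparse(u).hostname for u in conn_url.split(",")}
    return sorted(set(cluster_members) - configured)

# s3 is in the cluster but missing from the client configuration,
# so no client will ever connect to it directly
print(missing_servers("nats://s1:4222,nats://s2:4222", ["s1", "s2", "s3"]))
# ['s3']
```

Run this against each application's connection string; any non-empty result identifies servers that clients cannot reach by configuration.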

How to fix it

Option A: redirect workload to the idle server

If you want to keep the server, direct traffic to it:

// Go — ensure all servers are listed in client connections
nc, err := nats.Connect(
    "nats://s1:4222,nats://s2:4222,nats://s3:4222", // include the idle server
    nats.Name("my-service"),
)

# Python (nats.py) — include all servers
nc = await nats.connect(
    servers=[
        "nats://s1:4222",
        "nats://s2:4222",
        "nats://s3:4222",  # include the idle server
    ],
    name="my-service",
)

Update DNS records if clients use DNS-based discovery:

# Add the idle server's IP to the DNS record
nats.example.com. 300 IN A 10.0.1.10
nats.example.com. 300 IN A 10.0.1.11
nats.example.com. 300 IN A 10.0.1.12 # was missing

After updating configurations, perform a rolling restart of client applications to trigger reconnection. New connections will distribute across all servers including the previously idle one.

Option A (continued): move JetStream workload to the idle server

If the server should host stream replicas, use placement tags or step-down operations to distribute data:

# Step down stream leaders — a replica already on the idle server
# may pick up leadership (step-down only rotates among existing replicas)
nats stream cluster step-down <stream_name>

# Or explicitly place new streams on the idle server via placement tags
# (the tag must match tags configured on that server)
nats stream add NEW_STREAM \
  --subjects "data.>" \
  --replicas 3 \
  --tag region:us-east

Option B: decommission the idle server

If the server is genuinely unneeded, remove it from the cluster cleanly:

  1. Drain the server to migrate any remaining connections by putting it in lame-duck mode (signal ldm, i.e. SIGUSR2; the argument is the server process ID):

nats-server --signal ldm=<pid>

  2. Remove any JetStream data the server hosts. If it holds stream replicas, they'll be re-placed on other servers once the server leaves the cluster.

  3. Shut down the server process:

nats-server --signal quit=<pid>

  4. Remove the server from the cluster route configurations on the other servers and reload them:

nats-server --signal reload=<pid>

  5. Remove the server from monitoring and alerting to avoid false-positive alerts about a missing node.

Long-term: automate utilization monitoring

Track per-server utilization as a standard metric and flag idle servers automatically.

Build server utilization reviews into your quarterly operational audits. For cloud deployments, tag idle servers for cost review and consider autoscaling policies that remove nodes when sustained utilization drops below a threshold.

Frequently asked questions

Is an underutilized server wasting cluster resources beyond its own infrastructure cost?

Yes, modestly. Every server in the cluster maintains route connections to every other server, propagates subscription interest, and participates in meta cluster Raft (if JetStream is enabled). These costs are small per-server, but they scale with cluster size. An idle server adds latency to subscription propagation and consumes Raft heartbeat bandwidth without contributing to message throughput.
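The scaling claim can be made concrete: NATS cluster routes form a full mesh, so every server connects to every other server and the route count grows quadratically with cluster size. A quick illustration:

```python
def route_connections(n):
    """Route connections in a full-mesh cluster of n servers: n*(n-1)/2."""
    return n * (n - 1) // 2

for n in (3, 5, 6):
    print(f"{n} servers -> {route_connections(n)} routes")
# Removing one idle server from a 6-node cluster drops the mesh
# from 15 routes to 10, and removes one Raft heartbeat peer.
```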

Will removing a server from the cluster affect fault tolerance?

It depends on your replication factor. If you run R3 streams in a 5-server cluster and remove one idle server, you still have 4 servers — enough for R3 with one failure tolerated. If you’re in a 3-server cluster, removing any server drops you to 2, which means R3 streams lose fault tolerance entirely. Check your stream replica counts and cluster size before decommissioning.
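The arithmetic behind this answer is simple Raft quorum math: a group with R replicas tolerates floor((R - 1) / 2) failures, and a cluster can only place R-replica streams if it has at least R servers. A sketch:

```python
def failures_tolerated(replicas, cluster_size):
    """Failures an R-replica Raft group tolerates in a given cluster size.

    Returns None when the cluster is too small to place all replicas.
    """
    if cluster_size < replicas:
        return None  # cannot place R replicas on fewer than R servers
    return (replicas - 1) // 2

print(failures_tolerated(3, 5))  # 1 — R3 in a 5-server cluster
print(failures_tolerated(3, 4))  # 1 — still fine after removing one of 5
print(failures_tolerated(3, 2))  # None — a 3-server cluster minus one cannot host R3
```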

How long should a server be idle before I consider removing it?

The check requires consistent low utilization across the observed time range with at least 5 data samples. In practice, wait at least a few days to rule out weekly traffic patterns, and check for monthly patterns if your workload is cyclical. A server that’s idle Monday through Thursday but handles weekend batch jobs shouldn’t be removed.

Can NATS automatically route clients to underutilized servers?

Not directly. NATS doesn’t load-balance client connections — clients choose which server to connect to based on their connection URL list. However, NATS client libraries randomize the server list by default, so if all servers are listed, connections naturally distribute. The cluster gossip protocol also informs clients about all known servers, so clients that reconnect can discover and connect to the underutilized server automatically.
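The effect of that default randomization can be illustrated with a toy simulation: each client shuffles the server pool and dials the first entry. This is a simplified model, not the client libraries' actual dialing logic (real clients also retry down the list on failure), but it shows why listing every server spreads connections roughly evenly.

```python
import random
from collections import Counter

def simulate_connections(servers, clients, seed=42):
    """Count where clients land when each shuffles the pool and dials first."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(clients):
        pool = servers[:]
        rng.shuffle(pool)     # clients randomize the server pool by default
        counts[pool[0]] += 1  # connect to the first server in the shuffled pool
    return counts

# 900 clients across 3 listed servers land close to 300 each
print(simulate_connections(["s1", "s2", "s3"], 900))
```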

Should I keep an extra server for failover capacity?

Having headroom is good practice — you don’t want every server running at 90% so that a single failure overwhelms the survivors. But a server with near-zero traffic isn’t providing failover capacity unless clients are configured to use it. An idle server that no client knows about can’t absorb traffic during a failure. Fix the client configuration first, then decide if you need the additional capacity.

Proactive monitoring for NATS underutilized server with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial