An underutilized server is a NATS cluster member that consistently runs below 5% CPU and serves fewer than 10 client connections across the observed time range, with at least 5 data samples confirming the pattern. The server is participating in the cluster — maintaining route connections, exchanging gossip, hosting Raft groups — but doing almost no useful work for clients.
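If you want to encode this rule in your own tooling, here is a minimal Go sketch of the threshold logic. The `Sample` type and `isUnderutilized` helper are hypothetical names, not part of any NATS library; the numbers mirror the definition above.

```go
package main

import "fmt"

// Sample is one observation of a server taken from its /varz stats.
type Sample struct {
	CPUPercent  float64 // "cpu" field from /varz
	Connections int     // "connections" field from /varz
}

// isUnderutilized applies the rule above: at least 5 samples, and every
// sample below 5% CPU with fewer than 10 client connections.
func isUnderutilized(samples []Sample) bool {
	if len(samples) < 5 {
		return false // not enough data to confirm the pattern
	}
	for _, s := range samples {
		if s.CPUPercent >= 5.0 || s.Connections >= 10 {
			return false // any busy sample clears the server
		}
	}
	return true
}

func main() {
	samples := []Sample{{1.2, 2}, {0.8, 3}, {2.1, 1}, {0.5, 2}, {1.0, 2}}
	fmt.Println(isUnderutilized(samples)) // true: sustained low utilization
}
```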
Every server in a NATS cluster has a baseline cost: infrastructure spend (compute, storage, network), operational overhead (monitoring, patching, certificate rotation), and cluster protocol costs (route connections, subscription propagation, Raft heartbeats). An underutilized server pays all of these costs while contributing almost nothing to actual message throughput. In cloud environments, this is direct waste — you’re paying for a node that processes near-zero client traffic.
The operational cost is less obvious but real. Every server in the cluster is another node to monitor, upgrade, and maintain. During rolling upgrades, it’s another step in the process. During incident response, it’s another node to check. If the cluster has 5 servers and one is idle, 20% of your cluster maintenance effort is spent on a node that handles < 1% of the traffic.
Underutilized servers can also mask real capacity problems. If you’re monitoring aggregate cluster capacity and one server appears available, planning models may overestimate headroom. When a traffic spike hits, the idle server can’t absorb load if clients aren’t configured to use it — the spare capacity is theoretical, not practical. The cluster’s effective capacity is lower than it appears.
Clients don’t include this server in their connection URLs. This is the most common cause. If application configurations list nats://s1:4222,nats://s2:4222 but omit s3, no client will ever connect to s3 directly. The server participates in routing but never serves clients.
DNS discovery doesn’t resolve to this server. If clients use a DNS name to discover NATS servers and the DNS record doesn’t include this server’s IP, clients never connect to it. This happens when servers are added to the cluster but the DNS record isn’t updated.
Recently added server that hasn’t been discovered. NATS clients learn about new cluster members through gossip (the server sends its known peers during the INFO handshake). However, existing long-lived connections won’t discover new servers until they reconnect. If clients rarely reconnect, the new server stays empty. The sketch after this list of causes shows how to check what a client has actually discovered.
Server placed in the wrong placement group. If streams and consumers use placement tags and this server’s tags don’t match any stream’s placement requirements, it won’t host JetStream workloads. It sits in the cluster but has no data to serve.
Leftover from a scale-up that was never scaled back. The cluster was expanded during a traffic spike or migration. Traffic subsided or the migration completed, but the extra server was never decommissioned. It remains in the cluster, idle.
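If you suspect one of the discovery-related causes, you can ask a client which servers it actually knows about. This is a minimal sketch using the nats.go client; the URLs are placeholders for your own cluster, and `Servers()` / `DiscoveredServers()` are standard nats.go connection methods.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect using the same URL list your applications use.
	nc, err := nats.Connect("nats://s1:4222,nats://s2:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	fmt.Println("connected to:", nc.ConnectedUrl())
	fmt.Println("full pool:", nc.Servers())                     // configured + gossiped
	fmt.Println("learned via gossip:", nc.DiscoveredServers()) // should include new members
}
```

If a recently added server never appears in the gossiped list, check its route connections; if it appears but clients still never land on it, long-lived connections that rarely reconnect are the likely culprit.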
```
nats server list
```

Compare connection counts and message rates across all servers. An underutilized server shows near-zero connections and minimal in/out message counts compared to its peers.
The /varz monitoring endpoint provides CPU usage:

```
# Check current CPU per server
curl -s http://localhost:8222/varz | jq '{cpu: .cpu, connections: .connections, in_msgs: .in_msgs, out_msgs: .out_msgs}'
```

Query this on the suspect server. If CPU is consistently below 5% and connections are in single digits, the server is underutilized. Check across multiple time points to confirm it’s sustained, not just a momentary lull.
```
nats server report jetstream
```

Look at the Streams and Consumers columns for the suspect server. If both are zero, the server isn’t hosting any JetStream workload. If it hosts some streams, check whether they’re active (see OPT_IDLE_002) — it may be an underutilized server hosting inactive streams.
```
# Test direct connectivity
nats rtt --server nats://<server_address>:4222
```

If the server is reachable but clients aren’t connecting to it, the problem is configuration, not network.
Review application deployment configs (Kubernetes manifests, environment variables, config files) to see if this server’s address is included in the NATS connection URL list.
If you want to keep the server, direct traffic to it:
```go
// Go — ensure all servers are listed in client connections
nc, err := nats.Connect(
	"nats://s1:4222,nats://s2:4222,nats://s3:4222", // include the idle server
	nats.Name("my-service"),
)
```

```python
# Python (nats.py) — include all servers
nc = await nats.connect(
    servers=[
        "nats://s1:4222",
        "nats://s2:4222",
        "nats://s3:4222",  # include the idle server
    ],
    name="my-service",
)
```

Update DNS records if clients use DNS-based discovery:
```
; Add the idle server's IP to the DNS record
nats.example.com. 300 IN A 10.0.1.10
nats.example.com. 300 IN A 10.0.1.11
nats.example.com. 300 IN A 10.0.1.12   ; was missing
```

After updating configurations, perform a rolling restart of client applications to trigger reconnection. New connections will distribute across all servers, including the previously idle one.
If the server should host stream replicas, use placement tags or step-down operations to distribute data:
```
# Step down leaders on busy servers — the idle server may pick up leadership
nats stream cluster step-down <stream_name>
```
```
# Or explicitly place new streams on the idle server via tags
nats stream add NEW_STREAM \
  --subjects "data.>" \
  --replicas 3 \
  --tag region:us-east
```

If the server is genuinely unneeded, remove it from the cluster cleanly:
```
# Put the server in lame-duck mode (sends SIGUSR2 to the running process)
nats-server --signal ldm=<pid>
```

Remove any JetStream data the server hosts. If it holds stream replicas, evict it from its Raft groups with `nats server cluster peer-remove <server_name>` so the replicas are re-placed on other servers.
Shut down the server process:
```
nats-server --signal quit
```

If the remaining servers list the removed node explicitly in their route configuration, update the config and reload them:

```
nats-server --signal reload
```

Track per-server utilization as a standard metric and flag idle servers automatically.
Build server utilization reviews into your quarterly operational audits. For cloud deployments, tag idle servers for cost review and consider autoscaling policies that remove nodes when sustained utilization drops below a threshold.
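As a sketch of what automated flagging could look like, the following Go program polls each server's /varz endpoint and prints any node below the thresholds used by this check. The endpoint URLs are assumptions; a real deployment would feed these values into your metrics pipeline rather than printing them, and would apply the sustained multi-sample confirmation described earlier rather than a single poll.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// varz holds the few /varz fields this check needs.
type varz struct {
	Name        string  `json:"server_name"`
	CPU         float64 `json:"cpu"`
	Connections int     `json:"connections"`
}

func main() {
	// One monitoring endpoint per cluster member (placeholder addresses).
	monitorURLs := []string{
		"http://s1:8222/varz",
		"http://s2:8222/varz",
		"http://s3:8222/varz",
	}
	for _, u := range monitorURLs {
		resp, err := http.Get(u)
		if err != nil {
			fmt.Println(u, "unreachable:", err)
			continue
		}
		var v varz
		if err := json.NewDecoder(resp.Body).Decode(&v); err == nil &&
			v.CPU < 5.0 && v.Connections < 10 {
			fmt.Printf("possible idle server %s: cpu=%.1f%% conns=%d\n",
				v.Name, v.CPU, v.Connections)
		}
		resp.Body.Close()
	}
}
```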
Yes, modestly. Every server in the cluster maintains route connections to every other server, propagates subscription interest, and participates in the JetStream meta-cluster Raft group (if JetStream is enabled). These costs are small per server, but they scale with cluster size. An idle server adds latency to subscription propagation and consumes Raft heartbeat bandwidth without contributing to message throughput.
It depends on your replication factor. If you run R3 streams in a 5-server cluster and remove one idle server, you still have 4 servers — enough for R3 with one failure tolerated. If you’re in a 3-server cluster, removing any server drops you to 2, which means R3 streams lose fault tolerance entirely. Check your stream replica counts and cluster size before decommissioning.
The check requires consistent low utilization across the observed time range with at least 5 data samples. In practice, wait at least a few days to rule out weekly traffic patterns, and check for monthly patterns if your workload is cyclical. A server that’s idle Monday through Thursday but handles weekend batch jobs shouldn’t be removed.
Not directly. NATS doesn’t load-balance client connections — clients choose which server to connect to based on their connection URL list. However, NATS client libraries randomize the server list by default, so if all servers are listed, connections naturally distribute. The cluster gossip protocol also informs clients about all known servers, so clients that reconnect can discover and connect to the underutilized server automatically.
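To make the client-side behavior concrete, here is a small Go sketch. nats.go shuffles the server pool by default, so listing every server spreads connections; the `nats.DontRandomize()` option shown below disables the shuffle and is included only to make the default visible, since pinning clients to list order works against utilizing an idle server.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Default behavior: the pool is randomized, so connections from many
	// clients spread across s1, s2, and s3.
	nc, err := nats.Connect("nats://s1:4222,nats://s2:4222,nats://s3:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Opting out pins every client to the list order, which works against
	// spreading load to an idle server.
	pinned, err := nats.Connect(
		"nats://s1:4222,nats://s2:4222,nats://s3:4222",
		nats.DontRandomize(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer pinned.Close()
}
```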
Having headroom is good practice — you don’t want every server running at 90% so that a single failure overwhelms the survivors. But a server with near-zero traffic isn’t providing failover capacity unless clients are configured to use it. An idle server that no client knows about can’t absorb traffic during a failure. Fix the client configuration first, then decide if you need the additional capacity.