
NATS High CPU Usage: What It Means and How to Fix It

Severity: Warning
Category: Performance
Applies to: Server
Check ID: SERVER_003
Detection threshold: per-core CPU usage meets or exceeds 90%

High CPU usage means a NATS server’s per-core CPU consumption has reached or exceeded 90%. Sustained high CPU indicates the server is at capacity — message latency increases, Raft heartbeats may time out causing unnecessary leader elections, and slow consumer disconnections become more likely.

Why this matters

NATS servers are designed to be efficient, but they’re not immune to CPU saturation. The server handles message routing, subscription matching, TLS encryption/decryption, Raft consensus operations, and client connection management — all on the same CPU budget. When CPU usage hits 90%+, all of these operations compete for cycles, and the effects cascade.

Message latency is the first symptom. As CPU becomes scarce, the time to match a published message against the subscription tree and write it to the destination client’s buffer increases. What was sub-millisecond routing becomes multi-millisecond. For latency-sensitive workloads (request-reply patterns, real-time event processing), this is immediately noticeable. For high-throughput fire-and-forget patterns, the throughput ceiling drops.

The more dangerous effect is on Raft consensus. JetStream Raft groups rely on timely heartbeats between the leader and followers. When the leader’s CPU is saturated, heartbeats are delayed. Followers interpret missed heartbeats as leader failure and start elections. The election itself consumes CPU, and the new leader — on the same or a similarly loaded server — may hit the same problem. This creates a feedback loop of leader flapping (META_003) that destabilizes JetStream operations cluster-wide. Slow consumer disconnections also increase under CPU pressure, as the server’s write loop for client connections falls behind.

Common causes

  • High message throughput. The most common cause. The server is routing millions of messages per second and has reached its throughput ceiling. Message routing involves subscription tree lookup, subject matching, and buffer writes for each message — all CPU-intensive at scale.

  • Large number of active connections. Each connection has its own read/write goroutines and associated overhead. Servers with 10,000+ connections spend significant CPU just managing connection state, keep-alives, and buffer management, even at moderate message rates.

  • TLS on all connections. TLS encryption and decryption adds measurable CPU overhead per message, per connection. In deployments where every client connection and every route/gateway connection uses TLS, the cryptographic overhead can consume 20-40% of CPU at high message rates.

  • Complex subscription routing with wildcards. Subscription trees with thousands of wildcard subscriptions (>, *) require more CPU per message to match. A publish to events.orders.us.east.store-42 must be checked against every potentially matching wildcard subscription — the broader and deeper the wildcard usage, the more matching work per message.

  • JetStream Raft replication with many R3 groups. Each R3 stream and consumer maintains a Raft group with regular heartbeats, proposal processing, and log management. Clusters with hundreds of R3 assets generate significant background CPU load from Raft alone, even at low message rates.

  • Garbage collection pressure. The NATS server is written in Go. Under heavy load with large pending buffers and many connections, GC cycles consume CPU time. This is usually a minor contributor compared to message routing but can amplify other causes.

How to diagnose

Check current CPU usage

Terminal window
nats server list

Look at the CPU column. Values at or above 90% per core indicate the threshold has been reached.

Get detailed server metrics

Terminal window
curl -s http://localhost:8222/varz | jq '{cpu: .cpu, cores: .cores, connections: .connections, in_msgs: .in_msgs, out_msgs: .out_msgs, subscriptions: .subscriptions}'

The cpu field shows a percentage that can exceed 100 on multi-core systems (e.g., 400 means 400% across cores). Divide by cores to get per-core utilization. For example, a cpu value of 360 on an 8-core server works out to 45% per core, well under the 90% threshold.

Use nats-top for real-time monitoring

Terminal window
nats-top

nats-top provides a real-time view of message rates (in/out), byte throughput, connection counts, and slow consumers. Watch it during peak traffic to identify what’s driving CPU consumption.

Identify hot connections

Terminal window
nats server report connections --sort out-msgs

Sort connections by message rate to find the highest-throughput publishers and subscribers. A single publisher sending millions of messages per second to a subject with many subscribers can dominate CPU.

Check subscription count and fan-out

Terminal window
nats server report connections --sort subs

Connections with thousands of subscriptions or subjects with high fan-out create disproportionate CPU load during message routing.

Check if Raft is contributing

Terminal window
nats server report jetstream

Examine the number of Raft groups per server. Servers hosting 500+ Raft groups have significant background CPU overhead from Raft heartbeats and log management.

How to fix it

Immediate: reduce the load

Profile CPU usage with the debug endpoint. Use the /debug/pprof/profile endpoint to identify which workload is driving CPU:

Terminal window
# Capture a 30-second CPU profile
curl -o cpu.prof "http://localhost:8222/debug/pprof/profile?seconds=30"
go tool pprof -http=:8080 cpu.prof

Common causes visible in profiles: high-fanout subjects (subscription matching dominates), subscription matching overhead from broad wildcards, and JetStream write pressure (Raft proposal batching and fsync).

Identify and throttle the highest-throughput publishers. If a single publisher or workload is dominating CPU, rate-limit it temporarily:

Terminal window
# Identify top publishers
nats server report connections --sort out-msgs
# Check per-account message rates
nats server report accounts

If a runaway publisher is flooding a subject, fix it at the application level or use account-level message rate limits.
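
If the throttle has to live in the publishing application while you investigate, a token-bucket limiter in front of the publish call is a simple stopgap. A minimal Go sketch, assuming the golang.org/x/time/rate package and a hypothetical budget of 50,000 messages per second:

// Throttled publisher sketch: cap the publish rate with a token bucket.
// The subject, rate, and burst values are placeholders; tune them to your workload.
package main

import (
    "context"
    "log"

    "github.com/nats-io/nats.go"
    "golang.org/x/time/rate"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    limiter := rate.NewLimiter(rate.Limit(50_000), 1_000) // ~50k msgs/s, burst of 1k
    payload := []byte(`{"example": true}`)

    for i := 0; i < 1_000_000; i++ {
        // Wait blocks until the limiter grants another publish slot.
        if err := limiter.Wait(context.Background()); err != nil {
            log.Fatal(err)
        }
        if err := nc.Publish("orders.created", payload); err != nil {
            log.Fatal(err)
        }
    }
}

The same idea works in any client language; the goal is to bound the publish rate below what the server can absorb until the application-level fix lands.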

Drain non-critical connections. If the server is overloaded, temporarily drain connections to reduce load while you address the root cause:

Terminal window
# Enter lame duck mode to gracefully shed connections
nats-server --signal ldm=<pid>

Note: this takes the server out of the cluster. Only use as a last resort on the most overloaded server if other servers can absorb the load.

Short-term: optimize the workload

Distribute connections across servers. If one server has disproportionately more connections, rebalance:

Terminal window
# Check connection distribution
nats server report connections

Configure clients with multiple server URLs so they distribute across the cluster. Use DNS round-robin or a load balancer for initial connection distribution.
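
With the Go client, for example, you can pass a comma-separated list of server URLs; the client picks one at random by default and fails over to the others on reconnect. A short sketch, with placeholder server names:

// Go: connect with every cluster member listed so client connections spread out.
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(
        "nats://server-a:4222,nats://server-b:4222,nats://server-c:4222",
        nats.MaxReconnects(-1), // keep cycling through the server list on reconnect
    )
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    // ... application logic ...
}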

Add queue group subscribers to spread processing. Instead of one subscriber handling all messages on a subject, use queue groups to distribute the load:

// Go: queue group subscription
nc, _ := nats.Connect(url)
nc.QueueSubscribe("orders.>", "processors", func(msg *nats.Msg) {
    // Each message goes to one member of the queue group
    process(msg)
})
# Python: queue group subscription
import nats

async def run():
    nc = await nats.connect()
    sub = await nc.subscribe("orders.>", queue="processors")
    async for msg in sub.messages:
        await process(msg)

Reduce TLS overhead between trusted servers. If all route and gateway connections use TLS but run within a trusted network (same datacenter, VPC), consider disabling TLS on internal routes to reclaim CPU:

# nats-server.conf — cluster routes without TLS
cluster {
    port: 6222
    # no tls block = no TLS on routes
    routes [
        nats-route://server-b:6222
        nats-route://server-c:6222
    ]
}

Keep TLS on client-facing connections. Only remove it on routes and gateways within trusted network boundaries.

Verify GOMAXPROCS matches available cores. By default, Go uses all available CPU cores. In containerized environments, the Go runtime may see the host’s cores rather than the container’s CPU limit:

Terminal window
# Check current setting
curl -s http://localhost:8222/varz | jq '.gomaxprocs, .cores'

If gomaxprocs doesn’t match the actual available cores (e.g., container CPU limit), set it explicitly:

Terminal window
GOMAXPROCS=4 nats-server -c nats.conf

Long-term: scale the infrastructure

Scale out with additional servers. Add servers to the cluster to distribute the routing load. NATS clusters scale horizontally — each additional server absorbs a share of connections and message routing:

# Add a new server to the cluster
cluster {
    routes [
        nats-route://server-a:6222
        nats-route://server-b:6222
        nats-route://server-c:6222
        nats-route://server-d:6222  # new server
    ]
}

Optimize subscription patterns. Replace broad wildcards with specific subscriptions where possible. A subscriber on events.> matching 10,000 subjects creates more routing work than 10 subscribers each on a specific prefix:

Terminal window
# Instead of one subscriber on events.>
# Use multiple focused subscribers
nats sub "events.orders.>"
nats sub "events.inventory.>"
nats sub "events.shipping.>"

Use leafnodes for edge workloads. Leafnodes reduce message routing load on the core cluster by handling local message routing at the edge:

# Leafnode server config
leafnodes {
    remotes [{
        url: "nats-leaf://hub-server:7422"
    }]
}

Set up CPU alerting. Alert before CPU reaches critical levels so you can act proactively.
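
If you do not already have a metrics pipeline feeding your alerting, even a small poller against the monitoring endpoint works as a stopgap. A minimal Go sketch, assuming the monitoring port is 8222 and a hypothetical warning threshold of 75% per core:

// Poll /varz and warn when per-core CPU crosses a threshold.
// The endpoint URL, poll interval, and 75% threshold are assumptions; adjust for your deployment.
package main

import (
    "encoding/json"
    "log"
    "net/http"
    "time"
)

type varz struct {
    CPU   float64 `json:"cpu"`   // total CPU percent; can exceed 100 on multi-core hosts
    Cores int     `json:"cores"` // cores visible to the server
}

func main() {
    const threshold = 75.0 // warn well before the 90% check threshold

    for range time.Tick(30 * time.Second) {
        resp, err := http.Get("http://localhost:8222/varz")
        if err != nil {
            log.Printf("varz fetch failed: %v", err)
            continue
        }
        var v varz
        err = json.NewDecoder(resp.Body).Decode(&v)
        resp.Body.Close()
        if err != nil || v.Cores == 0 {
            log.Printf("varz decode failed: %v", err)
            continue
        }
        if perCore := v.CPU / float64(v.Cores); perCore >= threshold {
            log.Printf("WARNING: per-core CPU at %.1f%% (threshold %.1f%%)", perCore, threshold)
        }
    }
}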

Synadia Insights evaluates per-core CPU usage automatically every collection epoch and flags servers exceeding the threshold, providing context on what workloads are driving the load.

Frequently asked questions

What per-core CPU usage is normal for a NATS server?

It depends on workload, but most production NATS servers run between 10% and 50% per-core CPU. Below 10% may indicate an underutilized server (OPT_IDLE_001). Above 70% warrants monitoring. Above 90% triggers this check because it leaves no headroom for traffic spikes or Raft elections. The ideal operating range is 30-60% — enough headroom for spikes while efficiently using resources.

Does high CPU cause message loss?

Not directly, but indirectly yes. High CPU leads to slow consumer disconnections (SERVER_004) — the server can’t write to client buffers fast enough, the buffer fills, and the client is disconnected. For core NATS subscribers, this means message loss. For JetStream consumers, it means temporary disconnection with replay on reconnect. High CPU also causes Raft leader elections, which can briefly interrupt JetStream publishes.

How does NATS server CPU relate to the number of messages per second?

Roughly linearly for message routing — doubling the message rate approximately doubles the CPU spent on routing. However, factors like message size, subscription tree complexity, TLS, and Raft overhead add non-linear components. A server routing 1 million small messages per second with no TLS and few subscriptions uses very different CPU than one routing 100,000 large TLS-encrypted messages with thousands of wildcard subscriptions.

Should I give the NATS server more CPU cores or higher clock speed?

Both help, but for different workloads. Message routing parallelizes well across cores — more cores handle more concurrent connections and subscription matching. Raft consensus and some internal coordination are more single-threaded and benefit from higher clock speeds. For most workloads, more cores is the better investment. Ensure GOMAXPROCS is set to use all available cores.

Can I limit CPU usage for the NATS server?

You can use OS-level CPU limiting (cgroups, container CPU limits), but this just ensures the server hits its ceiling sooner. The better approach is to reduce the load (fewer connections, lower message rates, less TLS) or increase capacity (more servers, more cores). Artificially limiting CPU on a server that needs it will make the problem worse, not better.

Proactive monitoring for NATS high CPU usage with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial