NATS Stream-Consumer Leader Co-location: Avoiding I/O Hotspots

Severity: Info
Category: Saturation
Applies to: JetStream
Check ID: OPT_BALANCE_007
Detection threshold: stream leader's server hosts more than half of the stream's consumer leaders

A stream-consumer leader co-location alert fires when the server hosting a stream’s Raft leader also hosts more than half of that stream’s consumer leaders. This concentrates both the stream write path (appending messages) and the consumer delivery path (tracking acknowledgments, managing redelivery) on a single server, creating an I/O and CPU hotspot that limits throughput and reduces resilience.

Why this matters

In a NATS JetStream cluster, the stream leader handles all incoming publishes for the stream — it appends messages to storage and replicates them to followers. Each consumer leader independently tracks delivery state, processes acknowledgments, and manages redelivery timers. Both are I/O-intensive operations.

A single server carries disproportionate load. When the stream leader and most consumer leaders share the same server, that server handles: all message writes, all replication coordination, most consumer ack processing, and most redelivery scheduling. The other servers in the cluster sit comparatively idle while one server is saturated.

Throughput hits a ceiling. The bottleneck becomes the single server’s disk I/O, CPU, and network bandwidth. Adding more consumers doesn’t improve aggregate throughput if they all land on the same server. The cluster has horizontal capacity that isn’t being used.

A single server failure has outsized impact. If the co-located server goes down, the cluster loses the stream leader and most consumer leaders simultaneously. While Raft will elect new leaders, the recovery involves multiple leadership transitions happening at once, which can cause a brief but noticeable processing pause.

Consumer latency increases under load. Consumer leaders compete with the stream leader for the same server’s resources. During high-publish-rate periods, the stream leader consumes more disk I/O and CPU, leaving less for consumer ack processing. Consumers experience higher acknowledgment latency, which can trigger redelivery timeouts and duplicate processing.

Common causes

  • Default Raft leader election behavior. Raft doesn’t consider workload distribution when electing leaders. If a server happens to win the stream leader election, its consumer Raft groups may also elect it as leader due to having the most up-to-date log. This creates accidental co-location without any explicit misconfiguration.

  • Server with the fastest disk or lowest latency. If one server has measurably faster storage or lower network latency to peers, Raft elections naturally favor it. It wins more elections across multiple Raft groups, concentrating leadership.

  • All consumers created around the same time. When consumers are created in a batch (e.g., during deployment), they all go through initial leader election simultaneously. The server that’s most responsive at that moment tends to win all the elections.

  • No leader distribution policy. Without explicit leader balancing (via nats server cluster step-down or automated rebalancing), leadership naturally drifts toward whichever server is most consistently available and responsive — which is often the stream leader’s server.

  • Low replica count (R1). With R1 streams and consumers, there’s only one copy — the leader. Leadership distribution isn’t possible because there are no followers to promote. This check primarily applies to R3 or R5 configurations.

How to diagnose

Check leader distribution for a stream

# Show stream and consumer leader placement
nats stream report
# Detailed view of a specific stream's consumers
nats consumer list ORDERS

Look at which server hosts the stream leader, then check which servers host each consumer’s leader. If the same server name appears for the stream leader and the majority of consumer leaders, you have co-location.

Get a cluster-wide leadership view

# Show all Raft group leaders across the cluster
nats server report jetstream

This shows how many stream and consumer leaders each server is hosting. A server hosting significantly more leaders than its peers is likely a co-location hotspot.

Identify co-location programmatically

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	stream, err := js.Stream(ctx, "ORDERS")
	if err != nil {
		log.Fatal(err)
	}
	si, err := stream.Info(ctx)
	if err != nil {
		log.Fatal(err)
	}
	streamLeader := si.Cluster.Leader

	// Count consumer leaders per server
	serverCounts := make(map[string]int)
	consLister := stream.ListConsumers(ctx)
	total := 0
	for ci := range consLister.Info() {
		if ci.Cluster != nil && ci.Cluster.Leader != "" {
			serverCounts[ci.Cluster.Leader]++
			total++
		}
	}
	if err := consLister.Err(); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Stream ORDERS leader: %s\n", streamLeader)
	fmt.Printf("Consumer leaders:\n")
	for server, count := range serverCounts {
		colocated := ""
		if server == streamLeader {
			colocated = " ← STREAM LEADER"
		}
		fmt.Printf("  %s: %d/%d (%.0f%%)%s\n",
			server, count, total,
			float64(count)/float64(total)*100, colocated)
	}
}

How to fix it

Immediate: redistribute consumer leaders

Step down consumer leaders from the co-located server. The nats consumer cluster step-down command forces the current consumer leader to abdicate, triggering a new Raft election that will typically select a different server:

# Step down a specific consumer's leader
nats consumer cluster step-down ORDERS my-consumer
# Step down all consumers on the stream
for consumer in $(nats consumer list ORDERS -n); do
  nats consumer cluster step-down ORDERS "$consumer"
  sleep 1 # avoid overwhelming the cluster with elections
done

After stepping down, verify the new distribution:

nats consumer list ORDERS

Short-term: step down the stream leader

If consumer leaders keep returning to the same server because it’s the stream leader (and thus has the most up-to-date log), step down the stream leader first:

# Step down the stream leader
nats stream cluster step-down ORDERS

This forces a new stream leader election. The new stream leader may be a different server, which changes the dynamics of subsequent consumer leader elections.
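The CLI step-down commands are thin wrappers over JetStream's request API, so the same operation can be scripted. A minimal sketch that builds the leader step-down API subjects; sending the request over a live connection is left as a comment, and the stream and consumer names are illustrative:

```go
package main

import "fmt"

// JetStream exposes leader step-down over its request API:
//   $JS.API.STREAM.LEADER.STEPDOWN.<stream>
//   $JS.API.CONSUMER.LEADER.STEPDOWN.<stream>.<consumer>

// streamStepDownSubject builds the API subject that asks the
// current stream leader to abdicate.
func streamStepDownSubject(stream string) string {
	return fmt.Sprintf("$JS.API.STREAM.LEADER.STEPDOWN.%s", stream)
}

// consumerStepDownSubject builds the equivalent subject for a
// consumer leader.
func consumerStepDownSubject(stream, consumer string) string {
	return fmt.Sprintf("$JS.API.CONSUMER.LEADER.STEPDOWN.%s.%s", stream, consumer)
}

func main() {
	// With a live connection you would send an empty request, e.g.
	//   nc.Request(streamStepDownSubject("ORDERS"), nil, timeout)
	fmt.Println(streamStepDownSubject("ORDERS"))
	fmt.Println(consumerStepDownSubject("ORDERS", "my-consumer"))
}
```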

Short-term: use preferred server placement

For consumers that are frequently recreated (e.g., during deployments), you can influence initial leader placement by ensuring the consumer Raft group’s peers are spread across servers. While you can’t directly set a consumer’s leader, you can step down immediately after creation:

# Create consumer, then immediately rebalance if needed
nats consumer add ORDERS new-consumer --pull --filter "orders.>"
nats consumer cluster step-down ORDERS new-consumer

Long-term: automate leader rebalancing

Run periodic rebalancing. Schedule a job that checks leader distribution and steps down leaders on overloaded servers:

# Check if a server hosts >50% of a stream's consumer leaders
# and step down excess leaders (field names depend on your CLI version)
nats server report jetstream --json | \
  jq -r '.servers[] | select(.leader_count > .expected_leaders) | .name'
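The decision logic for such a rebalancing job can be kept separate from the NATS plumbing and unit-tested on its own. A minimal sketch, with illustrative server (n1, n2) and consumer (c1–c4) names; in a real job the leader map would be populated from the JetStream API and each returned consumer passed to nats consumer cluster step-down:

```go
package main

import "fmt"

// consumersToStepDown returns the consumers whose leaders should
// abdicate so that the stream leader's server hosts at most half of
// the stream's consumer leaders. leaders maps consumer name to the
// server currently hosting that consumer's leader.
func consumersToStepDown(streamLeader string, leaders map[string]string) []string {
	var onLeader []string
	for consumer, server := range leaders {
		if server == streamLeader {
			onLeader = append(onLeader, consumer)
		}
	}
	// "At most half" means co-located count * 2 <= total.
	allowed := len(leaders) / 2
	if len(onLeader) <= allowed {
		return nil
	}
	return onLeader[:len(onLeader)-allowed]
}

func main() {
	// Simulated placement: stream leader n1 also hosts 3 of 4
	// consumer leaders (75%), so one must move.
	leaders := map[string]string{
		"c1": "n1", "c2": "n1", "c3": "n1", "c4": "n2",
	}
	for _, c := range consumersToStepDown("n1", leaders) {
		// A real job would run:
		//   nats consumer cluster step-down ORDERS <consumer>
		fmt.Println("step down:", c)
	}
}
```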

Monitor with Synadia Insights. Insights automatically detects stream-consumer leader co-location across your entire deployment and flags streams where the imbalance exceeds the threshold. This saves you from building and maintaining custom monitoring scripts.

Consider server tags for workload isolation. If certain servers are consistently better suited for stream leadership (e.g., faster disks) and others for consumer leadership, use JetStream placement tags to guide the distribution:

nats-server.conf
server_tags: ["role:consumer-heavy"]

Frequently asked questions

Isn’t some co-location inevitable with small clusters?

Yes. In a 3-server cluster with 3 consumers, each server ideally hosts 1 consumer leader. But Raft elections don’t guarantee even distribution. With only 3 consumers, having 2 on one server (67%) exceeds the 50% threshold. The check is most actionable for streams with many consumers (10+), where redistribution has a meaningful impact on load balancing.

Does stepping down a consumer leader cause message loss?

No. A consumer leader step-down triggers a Raft leadership election among the consumer’s replicas. The new leader picks up from the exact same delivery state — pending messages, ack tracking, redelivery timers. Clients connected to the consumer experience a brief pause (typically < 1 second) during the election but no messages are lost or redelivered.

How often should I rebalance?

After any cluster topology change (server added, removed, restarted) and periodically (weekly or monthly) during normal operations. Rebalancing is cheap — each step-down takes milliseconds — so there’s little risk in doing it frequently. The main cost is the brief leadership transition period per consumer.

Does this apply to R1 streams?

No. R1 streams and consumers have only one replica — the leader. There are no followers to promote, so redistribution isn’t possible. If you need better load distribution, consider upgrading to R3 replication, which gives you replicas across 3 servers and the ability to move leaders.

What if I intentionally want co-location for performance?

In some cases, co-locating stream and consumer leaders on the same server reduces network hops for consumer reads. If the stream leader has the data in its page cache, a co-located consumer leader reads locally instead of over the network. This trade-off makes sense for low-consumer-count streams where the co-location benefit outweighs the hotspot risk. In that case, you can safely ignore this check for those specific streams.

Proactive monitoring for NATS stream-consumer leader co-location with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial
Cancel anytime.