
NATS Consumer Replica Lag: What It Means and How to Fix It

Severity: Warning
Category: Consistency
Applies to: Consumer
Check ID: CONSUMER_002
Detection threshold: consumer replica lag exceeds configured maximum (default: 1,000 operations)

Consumer replica lag means a follower in a replicated JetStream consumer’s Raft group is behind the leader by more than the configured threshold of operations. The lagging replica has stale acknowledgment state — if it becomes leader during a failover, it may redeliver messages that were already acknowledged, causing duplicate processing.

Why this matters

JetStream consumers with replication (R3 or R5) maintain their acknowledgment state across multiple servers via Raft consensus. The leader tracks which messages have been delivered, acknowledged, and are pending redelivery. Followers replicate this state so that any of them can take over as leader if the current leader fails. When a follower falls behind, its view of acknowledged messages is out of date.

The operational risk is during failover. If the leader server goes down and a lagging replica is elected as the new leader, it will redeliver messages it doesn’t know were already acknowledged. For idempotent consumers, this is a minor inefficiency. For consumers that trigger side effects — sending emails, charging credit cards, updating inventory — duplicate processing can cause real business impact. The further behind the replica is, the larger the window of potential redelivery.
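If your handlers are not naturally idempotent, one common mitigation is to deduplicate on the message's stream sequence before triggering the side effect, so a post-failover redelivery is acked without repeating the work. The Go sketch below illustrates the idea with the nats.go client; the subject, durable name, in-memory seen map, and chargeCard callback are illustrative assumptions (a real system would use a durable deduplication store), and a connected JetStreamContext plus the standard time import are assumed.

// Illustrative sketch: skip side effects for stream sequences already processed,
// so redelivery after a failover is acked without repeating the work.
func processOnce(js nats.JetStreamContext, chargeCard func(*nats.Msg) error) error {
	sub, err := js.PullSubscribe("orders.>", "order-processor")
	if err != nil {
		return err
	}
	seen := map[uint64]bool{} // illustrative only; use a durable store in production

	for {
		msgs, err := sub.Fetch(10, nats.MaxWait(2*time.Second))
		if err != nil && err != nats.ErrTimeout {
			return err
		}
		for _, m := range msgs {
			meta, err := m.Metadata()
			if err != nil {
				continue
			}
			if !seen[meta.Sequence.Stream] {
				if err := chargeCard(m); err != nil {
					continue // leave unacked so JetStream redelivers it
				}
				seen[meta.Sequence.Stream] = true
			}
			m.Ack() // safe even if this is a redelivery of already-processed work
		}
	}
}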

Consumer replica lag also serves as an early warning for broader infrastructure issues. A replica that can’t keep up with the leader’s operation rate is often running on a server with disk I/O contention, high CPU load, or network problems. The same conditions that cause consumer replica lag will eventually affect stream replication (JETSTREAM_001), Raft group health, and overall cluster stability. Catching it at the consumer level gives you time to investigate the server before the blast radius expands.

Common causes

  • Disk I/O contention on the follower’s server. Raft logs are persisted to disk. If the follower’s server has slow or saturated disk I/O — shared storage, undersized volumes, noisy neighbors — Raft log writes fall behind the leader’s commit rate.

  • Network latency between leader and follower. Raft replication requires round-trips between the leader and followers. Elevated network latency between the servers delays each replication cycle, allowing operations to accumulate. Cross-region replicas are particularly susceptible.

  • High acknowledgment rate exceeding replication bandwidth. A consumer processing thousands of acks per second generates a corresponding volume of Raft operations. If the follower can’t replicate and persist these operations as fast as the leader produces them, lag grows.

  • Follower recovering after a restart. When a server restarts, its consumer replicas must catch up from where they left off. A consumer that accumulated many operations during the server’s downtime will show significant lag until catch-up completes. This is transient and expected; lag that persists after catch-up should have completed is not.

  • Server CPU or memory pressure. Raft processing competes with all other work on the server — message routing, other Raft groups, JetStream operations. A server under heavy CPU load or memory pressure will deprioritize Raft follower work, causing lag across multiple consumer (and stream) Raft groups.

  • Too many Raft groups on one server. Each replicated stream and consumer is a separate Raft group. A server participating in hundreds or thousands of Raft groups spends significant CPU time on Raft protocol overhead. Consumer Raft groups on overloaded servers show lag first because they tend to be higher-operation-rate than stream groups.

How to diagnose

Check consumer replica status

Inspect a specific consumer’s replication health:

Terminal window
nats consumer info <stream_name> <consumer_name>

The output includes a Cluster section showing each replica’s state, lag, and last active time. A healthy follower shows low lag and a recent active timestamp. A lagging follower shows a lag count exceeding the threshold and potentially a stale active timestamp.
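The same data is exposed programmatically by the nats.go client if you want to feed it into your own tooling. A minimal sketch follows; the ORDERS stream and order-processor consumer names are placeholders, and a connected JetStreamContext plus the standard fmt import are assumed.

// Print per-replica lag for one consumer. Cluster.Replicas lists the followers;
// Lag is the operation count behind the leader, Active the time since last contact.
func printReplicaLag(js nats.JetStreamContext) error {
	info, err := js.ConsumerInfo("ORDERS", "order-processor")
	if err != nil {
		return err
	}
	fmt.Printf("leader: %s\n", info.Cluster.Leader)
	for _, r := range info.Cluster.Replicas {
		fmt.Printf("  %s: current=%v lag=%d active=%s offline=%v\n",
			r.Name, r.Current, r.Lag, r.Active, r.Offline)
	}
	return nil
}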

Report on all consumers for a stream

Terminal window
nats consumer report <stream_name>

This shows all consumers on the stream with their leader, replica count, and lag summary. Sort through the output to find consumers with non-zero lag.
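A programmatic equivalent of this sweep, shown below as a rough sketch, iterates every consumer on the stream and flags followers above the threshold; the 1,000-operation default threshold, a connected JetStreamContext, and the standard fmt import are assumed.

// Flag consumers on a stream whose followers exceed a lag threshold.
func flagLaggingConsumers(js nats.JetStreamContext, stream string, threshold uint64) {
	for info := range js.ConsumersInfo(stream) {
		if info.Cluster == nil {
			continue // no replica data to inspect
		}
		for _, r := range info.Cluster.Replicas {
			if r.Lag > threshold {
				fmt.Printf("%s > %s: replica %s lagging by %d operations\n",
					stream, info.Name, r.Name, r.Lag)
			}
		}
	}
}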

Check the follower server’s health

Once you’ve identified which server is hosting the lagging replica, check its overall health:

Terminal window
# Server-level JetStream report
nats server report jetstream
# Check HA asset count on the server
nats server report jetstream --host <server_name>

Look for:

  • High HA asset count — too many Raft groups on this server (see CLUSTER_003)
  • High CPU — the server is overloaded
  • Storage utilization — disk pressure

Check Raft group health across the cluster

Terminal window
# Broader Raft health
nats server report jetstream

If multiple consumers and streams show replica lag on the same server, the issue is the server, not the individual consumer.

Check network latency between servers

Terminal window
# Route RTT between cluster members
nats server list

Elevated route RTT between the leader’s server and the follower’s server directly impacts Raft replication latency.

How to fix it

Immediate: assess the risk

Determine if the lag is transient or persistent. After a server restart, consumer replicas need time to catch up. Check the lag over two or three collection intervals. If it’s decreasing, the replica is catching up and will resolve on its own. If it’s stable or growing, there’s a structural issue.

Check if the lagging replica is the only follower. For an R3 consumer, one lagging follower out of two is an inconvenience. If both followers are lagging, the consumer has no healthy failover target — the next leader election will produce a leader with stale state regardless.

Terminal window
nats consumer info <stream_name> <consumer_name>

Short-term: address the server-level bottleneck

Relieve disk I/O pressure. If the lagging replica’s server has saturated disk:

Terminal window
# Check disk I/O on the server (Linux)
iostat -xz 1 5

Move other workloads off the server, upgrade to faster storage (NVMe over spinning disk), or reduce the number of Raft groups by moving streams and consumers to other cluster members.

Reduce Raft group count on overloaded servers. If the server has too many HA assets:

Terminal window
# Check per-server HA asset count
nats server report jetstream

Redistribute streams and consumers using placement tags to balance Raft group density across the cluster.
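Because placement is configured at the stream level and consumer Raft groups are hosted on the stream's replica servers, redistribution is done by updating the stream's placement tags. A sketch of that update, assuming a stream named ORDERS and servers started with a tag such as tier:fast (both placeholders), with a connected JetStreamContext:

// Constrain a stream (and therefore its consumers' Raft groups) to tagged servers.
func moveToFastServers(js nats.JetStreamContext) error {
	info, err := js.StreamInfo("ORDERS")
	if err != nil {
		return err
	}
	cfg := info.Config // start from the current config so only placement changes
	cfg.Placement = &nats.Placement{Tags: []string{"tier:fast"}}
	_, err = js.UpdateStream(&cfg)
	return err
}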

Step down the consumer leader if needed. If the current leader is on a problematic server, force a leadership change to a healthy follower:

Terminal window
nats consumer cluster step-down <stream_name> <consumer_name>

Long-term: prevent replica lag structurally

Right-size your replica count. Not every consumer needs R3. Low-criticality consumers that can tolerate redelivery on failover can run at R1, reducing total Raft group count and replication load across the cluster.

Use placement tags to keep hot consumers on capable servers. Placement is configured on the stream, and a consumer’s replicas run on the servers hosting the stream’s replicas, so streams whose consumers have high ack rates should be placed on servers with adequate disk I/O and CPU headroom:

// Go — create a replicated (R3) durable consumer; placement is inherited from the stream
js, _ := nc.JetStream()
_, err := js.AddConsumer("ORDERS", &nats.ConsumerConfig{
	Durable:   "order-processor",
	AckPolicy: nats.AckExplicitPolicy,
	Replicas:  3,
})
if err != nil {
	// handle error
}

Monitor replica lag continuously. Set up alerting on consumer replica lag to catch degradation before it becomes a failover risk.
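If you run your own alerting, a small watcher over ConsumerInfo is enough to surface the signal. A rough sketch, assuming a single stream/consumer pair, the default 1,000-operation threshold, a connected JetStreamContext, and the standard fmt/time imports; the alert callback stands in for whatever alerting pipeline you already use:

// Periodically check one consumer's replica lag and invoke alert when a follower
// exceeds the threshold. The alert callback is a placeholder for your own pipeline.
func watchLag(js nats.JetStreamContext, stream, consumer string, threshold uint64, alert func(string)) {
	for {
		if info, err := js.ConsumerInfo(stream, consumer); err == nil {
			for _, r := range info.Cluster.Replicas {
				if r.Lag > threshold {
					alert(fmt.Sprintf("%s > %s: replica %s lag %d exceeds %d",
						stream, consumer, r.Name, r.Lag, threshold))
				}
			}
		}
		time.Sleep(time.Minute)
	}
}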

Synadia Insights monitors consumer replica lag automatically every collection epoch, tracking lag trends across all replicated consumers and alerting when any replica exceeds the threshold — no per-consumer alerting configuration required.

Frequently asked questions

What is the difference between consumer replica lag and stream replica lag?

Stream replica lag (JETSTREAM_001) measures how far behind a stream follower is in replicating the message log — the raw data. Consumer replica lag (CONSUMER_002) measures how far behind a follower is in replicating the consumer’s acknowledgment state — which messages have been delivered, acked, and nak’d. They are separate Raft groups with separate replication streams. A server can have healthy stream replication but lagging consumer replication, or vice versa.

Will consumer replica lag cause message loss?

Not directly. Consumer replica lag doesn’t lose messages — the messages are safely stored in the stream. What it risks is duplicate delivery. If a lagging replica becomes leader, it may redeliver messages that the previous leader had already marked as acknowledged. For idempotent consumers this is harmless. For consumers with side effects, it can cause duplicate processing.

How long should I wait for a replica to catch up after a server restart?

It depends on the operation volume accumulated during the server’s downtime and the server’s disk I/O capacity. A consumer with 10,000 pending operations on fast NVMe storage should catch up in seconds. A consumer with millions of operations on slow disk may take minutes. Monitor the lag count over consecutive intervals — it should be consistently decreasing. If lag stabilizes or grows after the server has been running for several minutes, there’s a separate performance issue.

Can I force a lagging replica to re-sync?

There’s no direct command to force a consumer replica re-sync in the same way you can with streams. If a consumer replica is persistently lagging despite the server being healthy, you can delete and recreate the consumer. For durable consumers, this resets delivery state — coordinate with the consuming application to avoid reprocessing. In most cases, addressing the underlying server performance issue is preferable to recreation.

How many Raft groups per server is too many?

There’s no hard limit, but performance degrades as the count grows. The CLUSTER_003 check alerts at 1,000 HA assets (streams + consumers combined) per server. In practice, servers with fast NVMe storage and adequate CPU can handle more, while servers on slower storage show lag well before that threshold. Monitor replica lag across multiple consumers on the same server to detect the inflection point for your hardware.

Proactive monitoring for NATS consumer replica lag with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial