
NATS Peer Lag Critical: Detecting Stream Replica Replication Lag

Severity: Critical
Category: Consistency
Applies to: JetStream
Check ID: JETSTREAM_022
Detection threshold: replica lag exceeds the operator-defined io.nats.monitor.peer-lag-critical value

In a replicated JetStream stream, each follower replica receives messages from the leader and appends them to its local store. Peer lag is the number of operations (messages) a replica is behind the leader. When a replica’s lag exceeds the operator-defined threshold set via io.nats.monitor.peer-lag-critical, this check fires. High peer lag means one or more replicas do not have a current copy of the stream’s data — creating a window where a leader failure could result in acknowledged messages being lost or temporarily unavailable.

Why this matters

JetStream replication is synchronous for quorum — the leader waits for a majority of replicas to acknowledge a write before confirming it to the publisher. However, non-quorum replicas are allowed to fall behind. In a three-replica stream, the leader and one replica form quorum. The third replica can lag without affecting write acknowledgments. This means lag on a minority of replicas is invisible to publishers and consumers under normal operation.

The risk emerges during failures. If the leader fails, one of the remaining replicas becomes the new leader. A lagging replica cannot normally win the election while more up-to-date peers are reachable, but if those peers are also lost or unreachable and only the lagging copy of the data survives, the stream comes back without its most recent messages. Consumers see a gap, and publishers that received acknowledgments for messages the old leader committed may find those messages missing. This is the fundamental danger of peer lag: it reduces the effective durability guarantee of the stream.

Even without leader failure, high lag indicates a systemic problem. The lagging replica is not keeping up with the stream’s write rate. If the lag is growing rather than stable, the replica will eventually fall so far behind that it requires a full snapshot to recover rather than incremental catch-up — a much more expensive operation that can temporarily impact the leader’s performance.

Sustained replication lag also affects read-after-write consistency for consumers that happen to be connected to the lagging replica’s server. While JetStream consumers typically follow the leader, direct access patterns and certain API calls may return stale data from a lagging replica.

Common causes

  • Resource contention on the replica server. The server hosting the lagging replica is under CPU, memory, or disk I/O pressure. Slow disk writes are the most common bottleneck — the replica cannot flush incoming messages to storage as fast as they arrive. This is especially prevalent when the server hosts many streams or when other workloads share the same disk.

  • Network latency or packet loss between leader and replica. The Raft replication protocol sends message batches from the leader to each replica. Network degradation between the leader’s server and the replica’s server slows replication. Even modest packet loss (0.1%) can significantly reduce replication throughput at high message rates.

  • High write rate exceeding replica throughput. A burst of publishes — batch imports, backfills, or traffic spikes — pushes the stream’s write rate above what the replica can sustain. The leader queues operations for the replica, and lag grows until the burst subsides and the replica can catch up.

  • Large message payloads. Streams with large messages (>100KB) generate proportionally more disk I/O and network traffic per operation. A replica that keeps up fine with small messages may struggle when message sizes increase.

  • Disk I/O saturation. The replica server’s storage is the bottleneck. Multiple streams sharing the same volume, combined with heavy reads (consumer replay) and writes (replication), saturate disk bandwidth. SSDs mitigate this but are not immune under extreme load.

  • JetStream storage compaction or snapshot. When the leader performs a stream compaction or sends a snapshot to a far-behind replica, both operations consume significant I/O. During these windows, other replicas on the same server may also experience increased lag.

How to diagnose

Check replica lag for a specific stream

Terminal window
nats stream info MY_STREAM

The cluster section shows each replica’s lag value. The lag number represents operations (messages) behind the leader. A lag of 0 means the replica is fully caught up.

For a machine-readable view:

Terminal window
nats stream info MY_STREAM --json | jq '.cluster.replicas[] | {name: .name, lag: .lag, active: .active, offline: .offline}'

Identify lagging replicas across all streams

Terminal window
nats server report jetstream --streams

This shows all streams and their replica states. Sort by lag to find the worst offenders.
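
To get a quick per-stream ranking, you can loop over streams and extract the worst replica lag from the JSON output (a sketch assuming bash, jq, and a CLI version that supports nats stream ls --names):

Terminal window
# Print each stream with its worst replica lag, sorted descending
for s in $(nats stream ls --names); do
  lag=$(nats stream info "$s" --json | jq '[.cluster.replicas[]?.lag // 0] | max // 0')
  echo "$s $lag"
done | sort -k2 -nr | head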

Determine if lag is growing or stable

Check lag at two points in time. If the lag value is increasing between checks, the replica is falling further behind. If it is stable, the replica is keeping up with the current rate but has a historical gap to close.

Terminal window
# Check now
nats stream info MY_STREAM --json | jq '.cluster.replicas[] | {name: .name, lag: .lag}'
# Wait 30 seconds, check again
sleep 30
nats stream info MY_STREAM --json | jq '.cluster.replicas[] | {name: .name, lag: .lag}'
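
To watch the trend continuously, a small loop over the same command works (a sketch assuming bash and jq; MY_STREAM is a placeholder):

Terminal window
# Sample the worst replica lag every 30 seconds and show the change between samples
prev=0
while true; do
  lag=$(nats stream info MY_STREAM --json | jq '[.cluster.replicas[]?.lag // 0] | max // 0')
  echo "$(date +%T) max lag=$lag change=$((lag - prev))"
  prev=$lag
  sleep 30
done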

Check for resource contention on the replica server

Terminal window
# Find the busiest connections by inbound message rate
nats server report connections --sort in-msgs
# Review per-server JetStream asset counts and storage usage
nats server report jetstream

Look for servers with high CPU usage, high memory utilization, or many streams placed on a single server.
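
If you have shell access to the replica's host, OS-level tools give a more direct view of disk and CPU pressure (a sketch assuming a Linux host with the sysstat package installed):

Terminal window
# Per-device utilization, write latency, and queue depth: three 5-second samples
iostat -x 5 3
# CPU, memory, and swap pressure overview
vmstat 5 3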

Programmatic detection in Go

// Assumes nc is an established *nats.Conn from github.com/nats-io/nats.go.
js, err := nc.JetStream()
if err != nil {
    log.Fatal(err)
}

info, err := js.StreamInfo("MY_STREAM")
if err != nil {
    log.Fatal(err)
}

lagThreshold := uint64(10000) // your critical threshold

// Cluster is nil for unreplicated streams, so guard before iterating.
if info.Cluster != nil {
    for _, r := range info.Cluster.Replicas {
        if r.Lag > lagThreshold {
            log.Printf("CRITICAL: replica %s lag=%d exceeds threshold=%d",
                r.Name, r.Lag, lagThreshold)
        }
    }
}

Programmatic detection in Python
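
The nats-py client exposes the same replica information. A minimal equivalent sketch, assuming nats-py is installed; the server URL and MY_STREAM are placeholders:

import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    info = await js.stream_info("MY_STREAM")
    lag_threshold = 10000  # your critical threshold

    # cluster is None for unreplicated streams; lag may be None when fully caught up
    if info.cluster and info.cluster.replicas:
        for r in info.cluster.replicas:
            if (r.lag or 0) > lag_threshold:
                print(f"CRITICAL: replica {r.name} lag={r.lag} "
                      f"exceeds threshold={lag_threshold}")

    await nc.close()

asyncio.run(main())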

How to fix it

Immediate: reduce write pressure

If the lag is caused by a traffic burst, consider temporarily throttling publishers or pausing non-critical write workloads to give the replica time to catch up. This is a triage action, not a solution.

Short-term: address the bottleneck

Disk I/O saturation. Move the lagging replica’s stream data to faster storage. SSDs are strongly recommended for JetStream workloads. If the server is already on SSDs, check for noisy neighbors — other streams or processes competing for I/O.

Terminal window
# Check server-level JetStream storage usage
nats server report jetstream

Network issues. Check route-level health between cluster members:

Terminal window
nats server list
nats rtt --server nats://replica-server:4222

If RTT is elevated or unstable, investigate the network path between the leader and replica servers.
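
For a longer-running view of loss and latency along that path, a tool such as mtr is useful (a sketch; replica-server is a placeholder hostname):

Terminal window
# 60 probe cycles in report mode; look for loss or rising latency at any hop
mtr --report --report-cycles 60 replica-server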

Server overload. If the replica server hosts too many streams, redistribute streams across the cluster:

Terminal window
nats stream cluster peer-remove MY_STREAM OVERLOADED_SERVER

JetStream will place a new replica on a less loaded server.
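
Afterwards, verify that the replacement replica appears and watch its lag shrink as it catches up (same JSON fields as above):

Terminal window
nats stream info MY_STREAM --json | jq '.cluster.replicas[] | {name: .name, lag: .lag, current: .current}'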

Long-term: prevent recurrence

Set the monitoring threshold. Define the io.nats.monitor.peer-lag-critical metadata tag so this check fires before lag becomes dangerous:

Terminal window
nats stream edit MY_STREAM \
--metadata "io.nats.monitor.peer-lag-critical=10000"

Choose a threshold based on your stream’s write rate and your tolerance for a potential data gap during failover. A common heuristic: set the threshold to the number of messages published in 30–60 seconds at peak rate. For example, a stream that peaks at 2,000 messages per second would get a threshold in the 60,000–120,000 range.
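
To estimate the peak rate, sample the stream’s last sequence number twice and divide by the interval. A rough sketch assuming jq (last_seq grows with every accepted publish):

Terminal window
s1=$(nats stream info MY_STREAM --json | jq '.state.last_seq')
sleep 60
s2=$(nats stream info MY_STREAM --json | jq '.state.last_seq')
echo "approx publish rate: $(( (s2 - s1) / 60 )) msgs/s"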

Use placement constraints. Distribute replicas across servers with balanced resource profiles:

Terminal window
nats stream edit MY_STREAM --tag "storage:nvme"

Right-size your cluster. Ensure sufficient servers with adequate disk and network capacity to handle the aggregate write rate of all streams. Plan for peak, not average.

Monitor Raft health broadly. Peer lag is one symptom of Raft group health issues. Also monitor the Raft Apply Lag (OPT_SYS_007) and Raft Sustained Catching Up (OPT_SYS_013) checks for a complete picture of replication health.

Frequently asked questions

How much lag is normal?

For a healthy stream, replicas should have a lag of 0 or near-zero most of the time. Transient lag during traffic spikes is expected — a few hundred operations of lag that resolves within seconds is normal. Sustained lag of thousands or more operations, or lag that grows over time, indicates a systemic problem that needs attention. The appropriate threshold depends on your stream’s write rate and your durability requirements.

Does peer lag affect consumers?

Not directly in most cases. JetStream consumers follow the stream leader for reads, so they see the latest committed data. However, if the leader fails and a lagging replica becomes the new leader, consumers may temporarily see a sequence gap for the operations the new leader never received. Additionally, certain API calls that are served by the local server rather than forwarded to the leader, such as direct gets on streams that allow them, may return stale data from a lagging replica.

Can I force a lagging replica to catch up faster?

There is no direct mechanism to increase a specific replica’s catch-up speed. The leader sends data as fast as the replica can acknowledge it. To speed up catch-up, remove the bottleneck: free up disk I/O on the replica server, reduce network latency, or reduce competing workloads. If the lag is so large that incremental catch-up is impractical, removing and re-adding the peer forces a fresh snapshot-based replication, which may be faster for very large gaps.

What happens if lag exceeds the stream’s retention window?

If a replica falls so far behind that the leader has discarded the lagging operations (due to retention policy), the replica cannot catch up incrementally. It will need a full stream snapshot from the leader to resynchronize. This is a heavyweight operation that temporarily increases the leader’s disk and network usage. Monitoring peer lag prevents this scenario by alerting before the gap becomes unrecoverable through normal replication.

Should I set the same lag threshold for all streams?

Not necessarily. High-throughput streams will naturally experience more transient lag than low-throughput streams. Set thresholds based on each stream’s write rate and criticality. A low-volume audit stream might use a threshold of 100, while a high-volume telemetry stream might use 50,000. The key question is how many seconds of data the lag represents at the current write rate: a lag of 50,000 on a stream writing 10,000 messages per second is roughly five seconds of data.

Proactive monitoring for NATS peer lag critical with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.
