NATS Stream Replica Lag: What It Means and How to Fix It

Severity: Warning
Category: Consistency
Applies to: JetStream
Check ID: JETSTREAM_001
Detection threshold: replica > 10% behind leader sequence

Stream replica lag means a JetStream stream follower’s last sequence number is more than 10% behind the leader’s. The lagging replica has stale data — if a leader election happens while it’s behind, you lose the messages it hasn’t replicated yet.

Why this matters

JetStream uses Raft consensus for stream replication. In an R3 stream, the leader processes all writes and replicates them to two followers. Under normal conditions, followers stay within a few operations of the leader. When a follower falls significantly behind, the replication guarantee weakens.

The most dangerous scenario is a leader failure during replica lag. Raft elects a new leader from the available replicas. If the only available candidate is the lagging replica, it becomes leader with a sequence number lower than the previous leader’s. Every message between the new leader’s last sequence and the old leader’s last sequence is lost — permanently. The stream reports a lower message count, consumers that already processed those messages may see gaps, and any client relying on sequence numbers for ordering or deduplication sees inconsistencies.

Even without a leader election, replica lag degrades the durability guarantees you configured the stream for. An R3 stream is supposed to survive any single node failure without data loss. A lagging replica effectively reduces your replication factor: if the lagging replica is 10,000 messages behind and the leader fails, you’re down to the one up-to-date replica — R1 in practice, R3 in name only. The window of vulnerability scales with the lag.

Common causes

  • Slow storage on the follower. The follower’s disk can’t write replicated entries fast enough to keep up with the leader’s write rate. This is especially common in mixed clusters where some nodes have SSDs and others have HDDs, or where a node’s disk is shared with other I/O-heavy workloads.

  • Network latency between leader and follower. Raft replication happens over the cluster’s internal routes. If the network path between the leader and a specific follower has high latency or packet loss, replication throughput drops. Cross-region replicas are particularly susceptible.

  • CPU or memory pressure on the follower. The follower’s server is under resource contention — high CPU from other streams, memory pressure triggering OS swap, or garbage collection pauses. The Raft apply loop competes with everything else on the server for CPU time.

  • High write rate exceeding replication bandwidth. The stream’s publish rate is so high that the Raft replication pipeline saturates. The leader can accept and commit writes faster than followers can receive and apply them. This typically manifests as all followers lagging, not just one.

  • Follower recovering after restart. When a server restarts, its stream replicas need to catch up from where they left off. If the stream received many messages during the downtime, the follower works through a backlog. This is expected and temporary — the check fires during the catch-up window.

  • Large message payloads. Streams with large messages (hundreds of KB to MB per message) amplify the replication bandwidth requirement. A stream doing 1,000 msg/s at 1KB is 1MB/s of replication traffic. The same rate at 100KB per message is 100MB/s — enough to saturate a 1Gbps link.

How to diagnose

Check stream replica status

Terminal window
nats stream info <stream_name>

Look at the Cluster Information section. Each replica shows:

  • Name — the server hosting the replica
  • Current/Not Current — whether the replica is up to date
  • Lag — number of operations behind the leader
  • Active — how recently the replica communicated with the leader

A replica marked “not current” with a non-zero lag is the one this check is flagging.

Compare lag across all streams

Terminal window
nats stream report

This shows all streams with their cluster information, including replica lag. Sort by lag to find the worst offenders. If multiple streams on the same server show lag, the problem is at the server level (disk, CPU, network) rather than stream-specific.

Check the follower server’s resource usage

Terminal window
nats server report jetstream

Look at the lagging server’s CPU, memory, and storage utilization. High values indicate resource contention. Also check:

Terminal window
# Server-level I/O and system stats
curl -s http://localhost:8222/varz | jq '{cpu, mem, slow_consumers}'

Watch for leader elections

Frequent leader elections often accompany replica lag:

Terminal window
nats event --js-advisory

Leader step-down advisories on the affected stream suggest the cluster is actively trying to find a healthy leader, which can compound the lag problem.

Check network latency between cluster peers

Terminal window
nats server list

Route connections with high RTT between the leader’s server and the lagging follower’s server point to a network-level cause.

How to fix it

Immediate: reduce write pressure

If the stream is under heavy write load and lag is growing, temporarily reduce the publish rate where possible. This gives the followers time to catch up:

Terminal window
# Check current stream state
nats stream info <stream_name>
# If lag is growing, monitor the catch-up
watch -n 5 'nats stream info <stream_name> 2>/dev/null | grep -A5 "Cluster"'

Replicas catch up automatically through Raft replication. If a specific follower is severely behind and auto-catchup has stalled, you can force a leader step-down to trigger a fresh Raft snapshot sync:

Terminal window
nats stream cluster step-down <stream_name>

This forces a new leader election. The previously lagging replica will receive a snapshot from the new leader, which can be faster than replaying individual operations.

As a last resort, if a replica remains persistently lagged despite adequate server resources, use nats stream cluster peer-remove to remove it and then re-add it to force a full snapshot sync. This should only be done when auto-catchup has demonstrably stalled, as it temporarily reduces the effective replication factor.

Short-term: fix resource bottlenecks

Upgrade storage to SSDs. JetStream replication performance is directly tied to disk write latency. If the lagging follower is on an HDD or a slow cloud volume, move it to SSD-backed storage. This is the single most impactful change for persistent replica lag.

Reduce I/O contention. If the server hosts many streams, the aggregate I/O from all streams competes for the same disk. Consider redistributing streams across servers:

Terminal window
# Check per-server stream distribution
nats server report jetstream

Address network issues. For cross-region replicas, ensure adequate bandwidth between sites. For same-datacenter lag, check for network congestion, misconfigured MTU, or failing NICs.

Long-term: design for replication capacity

Size servers for replication overhead. A server hosting stream replicas performs disk writes for every stream it participates in — its own leader writes plus replicated writes from streams where it's a follower. Its total disk I/O is the aggregate write rate of all hosted streams, which can be a multiple of any single stream's rate. Size disk throughput and network bandwidth accordingly.

Use placement tags to control replica location. Ensure replicas land on servers with equivalent hardware:

// Go client — create stream with placement constraints
_, err := js.AddStream(&nats.StreamConfig{
	Name:     "ORDERS",
	Subjects: []string{"orders.>"},
	Replicas: 3,
	Placement: &nats.Placement{
		Tags: []string{"ssd", "region-us-east"},
	},
})

Monitor lag continuously. Don’t wait for 10% lag to investigate. Track replica lag as a time-series metric and alert on sustained non-zero lag.

Synadia Insights evaluates replica lag every collection epoch across all streams in your deployment, catching lag before it reaches the threshold where failover data loss becomes likely.

Frequently asked questions

How much replica lag is acceptable?

A small, transient lag (tens of operations) during write bursts is normal and not a concern. The check fires at 10% of the leader’s sequence — that’s the point where failover data loss risk becomes material. For streams with strict durability requirements, any sustained lag above zero warrants investigation. The acceptable threshold depends on your tolerance for potential data loss during an unplanned leader election.

Does replica lag affect read performance?

NATS JetStream clients read from the stream leader, not from followers. Replica lag does not affect read latency or throughput for consumers. The lag only matters for durability: it determines how much data could be lost if the leader fails and a lagging replica takes over.

Will a lagging replica catch up on its own?

Usually, yes — if the cause is transient (server restart, temporary network issue, brief I/O spike). The Raft protocol automatically replicates missing entries to the follower. If the follower has been offline long enough that the leader’s Raft log no longer contains the missing entries, the leader sends a full snapshot instead. Persistent lag that doesn’t resolve indicates an ongoing resource constraint that needs manual intervention.

Can I remove and re-add a replica to fix persistent lag?

Yes, but only as a last resort. Use nats stream cluster peer-remove followed by re-adding to force a full snapshot sync from the leader. This works if the lag was caused by corrupted local state or a stalled catchup, but it puts additional load on the leader during the snapshot transfer and temporarily reduces the effective replication factor. For large streams, this can take significant time. Fix the lagging replica’s resource constraints first — auto-catchup through Raft replication resolves most lag scenarios without manual intervention.

Does increasing the replica count help with lag?

No. Increasing from R3 to R5 adds more followers that all need to keep up with the leader. If the bottleneck is disk I/O or network bandwidth on the lagging server, adding replicas makes the problem worse by increasing total replication traffic. Fix the lagging replica’s resource constraints instead.

Proactive monitoring for NATS stream replica lag with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial