
NATS Mirror Lag Critical: What It Means and How to Fix It

Severity: Critical
Category: Consistency
Applies to: JetStream
Check ID: JETSTREAM_017
Detection threshold: mirror lag exceeds the operator-defined io.nats.monitor.lag-critical threshold

The mirror stream has fallen behind the source by more messages than the operator-defined critical threshold. Unlike the built-in lag heuristics that use percentage-based detection, this check uses the io.nats.monitor.lag-critical metadata value — an explicit threshold you set to define when mirror lag is unacceptable for your workload.

Why this matters

Mirror streams serve two primary purposes: disaster recovery and geographic read offloading. In both cases, the value of the mirror depends entirely on how closely it tracks the source. When lag exceeds your critical threshold, neither purpose is being met.

For disaster recovery, mirror lag directly translates to data loss in a failover. If the source stream becomes unavailable and you fail over to the mirror, every message in the lag window is lost. A mirror that’s 50,000 messages behind means 50,000 messages that your consumers will never see. If those messages represent financial transactions, order events, or audit records, the business impact can be severe.

For geographic read offloading, lag means consumers in the mirror’s region are reading stale data. In event-driven architectures, stale data cascades — a consumer that reads a message 30 seconds late produces an action 30 seconds late, which triggers downstream effects that are all 30+ seconds delayed. For real-time dashboards, alerting systems, or user-facing applications, this latency directly degrades the user experience.

The operator-defined threshold exists precisely because acceptable lag varies by workload. A logging pipeline might tolerate 100,000 messages of lag. A payment processing mirror might be critical at 100. By setting io.nats.monitor.lag-critical, you tell the system what matters for this specific stream, and this check enforces it.

Common causes

  • Network bandwidth saturation between source and mirror. Mirror replication flows through gateway or leaf node connections. If these connections are saturated by other traffic — inter-cluster route messages, client traffic, or other mirror streams — the mirror’s replication throughput drops below the source’s publish rate.

  • High latency between source and mirror clusters. Cross-region mirrors inherently have higher RTT. Each message replication requires a round trip. At 100ms RTT, single-threaded replication caps at ~10 messages/second regardless of bandwidth. The internal mirror consumer uses pipelining to mitigate this, but high latency still reduces peak throughput.

  • Source publish rate exceeds mirror’s write throughput. The mirror server’s disk I/O may be slower than the source’s. If the source writes to NVMe storage and the mirror writes to spinning disks or a congested SAN, the mirror simply can’t persist messages as fast as the source produces them.

  • Mirror server resource contention. The server hosting the mirror stream may be overloaded with other streams, consumers, or Raft operations. CPU, memory, or I/O contention slows the mirror’s internal consumer processing.

  • Large message backlog after mirror restart. If the mirror was offline or stalled (JETSTREAM_015) for an extended period and then resumed, it must catch up on the accumulated backlog. During catchup, lag is expected but should decrease over time. If it plateaus or grows, there’s a sustained throughput mismatch.

How to diagnose

Confirm the lag and threshold

Terminal window
nats stream info MIRROR_STREAM_NAME

The Mirror section shows current lag:

Mirror Information:
  Stream Name: SOURCE_STREAM
  Lag: 87,432
  Last Seen: 0.3s

Check the configured threshold:

Terminal window
nats stream info MIRROR_STREAM_NAME --json | jq '.config.metadata["io.nats.monitor.lag-critical"]'

If the lag exceeds this value, the check fires.

Measure replication throughput

Take two snapshots of the mirror’s message count to calculate throughput:

Terminal window
# Snapshot 1
nats stream info MIRROR_STREAM --json | jq '{mirror_msgs: .state.messages, time: now}'
sleep 60
# Snapshot 2
nats stream info MIRROR_STREAM --json | jq '{mirror_msgs: .state.messages, time: now}'

Compare the mirror’s throughput (messages gained per second) with the source’s publish rate. If the mirror’s throughput is lower, lag will continue growing.

Check network health between clusters

Terminal window
# Gateway RTT
nats server report gateways
# Leaf node RTT if applicable
nats server report leafnodes

RTT above 50ms between clusters significantly impacts single-stream mirror throughput. Look for packet loss as well — even 0.1% packet loss can halve TCP throughput.

Check mirror server resource utilization

Terminal window
# JetStream resource usage on the mirror's server
nats server report jetstream
# Server overview, including CPU and memory
nats server list
# Connection throughput, sorted by outbound bytes
nats server report connections --sort out-bytes

Programmatic detection

package main

import (
	"fmt"
	"log"
	"strconv"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatalf("jetstream: %v", err)
	}

	for name := range js.StreamNames() {
		info, err := js.StreamInfo(name)
		if err != nil {
			log.Printf("stream info %s: %v", name, err)
			continue
		}
		// Only mirror streams carry mirror lag.
		if info.Mirror == nil {
			continue
		}
		// Skip streams without an operator-defined threshold.
		thresholdStr := info.Config.Metadata["io.nats.monitor.lag-critical"]
		if thresholdStr == "" {
			continue
		}
		threshold, err := strconv.ParseUint(thresholdStr, 10, 64)
		if err != nil {
			log.Printf("invalid threshold on %s: %v", name, err)
			continue
		}
		if info.Mirror.Lag > threshold {
			fmt.Printf("CRITICAL: mirror %s lag=%d exceeds threshold=%d\n",
				name, info.Mirror.Lag, threshold)
		}
	}
}

How to fix it

Immediate: reduce the gap

Check if the lag is decreasing. If the mirror is actively catching up (lag decreasing over time), it may resolve itself. Monitor the trend:

Terminal window
watch -n 5 'nats stream info MIRROR_STREAM --json | jq .mirror.lag'

If lag is stable or growing, the mirror cannot keep up with the source’s publish rate and needs intervention.
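Classifying the trend from a few polled samples can be sketched as follows — lagTrend is a hypothetical helper, and the samples would come from polling .mirror.lag as shown above:

```go
package main

import "fmt"

// lagTrend classifies a series of lag samples: "catching up" when the
// last sample is below the first, "growing" when it is above,
// otherwise "plateaued".
func lagTrend(samples []uint64) string {
	if len(samples) < 2 {
		return "insufficient data"
	}
	first, last := samples[0], samples[len(samples)-1]
	switch {
	case last < first:
		return "catching up"
	case last > first:
		return "growing"
	default:
		return "plateaued"
	}
}

func main() {
	fmt.Println(lagTrend([]uint64{87432, 81000, 74500, 69200})) // catching up
	fmt.Println(lagTrend([]uint64{87432, 90100, 95800}))        // growing
}
```

A "catching up" result means waiting is likely enough; "growing" or "plateaued" means the mirror needs intervention.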

Reduce source publish rate temporarily. If possible, throttle publishers to the source stream to let the mirror catch up. This is a short-term measure for critical systems where mirror freshness matters more than source throughput.

Step down the mirror’s leader. A leader transition can reset internal state and re-establish the mirror consumer with fresh connections:

Terminal window
nats stream cluster step-down MIRROR_STREAM_NAME

Short-term: increase mirror throughput

Move the mirror to a server with better I/O. If disk write speed is the bottleneck, migrate the mirror to a server with faster storage:

Terminal window
# Remove the mirror from the slow server
nats stream cluster peer-remove MIRROR_STREAM SLOW_SERVER
# The cluster will automatically place a new replica on an available server
# Or recreate to force specific placement

Reduce competing load on the mirror server. If the server hosts many streams, move some to other servers to free I/O and CPU for the mirror.

Optimize network path. For cross-region mirrors, verify the gateway connection uses the most direct network path. Check for MTU mismatches, TCP window sizing, or intermediate firewalls that may throttle throughput.

Long-term: architect for sustainable replication

Right-size the lag-critical threshold. The threshold should reflect your actual RPO (Recovery Point Objective). If you can tolerate 60 seconds of data loss, set the threshold to your source’s peak publish rate × 60. Too-tight thresholds cause alert fatigue; too-loose thresholds miss genuine problems.

Terminal window
# Set or update the lag-critical threshold
nats stream edit MIRROR_STREAM --metadata "io.nats.monitor.lag-critical=50000"
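The RPO arithmetic above can be sketched directly (the helper name and example numbers are illustrative, not measured values):

```go
package main

import "fmt"

// lagCriticalThreshold derives a lag-critical value from the source's
// peak publish rate and the tolerable data-loss window (RPO).
func lagCriticalThreshold(peakMsgsPerSec, rpoSeconds float64) uint64 {
	return uint64(peakMsgsPerSec * rpoSeconds)
}

func main() {
	// e.g. a source peaking at 2,000 msgs/s with a 60 s RPO:
	fmt.Println(lagCriticalThreshold(2000, 60)) // 120000
}
```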

Use subject filtering to reduce mirror volume. If the mirror doesn’t need all subjects from the source, configure a subject filter on the mirror. There is no --mirror-filter flag on nats stream add; specify the filter via the JSON form of --config, or interactively via nats stream add MIRROR_STREAM -i and answer the mirror-filter prompt:

Terminal window
# JSON form: a stream config with mirror.filter_subject set
cat <<'EOF' > /tmp/mirror.json
{
  "name": "MIRROR_STREAM",
  "mirror": {
    "name": "SOURCE_STREAM",
    "filter_subject": "orders.>"
  },
  "storage": "file",
  "num_replicas": 3
}
EOF
nats stream add MIRROR_STREAM --config /tmp/mirror.json

Monitor mirror throughput as a capacity metric. Track the ratio of mirror replication rate to source publish rate. If the ratio approaches 1.0, any spike in source traffic will cause lag. Maintain headroom by ensuring the mirror can sustain at least 2x the average source publish rate.
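The headroom rule above reduces to a single comparison. A minimal sketch (hypothetical helper name; the rates would come from throughput measurements like those in the diagnosis section):

```go
package main

import "fmt"

// headroomOK reports whether the mirror's sustained replication rate
// leaves the recommended 2x margin over the source's average publish rate.
func headroomOK(mirrorMaxRate, sourceAvgRate float64) bool {
	return mirrorMaxRate >= 2*sourceAvgRate
}

func main() {
	fmt.Println(headroomOK(5000, 2000)) // true: 2.5x headroom
	fmt.Println(headroomOK(2500, 2000)) // false: only 1.25x
}
```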

Frequently asked questions

How do I set the io.nats.monitor.lag-critical threshold?

Set it as stream metadata when creating or editing the stream:

Terminal window
nats stream edit MIRROR_STREAM --metadata "io.nats.monitor.lag-critical=10000"

The value is the maximum acceptable message lag (number of messages). Choose a value based on your RPO and the source’s publish rate.

What if I don’t set a threshold?

If io.nats.monitor.lag-critical is not set in the stream’s metadata, this check (JETSTREAM_017) does not fire. You’ll still get JETSTREAM_001 (Stream Replica Lag) for percentage-based detection, but it won’t reflect your specific operational requirements.

Does mirror lag affect the source stream?

No. Mirror replication is asynchronous and does not apply backpressure to the source. Publishers to the source stream are unaffected by mirror lag. The source continues accepting messages at full speed regardless of mirror state.

Can I have multiple thresholds (warning and critical)?

Yes. Use io.nats.monitor.lag-critical for the critical threshold (this check) and standard monitoring for warning-level lag detection. Synadia Insights evaluates both the operator-defined critical threshold and built-in heuristics.

What’s the difference between mirror lag and replica lag?

Mirror lag is the message count difference between a mirror stream and its source — typically across clusters. Replica lag (JETSTREAM_001) is the difference between a stream’s Raft leader and its in-cluster followers. Mirrors use an internal consumer for async replication; replicas use Raft consensus for synchronous replication. Different mechanisms, different failure modes.

Proactive monitoring for NATS mirror lag critical with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.
