The mirror stream has fallen behind the source by more messages than the operator-defined critical threshold. Unlike the built-in lag heuristics that use percentage-based detection, this check uses the io.nats.monitor.lag-critical metadata value — an explicit threshold you set to define when mirror lag is unacceptable for your workload.
Mirror streams serve two primary purposes: disaster recovery and geographic read offloading. In both cases, the value of the mirror depends entirely on how closely it tracks the source. When lag exceeds your critical threshold, neither purpose is being met.
For disaster recovery, mirror lag directly translates to data loss in a failover. If the source stream becomes unavailable and you fail over to the mirror, every message in the lag window is lost. A mirror that’s 50,000 messages behind means 50,000 messages that your consumers will never see. If those messages represent financial transactions, order events, or audit records, the business impact can be severe.
For geographic read offloading, lag means consumers in the mirror’s region are reading stale data. In event-driven architectures, stale data cascades — a consumer that reads a message 30 seconds late produces an action 30 seconds late, which triggers downstream effects that are all 30+ seconds delayed. For real-time dashboards, alerting systems, or user-facing applications, this latency directly degrades the user experience.
The operator-defined threshold exists precisely because acceptable lag varies by workload. A logging pipeline might tolerate 100,000 messages of lag. A payment processing mirror might be critical at 100. By setting io.nats.monitor.lag-critical, you tell the system what matters for this specific stream, and this check enforces it.
Network bandwidth saturation between source and mirror. Mirror replication flows through gateway or leaf node connections. If these connections are saturated by other traffic — inter-cluster route messages, client traffic, or other mirror streams — the mirror’s replication throughput drops below the source’s publish rate.
High latency between source and mirror clusters. Cross-region mirrors inherently have higher RTT. Each message replication requires a round trip. At 100ms RTT, single-threaded replication caps at ~10 messages/second regardless of bandwidth. The internal mirror consumer uses pipelining to mitigate this, but high latency still reduces peak throughput.
Source publish rate exceeds mirror’s write throughput. The mirror server’s disk I/O may be slower than the source’s. If the source writes to NVMe storage and the mirror writes to spinning disks or a congested SAN, the mirror simply can’t persist messages as fast as the source produces them.
Mirror server resource contention. The server hosting the mirror stream may be overloaded with other streams, consumers, or Raft operations. CPU, memory, or I/O contention slows the mirror’s internal consumer processing.
Large message backlog after mirror restart. If the mirror was offline or stalled (JETSTREAM_015) for an extended period and then resumed, it must catch up on the accumulated backlog. During catchup, lag is expected but should decrease over time. If it plateaus or grows, there’s a sustained throughput mismatch.
```
nats stream info MIRROR_STREAM_NAME
```
The Mirror section of the output shows the current lag:
```
Mirror Information:
  Stream Name: SOURCE_STREAM
          Lag: 87,432
    Last Seen: 0.3s
```
Check the configured threshold:
```
nats stream info MIRROR_STREAM_NAME --json | jq '.config.metadata["io.nats.monitor.lag-critical"]'
```
If the lag exceeds this value, the check fires.
Take two snapshots of the mirror’s message count to calculate throughput:
```
# Snapshot 1
nats stream info MIRROR_STREAM --json | jq '{mirror_msgs: .state.messages, time: now}'

sleep 60

# Snapshot 2
nats stream info MIRROR_STREAM --json | jq '{mirror_msgs: .state.messages, time: now}'
```
Compare the mirror’s throughput (messages gained per second) with the source’s publish rate. If the mirror’s throughput is lower, lag will continue growing.
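If you prefer to script this comparison, the sketch below uses the nats.go JetStream API to sample both streams' stored message counts over a window and derive rough rates. It is a minimal sketch: the stream names and the 60-second window are assumptions, it assumes both streams are visible from the same connection, and the change in stored message count is only a proxy for rate (it undercounts if messages are expired or deleted during the window).

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// Assumed stream names; replace with your own.
const (
	sourceStream = "SOURCE_STREAM"
	mirrorStream = "MIRROR_STREAM"
)

// msgCount returns the current stored message count for a stream.
func msgCount(js nats.JetStreamContext, name string) uint64 {
	info, err := js.StreamInfo(name)
	if err != nil {
		log.Fatalf("stream info %s: %v", name, err)
	}
	return info.State.Msgs
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	window := 60 * time.Second
	src1, mir1 := msgCount(js, sourceStream), msgCount(js, mirrorStream)
	time.Sleep(window)
	src2, mir2 := msgCount(js, sourceStream), msgCount(js, mirrorStream)

	srcRate := float64(src2-src1) / window.Seconds()
	mirRate := float64(mir2-mir1) / window.Seconds()
	fmt.Printf("source publish rate: %.1f msg/s, mirror replication rate: %.1f msg/s\n", srcRate, mirRate)
	if mirRate < srcRate {
		fmt.Println("mirror throughput is below the source publish rate: lag will keep growing")
	}
}
```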
```
# Gateway RTT
nats server report gateways

# Leaf node RTT if applicable
nats server report leafnodes
```
RTT above 50ms between clusters significantly impacts single-stream mirror throughput. Look for packet loss as well — even 0.1% packet loss can halve TCP throughput.
```
# JetStream resource usage on the mirror's server
nats server report jetstream

# CPU and memory via server stats
nats server report connections --sort out-bytes
```
To check this condition programmatically, list every stream, skip anything that is not a mirror, and compare each mirror's lag against its io.nats.monitor.lag-critical metadata:
```go
package main

import (
	"fmt"
	"log"
	"strconv"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatalf("jetstream: %v", err)
	}

	// Walk every stream, skip non-mirrors, and compare each mirror's lag
	// against its operator-defined io.nats.monitor.lag-critical threshold.
	for name := range js.StreamNames() {
		info, err := js.StreamInfo(name)
		if err != nil {
			log.Printf("stream info %s: %v", name, err)
			continue
		}
		if info.Mirror == nil {
			continue
		}
		thresholdStr := info.Config.Metadata["io.nats.monitor.lag-critical"]
		if thresholdStr == "" {
			continue
		}
		threshold, err := strconv.ParseUint(thresholdStr, 10, 64)
		if err != nil {
			log.Printf("invalid threshold on %s: %v", name, err)
			continue
		}
		if info.Mirror.Lag > threshold {
			fmt.Printf("CRITICAL: mirror %s lag=%d exceeds threshold=%d\n",
				name, info.Mirror.Lag, threshold)
		}
	}
}
```
Check if the lag is decreasing. If the mirror is actively catching up (lag decreasing over time), it may resolve itself. Monitor the trend:
```
watch -n 5 'nats stream info MIRROR_STREAM --json | jq .mirror.lag'
```
If lag is stable or growing, the mirror cannot keep up with the source’s publish rate and needs intervention.
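To track the trend programmatically rather than watching the CLI, a sketch like the following samples the mirror's lag a few times and reports whether it is shrinking. The stream name, sample count, and interval are assumptions; adjust them for your environment.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	const (
		mirrorStream = "MIRROR_STREAM" // assumed name
		samples      = 6
		interval     = 10 * time.Second
	)

	var lags []uint64
	for i := 0; i < samples; i++ {
		info, err := js.StreamInfo(mirrorStream)
		if err != nil {
			log.Fatalf("stream info: %v", err)
		}
		if info.Mirror == nil {
			log.Fatalf("%s is not a mirror", mirrorStream)
		}
		lags = append(lags, info.Mirror.Lag)
		fmt.Printf("sample %d: lag=%d\n", i+1, info.Mirror.Lag)
		if i < samples-1 {
			time.Sleep(interval)
		}
	}

	if lags[len(lags)-1] < lags[0] {
		fmt.Println("lag is decreasing: the mirror is catching up")
	} else {
		fmt.Println("lag is stable or growing: the mirror needs intervention")
	}
}
```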
Reduce source publish rate temporarily. If possible, throttle publishers to the source stream to let the mirror catch up. This is a short-term measure for critical systems where mirror freshness matters more than source throughput.
Step down the mirror’s leader. A leader transition can reset internal state and re-establish the mirror consumer with fresh connections:
```
nats stream cluster step-down MIRROR_STREAM_NAME
```
Move the mirror to a server with better I/O. If disk write speed is the bottleneck, migrate the mirror to a server with faster storage:
```
# Remove the mirror from the slow server
nats stream cluster peer-remove MIRROR_STREAM SLOW_SERVER

# The cluster will automatically place a new replica on an available server
# Or recreate to force specific placement
```
Reduce competing load on the mirror server. If the server hosts many streams, move some to other servers to free I/O and CPU for the mirror.
Optimize network path. For cross-region mirrors, verify the gateway connection uses the most direct network path. Check for MTU mismatches, TCP window sizing, or intermediate firewalls that may throttle throughput.
Right-size the lag-critical threshold. The threshold should reflect your actual RPO (Recovery Point Objective). If you can tolerate 60 seconds of data loss, set the threshold to your source’s peak publish rate × 60. Too-tight thresholds cause alert fatigue; too-loose thresholds miss genuine problems.
```
# Set or update the lag-critical threshold
nats stream edit MIRROR_STREAM --metadata "io.nats.monitor.lag-critical=50000"
```
Use subject filtering to reduce mirror volume. If the mirror doesn’t need all subjects from the source, configure a subject filter on the mirror. There is no --mirror-filter flag on nats stream add; specify the filter via the JSON form of --config, or interactively via nats stream add MIRROR_STREAM -i and answer the mirror-filter prompt:
```
# JSON form: a stream config with mirror.filter_subject set
cat <<'EOF' > /tmp/mirror.json
{
  "name": "MIRROR_STREAM",
  "mirror": {
    "name": "SOURCE_STREAM",
    "filter_subject": "orders.>"
  },
  "storage": "file",
  "num_replicas": 3
}
EOF
nats stream add MIRROR_STREAM --config /tmp/mirror.json
```
Monitor mirror throughput as a capacity metric. Track the ratio of mirror replication rate to source publish rate. If the ratio approaches 1.0, any spike in source traffic will cause lag. Maintain headroom by ensuring the mirror can sustain at least 2x the average source publish rate.
Set the io.nats.monitor.lag-critical value as stream metadata when creating or editing the stream:
```
nats stream edit MIRROR_STREAM --metadata "io.nats.monitor.lag-critical=10000"
```
The value is the maximum acceptable message lag (number of messages). Choose a value based on your RPO and the source’s publish rate.
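The same setting can be applied from code. The following is a minimal sketch using nats.go, assuming a client and server version that support stream metadata (2.10+); the stream name and threshold value are placeholders.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the current config so the update only touches metadata.
	info, err := js.StreamInfo("MIRROR_STREAM") // assumed stream name
	if err != nil {
		log.Fatal(err)
	}
	cfg := info.Config
	if cfg.Metadata == nil {
		cfg.Metadata = map[string]string{}
	}
	cfg.Metadata["io.nats.monitor.lag-critical"] = "10000"

	if _, err := js.UpdateStream(&cfg); err != nil {
		log.Fatal(err)
	}
	log.Println("lag-critical threshold set on MIRROR_STREAM")
}
```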
If io.nats.monitor.lag-critical is not set in the stream’s metadata, this check (JETSTREAM_017) does not fire. You’ll still get JETSTREAM_001 (Stream Replica Lag) for percentage-based detection, but it won’t reflect your specific operational requirements.
Mirror lag does not affect publishers to the source stream. Mirror replication is asynchronous and does not apply backpressure to the source; the source continues accepting messages at full speed regardless of mirror state.
You can combine this check with warning-level monitoring: use io.nats.monitor.lag-critical for the critical threshold (this check) and standard monitoring for warning-level lag detection. Synadia Insights evaluates both the operator-defined critical threshold and built-in heuristics.
Mirror lag is the message count difference between a mirror stream and its source — typically across clusters. Replica lag (JETSTREAM_001) is the difference between a stream’s Raft leader and its in-cluster followers. Mirrors use an internal consumer for async replication; replicas use Raft consensus for synchronous replication. Different mechanisms, different failure modes.