NATS Mirror Seen Critical: What It Means and How to Fix It

Severity: Critical
Category: Consistency
Applies to: JetStream
Check ID: JETSTREAM_018
Detection threshold: time since the mirror was last active exceeds the operator-defined io.nats.monitor.seen-critical threshold

The mirror stream has not received any data from its source within the operator-defined io.nats.monitor.seen-critical time window. Unlike JETSTREAM_015 (Mirror Last Seen Staleness), which uses a built-in 5-minute heuristic, this check fires when inactivity exceeds a threshold you explicitly set — meaning the mirror has been silent longer than you’ve determined is acceptable for this workload.

Why this matters

The “last seen” timestamp on a mirror stream indicates when the internal mirror consumer last received a message from the source. When this timestamp exceeds your critical threshold, the mirror has effectively stopped replicating. Unlike lag-based checks that measure how far behind the mirror is in message count, the seen-critical check measures how long the mirror has been disconnected from the data flow entirely.

This distinction matters for two reasons. First, a mirror can have zero lag and still be stale if the source was idle when the mirror disconnected — lag measures message count difference, not time since last activity. Second, a mirror with high lag is at least still receiving messages (slowly). A mirror that hasn’t been “seen” is receiving nothing at all.

For disaster recovery, the seen-critical threshold defines your maximum tolerable replication blackout. If you set io.nats.monitor.seen-critical to 120 seconds and the mirror hasn’t been seen in 3 minutes, you know the mirror is at least 3 minutes stale — and possibly much more, depending on source publish rate during that window. Every second beyond the threshold increases your potential data loss in a failover.

For compliance and SLA scenarios, the seen-critical threshold provides an auditable guarantee. You can configure it to match your RPO requirements and know that any violation of this threshold will generate an alert, regardless of other monitoring conditions.
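To make that exposure concrete: the worst-case number of unreplicated messages is roughly the staleness window multiplied by the source publish rate. A minimal Go sketch of the arithmetic, where both figures are assumed examples you would replace with your own metrics:

package main

import (
	"fmt"
	"time"
)

func main() {
	staleness := 3 * time.Minute // time since the mirror was last seen
	publishRate := 250.0         // source messages per second (assumed)

	// Worst case: every message published while the mirror was silent
	// is unreplicated until it reconnects and catches up.
	atRisk := publishRate * staleness.Seconds()
	fmt.Printf("up to ~%.0f messages unreplicated after %s\n", atRisk, staleness)
}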

Common causes

  • Network connectivity loss between source and mirror clusters. Gateway or leaf node connections between the clusters hosting the source and mirror have dropped. This is the most common cause — the mirror consumer cannot reach the source, so it receives nothing.

  • Source stream deleted or reconfigured. If the source stream is deleted, the mirror has nothing to consume from. Similarly, if the source is reconfigured in a way that invalidates the mirror’s internal consumer (e.g., changing subjects in a way that breaks the mirror filter), replication stops.

  • Authentication or authorization change. If account credentials, tokens, or permissions change on the source cluster, the mirror’s internal consumer may lose authorization to read from the source stream. The connection fails silently — no new messages arrive.

  • Internal mirror consumer failure. The NATS server’s internal consumer that drives mirror replication may crash or enter an unrecoverable error state. This is similar to JETSTREAM_015 but detected by the operator-defined time threshold rather than the built-in heuristic.

  • Source cluster unavailable. If the entire source cluster is down — maintenance, outage, or network isolation — all mirrors sourcing from it will stop receiving messages. This is expected during planned maintenance but should still trigger alerts to confirm mirrors resume afterward.

  • DNS resolution failure for cross-cluster connections. If the mirror uses DNS-based gateway or leaf node URLs and DNS resolution fails or returns stale results, the mirror consumer cannot establish a connection to the source.

How to diagnose

Confirm the inactivity duration

Terminal window
nats stream info MIRROR_STREAM_NAME

Check the Mirror section:

Mirror Information:
  Stream Name: SOURCE_STREAM
  Lag: 12,847
  Last Seen: 4m18s

A “Last Seen” exceeding your configured threshold confirms the check condition.

Check the configured threshold

Terminal window
nats stream info MIRROR_STREAM_NAME --json | jq '.config.metadata["io.nats.monitor.seen-critical"]'

This returns the threshold in seconds (e.g., "120" for 2 minutes).

Verify source cluster connectivity

Terminal window
# Check gateway connections to the source cluster
nats server report gateways
# Check leaf node connections if the mirror uses leaf nodes
nats server report leafnodes

If the source cluster doesn’t appear in gateway or leaf node reports, the connectivity path is broken.

Verify the source stream exists and is active

Terminal window
# From a client connected to the source cluster
nats stream info SOURCE_STREAM_NAME

If the source stream is deleted or the command fails, the mirror has nothing to replicate from.

Check for authorization errors

Terminal window
# Look for auth-related errors on the mirror's server
grep -i "authorization\|permission\|auth" /var/log/nats-server.log | tail -20

Programmatic detection

package main

import (
	"fmt"
	"log"
	"strconv"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatalf("jetstream: %v", err)
	}

	// Walk every stream and flag mirrors that have been silent longer
	// than their operator-defined seen-critical threshold.
	for name := range js.StreamNames() {
		info, err := js.StreamInfo(name)
		if err != nil {
			log.Printf("stream info %s: %v", name, err)
			continue
		}
		if info.Mirror == nil {
			continue // not a mirror stream
		}
		thresholdStr := info.Config.Metadata["io.nats.monitor.seen-critical"]
		if thresholdStr == "" {
			continue // no operator-defined threshold configured
		}
		thresholdSec, err := strconv.ParseFloat(thresholdStr, 64)
		if err != nil {
			log.Printf("invalid threshold on %s: %v", name, err)
			continue
		}
		threshold := time.Duration(thresholdSec * float64(time.Second))

		// Mirror.Active is the time since the mirror last heard
		// from its source.
		if info.Mirror.Active > threshold {
			fmt.Printf("CRITICAL: mirror %s last seen %s ago (threshold: %s)\n",
				name, info.Mirror.Active.Round(time.Second), threshold)
		}
	}
}
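Run this against the cluster hosting the mirrors (nats.DefaultURL assumes a local server; substitute your own URL and credentials). Note that Mirror.Active in nats.go is already a time.Duration measuring time since the mirror last heard from its source, so only the metadata value, stored as a string of seconds, needs parsing. This sketch reproduces the check's condition; it is not the exact implementation the check runs.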

How to fix it

Immediate: restore connectivity

Check and restore gateway/leaf node connections. If the connectivity path between clusters is broken, restoring it is the first priority:

Terminal window
# Verify gateways are configured and connected
nats server report gateways
# If a gateway is down, check server config and restart if needed
nats server config reload <server-id>

Step down the mirror’s leader. This forces a new leader election and recreation of the internal mirror consumer with a fresh connection to the source:

Terminal window
nats stream cluster step-down MIRROR_STREAM_NAME

After step-down, monitor for resumed activity:

Terminal window
# .mirror.active is reported in nanoseconds
watch -n 5 'nats stream info MIRROR_STREAM_NAME --json | jq "{lag: .mirror.lag, active_ns: .mirror.active}"'

Short-term: address the root cause

If the source stream was deleted or reconfigured: Recreate the source stream or update the mirror configuration to point to the correct source:

Terminal window
# Delete and recreate the mirror with the correct source
nats stream delete MIRROR_STREAM -f
nats stream add MIRROR_STREAM --mirror CORRECT_SOURCE_STREAM

If authentication changed: Update credentials on the mirror cluster to match the source cluster’s current authentication requirements. Reload the server configuration:

Terminal window
nats server config reload <server-id>

If DNS resolution is failing: Verify DNS records for the source cluster’s gateway URLs. Consider using IP-based gateway URLs for critical mirror configurations to eliminate DNS as a failure point.

Long-term: build resilience

Monitor gateway health independently. Don’t rely solely on mirror activity to detect cross-cluster connectivity issues. Monitor gateway connections, RTT, and throughput as separate signals.
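For example, the server's HTTP monitoring endpoint exposes gateway state at /gatewayz. A hedged sketch that checks for the expected outbound gateway, where the monitoring port (8222) and the source cluster name ("east") are both assumptions:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:8222/gatewayz")
	if err != nil {
		log.Fatalf("gatewayz: %v", err)
	}
	defer resp.Body.Close()

	// Decode only the field we need from the /gatewayz response.
	var gz struct {
		OutboundGateways map[string]json.RawMessage `json:"outbound_gateways"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&gz); err != nil {
		log.Fatalf("decode: %v", err)
	}

	if _, ok := gz.OutboundGateways["east"]; !ok {
		fmt.Println("ALERT: no outbound gateway to source cluster \"east\"")
		return
	}
	fmt.Println("outbound gateway to \"east\" is present")
}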

Set appropriate seen-critical thresholds. The threshold should be longer than the maximum expected gap between messages on the source stream. If the source publishes at least once per minute, a 120-second threshold is reasonable. If the source publishes in bursts with long idle periods, set the threshold accordingly or use lag-based monitoring instead.

Terminal window
nats stream edit MIRROR_STREAM --metadata "io.nats.monitor.seen-critical=120"
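To size the threshold empirically, one approach is to scan the source stream for the largest gap between consecutive message timestamps and set seen-critical comfortably above it. A sketch using an ordered ephemeral consumer, where the stream name and sample size are assumptions:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Ordered ephemeral consumer bound to the source stream; replays
	// stored messages in sequence without acknowledgements.
	sub, err := js.SubscribeSync("", nats.BindStream("SOURCE_STREAM"), nats.OrderedConsumer())
	if err != nil {
		log.Fatal(err)
	}
	defer sub.Unsubscribe()

	var prev time.Time
	var maxGap time.Duration
	for i := 0; i < 10000; i++ { // sample size is an assumption
		msg, err := sub.NextMsg(2 * time.Second)
		if err != nil {
			break // timeout: reached the end of the stored messages
		}
		meta, err := msg.Metadata()
		if err != nil {
			continue
		}
		if !prev.IsZero() && meta.Timestamp.Sub(prev) > maxGap {
			maxGap = meta.Timestamp.Sub(prev)
		}
		prev = meta.Timestamp
	}
	fmt.Printf("largest observed publish gap: %s (set seen-critical above this)\n", maxGap)
}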

Implement connectivity health checks between clusters. Use NATS service pings or a dedicated heartbeat subject that publishes at a known interval. If the heartbeat stops arriving at the mirror cluster, you know connectivity is broken before any mirror-specific check fires.
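As one way to implement this, a hedged sketch of a heartbeat publisher running against the source cluster; the subject hb.SOURCE_STREAM and the 30-second interval are assumptions:

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // connect to the source cluster
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Publish a heartbeat at a known interval. "hb.SOURCE_STREAM" is a
	// hypothetical subject assumed to be covered by the source stream's
	// subject filter; adjust both to your configuration.
	for range time.Tick(30 * time.Second) {
		payload := []byte(time.Now().UTC().Format(time.RFC3339))
		if _, err := js.Publish("hb.SOURCE_STREAM", payload); err != nil {
			log.Printf("heartbeat publish failed: %v", err)
		}
	}
}

Because the heartbeats are stored in the stream, account for them in retention; a dedicated heartbeat subject with a negligible payload keeps the overhead small, and the mirror replicates it like any other message.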

Document failover procedures. Since seen-critical violations indicate the mirror is stale, your runbook should include steps to assess data loss before failing over: compare the mirror’s last sequence with the source’s current sequence (if the source is still reachable from another path) to quantify the gap.
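A sketch of that comparison, assuming both clusters are reachable from the machine running it (the URLs and stream names are placeholders):

package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

// streamLastSeq returns the last sequence of a stream on the given cluster.
func streamLastSeq(url, stream string) uint64 {
	nc, err := nats.Connect(url)
	if err != nil {
		log.Fatalf("connect %s: %v", url, err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	info, err := js.StreamInfo(stream)
	if err != nil {
		log.Fatalf("stream info %s: %v", stream, err)
	}
	return info.State.LastSeq
}

func main() {
	// Mirrors preserve the source's sequence numbers, so the two last
	// sequences are directly comparable.
	srcSeq := streamLastSeq("nats://source-cluster:4222", "SOURCE_STREAM")
	mirSeq := streamLastSeq("nats://mirror-cluster:4222", "MIRROR_STREAM")
	fmt.Printf("source last seq %d, mirror last seq %d, gap %d messages\n",
		srcSeq, mirSeq, srcSeq-mirSeq)
}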

Frequently asked questions

How is this different from JETSTREAM_015 (Mirror Last Seen Staleness)?

JETSTREAM_015 uses a built-in 5-minute heuristic and only fires when the mirror reports zero lag (suggesting the internal consumer thinks it’s caught up but has actually stalled). JETSTREAM_018 uses your explicitly configured io.nats.monitor.seen-critical threshold and fires regardless of lag value. JETSTREAM_018 is the operator-defined version — you control when it’s critical.

What’s the relationship between seen-critical and lag-critical?

They measure different dimensions of mirror health. lag-critical (JETSTREAM_017) measures message count difference — how far behind the mirror is. seen-critical (this check) measures time since last activity — how long the mirror has been disconnected. A mirror can have low lag but high seen time (source went idle after mirror caught up, then mirror disconnected). It can also have high lag but low seen time (mirror is connected but slow). Both thresholds should be configured for comprehensive mirror monitoring.
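To illustrate the two dimensions together, a small hedged classifier; both threshold values are assumed examples, not defaults:

package main

import (
	"fmt"
	"time"
)

// classify names the four lag/seen combinations described above.
func classify(lag uint64, lastSeen time.Duration) string {
	const lagCritical = 10000            // JETSTREAM_017-style threshold (assumed)
	const seenCritical = 2 * time.Minute // JETSTREAM_018-style threshold (assumed)

	switch {
	case lastSeen > seenCritical && lag >= lagCritical:
		return "disconnected and far behind"
	case lastSeen > seenCritical:
		return "disconnected (low lag, but receiving nothing)"
	case lag >= lagCritical:
		return "connected but slow"
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(classify(12, 4*time.Minute))    // caught up, then went silent
	fmt.Println(classify(50000, 3*time.Second)) // connected but lagging
}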

What if the source stream is legitimately idle?

If the source stream has no new messages, the mirror won’t receive any either. However, the NATS internal mirror consumer maintains a heartbeat with the source even when no messages flow. The “last seen” timer resets on heartbeats, not just data messages. If seen-critical fires during a legitimately idle source, the heartbeat itself has stopped — indicating a connectivity issue, not a data flow issue.

Can I set different thresholds for different mirrors?

Yes. The io.nats.monitor.seen-critical threshold is per-stream metadata. Each mirror stream can have its own threshold based on its criticality and RPO requirements:

Terminal window
# Critical payment mirror: 60-second threshold
nats stream edit PAYMENTS_MIRROR --metadata "io.nats.monitor.seen-critical=60"
# Analytics mirror: 10-minute threshold
nats stream edit ANALYTICS_MIRROR --metadata "io.nats.monitor.seen-critical=600"

Should I set seen-critical on all mirror streams?

For any mirror that serves a disaster recovery or real-time read purpose, yes. For mirrors used for batch analytics or non-time-sensitive processing, the built-in JETSTREAM_015 heuristic may be sufficient. The operator-defined threshold is most valuable when you have a specific RPO or SLA to enforce.

Proactive monitoring for NATS mirror seen critical with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial