
NATS Mirror Last Seen Staleness: What It Means and How to Fix It

Severity: Warning
Category: Consistency
Applies to: JetStream
Check ID: JETSTREAM_015
Detection threshold: mirror reports zero lag but no activity for > 5 minutes while source is active

A mirror stream replicates data from a source stream using an internal consumer. When this check fires, the mirror reports zero lag — suggesting it’s caught up — but hasn’t received any activity for over 5 minutes while the source stream continues accepting messages. The internal mirror consumer has stalled, and the mirror is silently falling behind.
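For context, a mirror is declared by pointing a new stream's mirror configuration at the source. The sketch below shows roughly what that looks like with the nats.go client, using placeholder stream names and assuming the source stream already exists:

package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // The mirror stream defines no subjects of its own; the server creates an
    // internal consumer on SOURCE_STREAM and replicates its messages.
    // Stream names here are placeholders.
    _, err = js.AddStream(&nats.StreamConfig{
        Name:   "MIRROR_STREAM",
        Mirror: &nats.StreamSource{Name: "SOURCE_STREAM"},
    })
    if err != nil {
        log.Fatal(err)
    }
}

It is that internal consumer, created and maintained by the server, that this check is watching.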

Why this matters

Mirror streams are used for cross-cluster replication, disaster recovery, and read offloading. Operators configure mirrors expecting the destination to stay synchronized with the source. The “zero lag” status creates a false sense of health — monitoring that only checks lag values will miss this failure entirely.

The stalled consumer means new messages published to the source are not flowing to the mirror. The divergence grows with every message the source receives. If the mirror is serving read traffic (e.g., consumers in a secondary region), those consumers read increasingly stale data without any indication that the mirror has stopped updating.

For disaster recovery scenarios, the impact is severe. If the source becomes unavailable and operators fail over to the mirror, the mirror’s data is missing everything published since the stall began. Depending on how long the stall persisted, this could be minutes, hours, or days of data loss — all while dashboards showed “zero lag.”

The subtlety of this failure mode is what makes it dangerous. A mirror with non-zero lag is obviously behind and triggers standard lag alerts. A mirror with zero lag and zero activity looks healthy to most monitoring systems. Only by comparing the mirror’s last activity timestamp with the source’s ongoing publish activity can you detect the stall.

Common causes

  • Internal mirror consumer crashed without recovery. The NATS server creates an internal consumer to drive mirror replication. If this consumer encounters an unrecoverable error — such as a corrupt message in the source, a permission change, or an internal panic — it may stop processing without being recreated.

  • Network partition between mirror and source. If the connection to the source stream (especially cross-cluster via gateways or leaf nodes) drops and doesn’t recover cleanly, the mirror consumer can enter a state where it believes it’s caught up to the last known position but has lost the connection to receive new messages.

  • Source stream subject filter mismatch. If the mirror is configured with a subject filter and the source starts publishing to subjects outside that filter, the mirror legitimately receives no new messages — but this looks identical to a stall. The check compares activity timestamps rather than message counts to catch genuine stalls.

  • Leader transition on the mirror stream. After a leader election on the mirror’s Raft group, the new leader must re-establish the mirror consumer. If this re-establishment fails silently, the mirror stops receiving updates but reports the pre-election lag (zero if it was caught up).

  • Resource exhaustion on the mirror server. If the server hosting the mirror’s leader is under memory or CPU pressure, the internal mirror consumer may be deprioritized or blocked by other internal operations (compaction, snapshot, catchup of other streams).

How to diagnose

Confirm the stall

nats stream info MIRROR_STREAM_NAME

In the Mirror section, look for:

Mirror Information:
  Stream Name: SOURCE_STREAM
  Lag: 0
  Last Seen: 8m22s

Zero lag combined with a “Last Seen” value greater than 5 minutes while the source is actively receiving messages confirms the stall.

Verify the source is active

nats stream info SOURCE_STREAM_NAME

Check that the source stream’s last_seq is advancing:

# Run twice with a gap to see if messages are arriving
nats stream info SOURCE_STREAM_NAME --json | jq '.state.last_seq'
sleep 10
nats stream info SOURCE_STREAM_NAME --json | jq '.state.last_seq'

If the source sequence is advancing but the mirror’s “Last Seen” keeps growing, the mirror consumer is definitely stalled.

Check for connection issues to the source

# If the mirror is cross-cluster, check gateway connections
nats server report gateways
# Check leaf node connections if applicable
nats server report leafnodes

Inspect server logs on the mirror’s leader

# Look for mirror consumer errors
grep -i "mirror\|internal consumer" /var/log/nats-server.log | tail -30

Common error patterns:

[ERR] Mirror consumer for stream MIRROR_STREAM error: ...
[WRN] Failed to recreate mirror consumer for MIRROR_STREAM

Programmatic detection

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    for name := range js.StreamNames() {
        info, err := js.StreamInfo(name)
        if err != nil {
            log.Printf("error: %v", err)
            continue
        }
        if info.Mirror == nil {
            continue
        }
        // Mirror.Active is the time since the mirror last heard from its
        // internal consumer (the "Last Seen" value in `nats stream info`).
        staleness := info.Mirror.Active
        if info.Mirror.Lag == 0 && staleness > 5*time.Minute {
            fmt.Printf("STALE MIRROR: %s - lag=0 but last seen %s ago\n",
                name, staleness.Round(time.Second))
        }
    }
}
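As described above, the check fires only while the source is still receiving messages. To fold that condition into the loop, a helper can compare the source stream's last-message timestamp against the same window. The sketch below is a hypothetical addition that reuses the imports and JetStream context from the program above, and it assumes the source stream is visible from the same connection (not the case for mirrors that reach across accounts or external domains):

// sourceStillActive is an illustrative helper: it reports whether the source
// stream (info.Mirror.Name in the loop above) has stored a message within the
// given window, based on the stream state's last-message timestamp.
func sourceStillActive(js nats.JetStreamContext, sourceName string, window time.Duration) (bool, error) {
    info, err := js.StreamInfo(sourceName)
    if err != nil {
        return false, err
    }
    return time.Since(info.State.LastTime) < window, nil
}

Flag a mirror only when its own activity age exceeds five minutes and sourceStillActive returns true for info.Mirror.Name; an idle source then reads as expected quiet rather than a stall.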

How to fix it

Immediate: restart the mirror consumer

Perform a leader step-down on the mirror stream. This forces the Raft group to elect a new leader, which recreates the internal mirror consumer:

nats stream cluster step-down MIRROR_STREAM_NAME

After the step-down, verify the mirror resumes:

# Wait a few seconds for the new leader to establish the mirror consumer
sleep 10
nats stream info MIRROR_STREAM_NAME

The “Last Seen” should now show a recent timestamp (< 1s), and the lag may temporarily spike as the mirror catches up on missed messages.
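If this recovery is scripted, the same verification can be automated by polling the mirror's info until its activity age drops again. A minimal sketch, with an arbitrary two-second poll interval and a caller-supplied deadline, reusing the nats.go types from the detection example:

// waitForMirrorResume is an illustrative helper that polls the mirror stream
// until its reported activity age ("Last Seen") drops below maxAge, or the
// deadline passes without recovery.
func waitForMirrorResume(js nats.JetStreamContext, mirrorName string, maxAge, deadline time.Duration) error {
    start := time.Now()
    for time.Since(start) < deadline {
        info, err := js.StreamInfo(mirrorName)
        if err != nil {
            return err
        }
        if info.Mirror != nil && info.Mirror.Active < maxAge {
            return nil // the internal mirror consumer is receiving again
        }
        time.Sleep(2 * time.Second)
    }
    return fmt.Errorf("mirror %s still stale after %s", mirrorName, deadline)
}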

If step-down doesn’t resolve it, the mirror consumer may need a full restart:

# Force recreation by editing the stream (no-op change triggers consumer reset)
nats stream edit MIRROR_STREAM_NAME --description "trigger mirror reset"
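Where this fallback needs to run from code rather than the CLI, the equivalent is to re-submit the stream's configuration with a changed description via nats.go's UpdateStream. This is only a sketch; whether such an edit actually resets the mirror consumer depends on the server behavior described above, so treat it as a fallback rather than a guaranteed fix:

// nudgeMirror is an illustrative helper that re-submits the mirror stream's
// configuration with only the description changed, mirroring the
// `nats stream edit` fallback above.
func nudgeMirror(js nats.JetStreamContext, mirrorName string) error {
    info, err := js.StreamInfo(mirrorName)
    if err != nil {
        return err
    }
    cfg := info.Config
    cfg.Description = "trigger mirror reset " + time.Now().Format(time.RFC3339)
    _, err = js.UpdateStream(&cfg)
    return err
}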

Short-term: verify data integrity after recovery

Once the mirror resumes replication, verify the message counts converge:

# Compare source and mirror
echo "Source:" && nats stream info SOURCE_STREAM --json | jq '.state.messages'
echo "Mirror:" && nats stream info MIRROR_STREAM --json | jq '.state.messages'

If the mirror’s message count is lower than the source’s and the gap doesn’t close (because missed messages were already purged from the source by retention policy), you may need to recreate the mirror:

nats stream delete MIRROR_STREAM_NAME -f
nats stream add MIRROR_STREAM_NAME --mirror SOURCE_STREAM_NAME

Long-term: monitor for stalls proactively

Alert on mirror activity age, not just lag. Standard lag-based alerting misses this failure mode entirely. Monitor the active field from the mirror info endpoint.

Ensure cross-cluster connectivity monitoring. If mirrors operate across clusters, monitor gateway and leaf node connections independently. A gateway disconnection that isn’t detected and recovered will cause mirror stalls.

Upgrade the NATS server. Improvements to internal mirror consumer resilience — particularly around automatic recreation after failures — are included in newer server releases.

Frequently asked questions

Is this different from JETSTREAM_017 (Mirror Lag Critical)?

Yes. Mirror Lag Critical (JETSTREAM_017) fires when the mirror has non-zero lag exceeding a threshold — the mirror knows it’s behind and is trying to catch up. Mirror Last Seen Staleness (this check) fires when the mirror reports zero lag but has stopped receiving updates entirely. JETSTREAM_017 is a throughput problem; JETSTREAM_015 is a connectivity/consumer problem.

Could a legitimately idle source trigger this check?

This check compares mirror activity against source activity. If the source stream is also idle (no new messages), the mirror’s inactivity is expected and this check does not fire. The check only triggers when the source is actively receiving messages but the mirror isn’t.

Does this affect consumers reading from the mirror?

Consumers attached to the mirror continue reading whatever data the mirror already has. They won’t receive new messages until the mirror resumes replication. If consumers have read all available messages, they appear idle — waiting for messages that should be arriving but aren’t.

What happens to messages published during the stall?

Messages published to the source during the stall are retained by the source according to its retention policy. When the mirror consumer restarts, it resumes from its last position and catches up on the backlog. If the source’s retention policy (e.g., max_age: 1h) has already purged messages published during the stall, those messages are permanently lost from the mirror.
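Because mirrors preserve the source's sequence numbers, you can tell whether that permanent loss has already occurred by comparing the source's first available sequence with the mirror's last stored sequence. A sketch of that comparison, using a hypothetical helper and the same nats.go setup as the detection example, assuming both streams are visible from the same connection:

// mirrorHasPermanentGap is an illustrative helper that reports whether the
// source has already purged messages the mirror never received. Mirrors keep
// the source's sequence numbers, so a source FirstSeq beyond the mirror's
// LastSeq+1 means the intervening messages cannot be replayed by catch-up.
func mirrorHasPermanentGap(js nats.JetStreamContext, mirrorName, sourceName string) (bool, error) {
    src, err := js.StreamInfo(sourceName)
    if err != nil {
        return false, err
    }
    mir, err := js.StreamInfo(mirrorName)
    if err != nil {
        return false, err
    }
    return src.State.FirstSeq > mir.State.LastSeq+1, nil
}

If this returns true, recreating the mirror (as shown earlier) resynchronizes it with whatever the source still retains, but the purged messages are not recoverable from either stream.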

Can I prevent stalls by increasing mirror resources?

Stalls are typically not caused by resource constraints on the mirror itself. They’re caused by connectivity issues or internal consumer failures. Increasing CPU or memory on the mirror server won’t prevent a network partition or a consumer crash. Focus on connectivity monitoring and server upgrades instead.

Proactive monitoring for NATS mirror last seen staleness with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial