
NATS Peer Seen Critical: Detecting Unresponsive Stream Replicas

Severity: Critical
Category: Consistency
Applies to: JetStream
Check ID: JETSTREAM_023
Detection threshold: time since the replica was last active exceeds the operator-defined io.nats.monitor.peer-seen-critical value

Every replica in a JetStream Raft group periodically communicates with the leader — acknowledging replicated messages, responding to heartbeats, and participating in leader elections. The active field in stream info reports the time elapsed since the leader last heard from each replica. When this value exceeds the operator-defined threshold set via io.nats.monitor.peer-seen-critical, this check fires. A replica that has not been seen within the expected window is functionally unresponsive — it may be offline, partitioned, or so severely degraded that it cannot participate in the Raft group.

Why this matters

The “last seen” timestamp is the most direct indicator of whether a replica is alive and participating. Unlike peer lag, which measures how far behind a replica is in operations, the peer-seen metric measures whether the replica is communicating at all. A replica with high lag but recent activity is behind but recovering. A replica with a stale last-seen time is not recovering — it has stopped participating.

An unresponsive replica directly threatens stream availability. In a three-replica stream, quorum requires two of three peers. If one replica has not been seen for an extended period, the stream is running on two peers. A failure of either remaining peer causes quorum loss — the stream becomes read-only, and all publish operations are rejected. The gap between “one replica unresponsive” and “quorum lost” is a single failure event.

The peer-seen check is critical in environments where server failures may not trigger immediate infrastructure alerts. A server may appear healthy at the OS level (responding to pings, accepting TCP connections) while the NATS process or the JetStream subsystem is hung. The peer-seen metric catches this category of silent failure because it measures actual Raft group participation, not just server reachability.

In multi-region deployments, peer-seen staleness can indicate network partition between regions. The replica’s server is fine, but it cannot reach the leader due to an inter-region connectivity issue. These partitions may be transient (resolving in seconds) or sustained (requiring operator intervention). The peer-seen threshold determines how much staleness you tolerate before alerting.

Common causes

  • Server hosting the replica is down. The most straightforward cause — the NATS server process has stopped, the machine has crashed, or the container has been evicted. The replica is completely offline and the last-seen time grows continuously.

  • Network partition between leader and replica. The replica’s server is running, but it cannot communicate with the leader’s server. This can happen due to firewall changes, routing issues, DNS failures, or cloud provider networking problems. The replica may be perfectly healthy locally but unreachable from the leader’s perspective.

  • NATS server process is hung. The server process is running but unresponsive — a deadlock, excessive GC pressure, or a blocked I/O operation. The server does not respond to Raft heartbeats, causing the leader to mark the replica as stale. This is particularly insidious because infrastructure monitoring may report the server as healthy.

  • Disk I/O stall on the replica. The replica’s storage device has become extremely slow or unresponsive. The Raft subsystem cannot write to disk, blocking message acknowledgment and heartbeat responses. This is common with network-attached storage (NAS/NFS) when the storage backend experiences an outage.

  • Server undergoing extended maintenance. A server was stopped for maintenance (OS updates, hardware replacement, disk migration) and has not been restarted. If the maintenance window exceeds the peer-seen threshold, this check correctly fires.

  • Resource exhaustion. The replica server has run out of file descriptors, memory, or disk space. The NATS process may still be running but cannot function normally, causing Raft responses to time out.

How to diagnose

Check replica last-seen times

nats stream info MY_STREAM

The cluster section shows each replica’s active duration — the time since the leader last heard from it. A healthy replica shows sub-second values. Any value exceeding your threshold warrants investigation.

For machine-readable output:

nats stream info MY_STREAM --json | jq '.cluster.replicas[] | {name: .name, active: .active, offline: .offline, lag: .lag}'
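Each replica object carries name, active, offline, and lag fields. In the raw JSON, active is a Go duration serialized in nanoseconds, so compare it against a nanosecond threshold. An illustrative (made-up) result might look like:

```json
{"name": "nats-1", "active": 250000000, "offline": false, "lag": 0}
{"name": "nats-3", "active": 93000000000, "offline": false, "lag": 4120}
```

Here nats-1 was seen 250ms ago and is healthy, while nats-3 was last seen 93 seconds ago and is accumulating lag — a candidate for this check.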

Determine if the replica server is reachable

# Ping the server
nats rtt --server nats://replica-server:4222
# Check if the server is in the cluster member list
nats server ls

If nats rtt fails or times out, the server or its NATS process is unreachable.

Check server health

# Server-level report
nats server report jetstream
# Check for specific server issues
nats server info REPLICA_SERVER_NAME

Look for servers missing from the report or showing abnormal resource usage.

Check for broader cluster issues

If multiple replicas across different streams are stale, the problem is likely at the server or network level rather than stream-specific:

# Find all streams with stale replicas
nats server report jetstream --streams | grep -i offline

Programmatic detection in Go

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

js, err := nc.JetStream()
if err != nil {
	log.Fatal(err)
}
info, err := js.StreamInfo("MY_STREAM")
if err != nil {
	log.Fatal(err)
}

seenThreshold := 30 * time.Second // your critical threshold

if info.Cluster != nil {
	for _, r := range info.Cluster.Replicas {
		// r.Active is the duration since the leader last heard from this replica.
		if r.Active > seenThreshold {
			log.Printf("CRITICAL: replica %s last seen %s ago (threshold: %s)",
				r.Name, r.Active, seenThreshold)
		}
		if r.Offline {
			log.Printf("CRITICAL: replica %s is marked offline", r.Name)
		}
	}
}

Programmatic detection in Python
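A similar check can be scripted in Python without a client library by parsing the output of nats stream info --json. This is a sketch: the threshold constant and helper names are illustrative, and it assumes the active field is serialized in nanoseconds (a Go duration), as the JetStream API reports it.

```python
import json
import subprocess

# Critical threshold; replica "active" durations in the JSON are Go
# time.Duration values, i.e. nanoseconds.
SEEN_THRESHOLD_NS = 30 * 1_000_000_000  # 30 seconds

def stale_replicas(cluster, threshold_ns):
    """Names of replicas that are offline or not seen within the threshold."""
    return [
        r["name"]
        for r in cluster.get("replicas") or []
        # "offline" may be omitted entirely when false, so use .get()
        if r.get("offline") or r.get("active", 0) > threshold_ns
    ]

def check_stream(stream):
    # Shell out to the nats CLI; assumes `nats` is on PATH with a context
    # pointing at your cluster.
    raw = subprocess.run(
        ["nats", "stream", "info", stream, "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    info = json.loads(raw)
    for name in stale_replicas(info.get("cluster") or {}, SEEN_THRESHOLD_NS):
        print(f"CRITICAL: {stream} replica {name} is unresponsive")

# check_stream("MY_STREAM")  # run against a live cluster
```

Keeping the threshold comparison in a pure helper (stale_replicas) makes the logic easy to unit-test without a running cluster.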

How to fix it

Immediate: assess quorum status

Before anything else, determine if the stream still has quorum:

nats stream info MY_STREAM --json | jq '{
  leader: .cluster.leader,
  total_peers: (.cluster.replicas | length) + 1,
  online_replicas: [.cluster.replicas[] | select(.offline != true)] | length
}'

If the stream has quorum (a majority of peers are online), writes continue to succeed. You have time to investigate the root cause. If quorum is lost, the stream is read-only and requires urgent intervention.

If the server is down: restart or replace

Restart the server if the issue is a crashed process or a hung state:

# If you have SSH access
systemctl restart nats-server
# Or if running in Kubernetes
kubectl rollout restart deployment nats-server

Replace the peer if the server cannot be restored quickly:

nats stream cluster peer-remove MY_STREAM OFFLINE_SERVER_NAME

JetStream will automatically place a new replica on a healthy server and begin replication.

If the network is partitioned

Diagnose the network path between the leader and replica servers. Check:

  • Firewall rules and security group configurations
  • Route tables and DNS resolution
  • Cloud provider status pages for networking incidents

Until the partition is resolved, the replica remains stale. If the partition is expected to be long-lived, consider removing the peer and adding a new replica in a reachable location.

If the server is hung or resource-exhausted

Check server-level metrics: disk space, memory usage, file descriptor count, CPU utilization. Resolve the resource exhaustion:

# Check disk space on the server
df -h /path/to/jetstream/data
# Check open file descriptors (Linux)
ls /proc/$(pgrep nats-server)/fd | wc -l

If the server is out of disk space, free space by removing old data or expanding the volume. Then restart the NATS server. The replica will rejoin and begin catching up.

Preventive: configure monitoring

Set the peer-seen threshold:

nats stream edit MY_STREAM \
--metadata "io.nats.monitor.peer-seen-critical=30s"

Choose a threshold that balances sensitivity with false-positive tolerance. For most deployments, 30 seconds to 2 minutes is appropriate. A too-short threshold may fire during brief network hiccups; a too-long threshold delays detection of real failures.

Monitor complementary checks. Use peer-seen alongside Peer Lag Critical (JETSTREAM_022) and Offline Replica (JETSTREAM_007) for a complete view of replica health. Peer-seen tells you if the replica is communicating. Peer lag tells you if the replica is keeping up. Offline tells you if the server has been marked down.

Implement automated recovery. In Kubernetes or similar orchestration environments, configure liveness probes that test JetStream health, not just HTTP responsiveness. A server that responds to /healthz but has a hung Raft subsystem should be restarted.
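As a sketch of such a probe: the NATS server's monitoring endpoint accepts a js-enabled-only query parameter on /healthz, which makes the health check include JetStream status rather than only HTTP liveness. The port and timing values below are illustrative and assume the monitoring port (8222) is enabled:

```yaml
# Illustrative Kubernetes liveness probe for a NATS pod.
# js-enabled-only=true makes /healthz verify JetStream health, so a hung
# JetStream subsystem fails the probe even if the HTTP listener responds.
livenessProbe:
  httpGet:
    path: /healthz?js-enabled-only=true
    port: 8222
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3
```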

Frequently asked questions

What is the difference between peer-seen and offline status?

The offline flag in stream info indicates whether the server has been marked as offline by the cluster’s meta-leader — typically after the server has been unreachable for an extended period and removed from the cluster membership. The active (peer-seen) duration is more granular — it shows the exact time since the last Raft communication. A replica can have a stale last-seen time without being marked offline. Peer-seen catches the problem earlier, before the cluster’s built-in offline detection triggers.

How does peer-seen relate to Raft election timeouts?

Raft uses its own heartbeat and election timeout mechanism. If the leader stops sending heartbeats, replicas will initiate a new leader election after the election timeout (typically 2-4 seconds in NATS). The peer-seen metric is an observability layer on top of Raft internals — it exposes the communication health to operators. It does not directly influence Raft election behavior.

Can a replica with a stale last-seen still have recent data?

If the last-seen time is stale, the replica is not communicating with the leader. This means it is not receiving new messages from the leader. It may have data up to the point when communication stopped, but nothing after that. The replica’s lag will be growing at the same rate as the stream’s write rate.

What threshold should I use for peer-seen-critical?

For most production deployments, 30 seconds to 2 minutes is a good starting range. Set the threshold based on how quickly you need to detect replica failures versus your tolerance for transient alerts. In latency-sensitive environments where rapid failover is critical, use 10-30 seconds. In environments with known network variability (multi-region, hybrid cloud), use 1-2 minutes to avoid noise from transient connectivity blips.

Does this check fire for R1 streams?

No. R1 (unreplicated) streams have no replicas, so there are no peers to monitor. The peer-seen check only applies to streams with num_replicas >= 2. For R1 streams, use the Stream Quorum Lost check (JETSTREAM_006) to detect the single peer (leader) going offline.

Proactive monitoring for NATS peer seen critical with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial