Every replica in a JetStream Raft group periodically communicates with the leader — acknowledging replicated messages, responding to heartbeats, and participating in leader elections. The active field in stream info reports the time elapsed since the leader last heard from each replica. When this value exceeds the operator-defined threshold set via io.nats.monitor.peer-seen-critical, this check fires. A replica that has not been seen within the expected window is functionally unresponsive — it may be offline, partitioned, or so severely degraded that it cannot participate in the Raft group.
The “last seen” timestamp is the most direct indicator of whether a replica is alive and participating. Unlike peer lag, which measures how far behind a replica is in operations, the peer-seen metric measures whether the replica is communicating at all. A replica with high lag but recent activity is behind but recovering. A replica with a stale last-seen time is not recovering — it has stopped participating.
An unresponsive replica directly threatens stream availability. In a three-replica stream, quorum requires two of three peers. If one replica has not been seen for an extended period, the stream is running on two peers. A failure of either remaining peer causes quorum loss — the stream becomes read-only, and all publish operations are rejected. The gap between “one replica unresponsive” and “quorum lost” is a single failure event.
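The arithmetic is worth making explicit. A small illustration (the quorum helper here is hypothetical, not part of any NATS API):

```go
package main

import "fmt"

// quorum returns the strict majority an n-peer Raft group needs.
func quorum(n int) int { return n/2 + 1 }

func main() {
	// A three-replica stream needs 2 of 3 peers. With one replica
	// unresponsive, a single additional failure loses quorum.
	fmt.Println(quorum(3)) // 2
	fmt.Println(quorum(5)) // 3
}
```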
The peer-seen check is critical in environments where server failures may not trigger immediate infrastructure alerts. A server may appear healthy at the OS level (responding to pings, accepting TCP connections) while the NATS process or the JetStream subsystem is hung. The peer-seen metric catches this category of silent failure because it measures actual Raft group participation, not just server reachability.
In multi-region deployments, peer-seen staleness can indicate network partition between regions. The replica’s server is fine, but it cannot reach the leader due to an inter-region connectivity issue. These partitions may be transient (resolving in seconds) or sustained (requiring operator intervention). The peer-seen threshold determines how much staleness you tolerate before alerting.
Server hosting the replica is down. The most straightforward cause — the NATS server process has stopped, the machine has crashed, or the container has been evicted. The replica is completely offline and the last-seen time grows continuously.
Network partition between leader and replica. The replica’s server is running, but it cannot communicate with the leader’s server. This can happen due to firewall changes, routing issues, DNS failures, or cloud provider networking problems. The replica may be perfectly healthy locally but unreachable from the leader’s perspective.
NATS server process is hung. The server process is running but unresponsive — a deadlock, excessive GC pressure, or a blocked I/O operation. The server does not respond to Raft heartbeats, causing the leader to mark the replica as stale. This is particularly insidious because infrastructure monitoring may report the server as healthy.
Disk I/O stall on the replica. The replica’s storage device has become extremely slow or unresponsive. The Raft subsystem cannot write to disk, blocking message acknowledgment and heartbeat responses. This is common with network-attached storage (NAS/NFS) when the storage backend experiences an outage.
Server undergoing extended maintenance. A server was stopped for maintenance (OS updates, hardware replacement, disk migration) and has not been restarted. If the maintenance window exceeds the peer-seen threshold, this check correctly fires.
Resource exhaustion. The replica server has run out of file descriptors, memory, or disk space. The NATS process may still be running but cannot function normally, causing Raft responses to time out.
```bash
nats stream info MY_STREAM
```

The cluster section shows each replica’s active duration — the time since the leader last heard from it. A healthy replica shows sub-second values. Any value exceeding your threshold warrants investigation.
For machine-readable output:
```bash
nats stream info MY_STREAM --json | jq '.cluster.replicas[] | {name: .name, active: .active, offline: .offline, lag: .lag}'
```

Next, confirm the replica’s server is reachable:

```bash
# Ping the server
nats rtt --server nats://replica-server:4222
```
```bash
# Check if the server is in the cluster member list
nats server ls
```

If nats rtt fails or times out, the server or its NATS process is unreachable.
```bash
# Server-level report
nats server report jetstream
```
```bash
# Check for specific server issues
nats server info REPLICA_SERVER_NAME
```

Look for servers missing from the report or showing abnormal resource usage.
If multiple replicas across different streams are stale, the problem is likely at the server or network level rather than stream-specific:
```bash
# Find all streams with stale replicas
nats server report jetstream --streams | grep -i offline
```

To run the same check programmatically, the nats.go client exposes each replica’s active duration and offline flag:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	info, err := js.StreamInfo("MY_STREAM")
	if err != nil {
		log.Fatal(err)
	}

	seenThreshold := 30 * time.Second // your critical threshold

	if info.Cluster != nil {
		for _, r := range info.Cluster.Replicas {
			if r.Active > seenThreshold {
				log.Printf("CRITICAL: replica %s last seen %s ago (threshold: %s)",
					r.Name, r.Active, seenThreshold)
			}
			if r.Offline {
				log.Printf("CRITICAL: replica %s is marked offline", r.Name)
			}
		}
	}
}
```

Before anything else, determine if the stream still has quorum:
```bash
nats stream info MY_STREAM --json | jq '{
  leader: .cluster.leader,
  total_peers: (.cluster.replicas | length) + 1,
  online_replicas: [.cluster.replicas[] | select(.offline != true)] | length
}'
```

If the stream has quorum (a majority of peers are online), writes continue to succeed. You have time to investigate the root cause. If quorum is lost, the stream is read-only and requires urgent intervention.
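The same check can be done from the client. A minimal sketch, reusing the js context from the monitoring example above; it counts the leader as online because the StreamInfo call succeeded:

```go
// Count online peers: the leader plus every replica not marked offline.
info, err := js.StreamInfo("MY_STREAM")
if err != nil {
	log.Fatal(err)
}
total := len(info.Cluster.Replicas) + 1 // replicas plus the leader
online := 1                             // leader responded to StreamInfo
for _, r := range info.Cluster.Replicas {
	if !r.Offline {
		online++
	}
}
if online <= total/2 {
	log.Printf("URGENT: quorum lost (%d/%d peers online)", online, total)
}
```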
Restart the server if the issue is a crashed process or a hung state:
```bash
# If you have SSH access
systemctl restart nats-server
```
```bash
# Or if running in Kubernetes
kubectl rollout restart deployment nats-server
```

Replace the peer if the server cannot be restored quickly:
```bash
nats stream cluster peer-remove MY_STREAM OFFLINE_SERVER_NAME
```

JetStream will automatically place a new replica on a healthy server and begin replication.
Diagnose the network path between the leader and replica servers. Check firewall rules, routing, DNS resolution, and cloud provider networking between the two hosts (the same failure categories described under the network-partition cause above).
Until the partition is resolved, the replica remains stale. If the partition is expected to be long-lived, consider removing the peer and adding a new replica in a reachable location.
Check server-level metrics: disk space, memory usage, file descriptor count, CPU utilization. Resolve the resource exhaustion:
```bash
# Check disk space on the server
df -h /path/to/jetstream/data
```
```bash
# Check open file descriptors (Linux)
ls /proc/$(pgrep nats-server)/fd | wc -l
```

If the server is out of disk space, free space by removing old data or expanding the volume. Then restart the NATS server. The replica will rejoin and begin catching up.
Set the peer-seen threshold:
```bash
nats stream edit MY_STREAM \
  --metadata "io.nats.monitor.peer-seen-critical=30s"
```

Choose a threshold that balances sensitivity with false-positive tolerance. For most deployments, 30 seconds to 2 minutes is appropriate. A too-short threshold may fire during brief network hiccups; a too-long threshold delays detection of real failures.
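The threshold can also be set from code. A minimal sketch using nats.go, assuming a client version whose StreamConfig exposes the Metadata field and reusing the js context from the earlier example:

```go
// Read the current config, set the monitoring threshold, and update.
info, err := js.StreamInfo("MY_STREAM")
if err != nil {
	log.Fatal(err)
}
cfg := info.Config
if cfg.Metadata == nil {
	cfg.Metadata = map[string]string{}
}
cfg.Metadata["io.nats.monitor.peer-seen-critical"] = "30s"
if _, err := js.UpdateStream(&cfg); err != nil {
	log.Fatal(err)
}
```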
Monitor complementary checks. Use peer-seen alongside Peer Lag Critical (JETSTREAM_022) and Offline Replica (JETSTREAM_007) for a complete view of replica health. Peer-seen tells you if the replica is communicating. Peer lag tells you if the replica is keeping up. Offline tells you if the server has been marked down.
Implement automated recovery. In Kubernetes or similar orchestration environments, configure liveness probes that test JetStream health, not just HTTP responsiveness. A server that responds to /healthz but has a hung Raft subsystem should be restarted.
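One way to do that is a small probe binary that queries the server’s HTTP monitoring endpoint with a JetStream-specific health parameter. A minimal sketch, assuming the monitoring endpoint is enabled on its default port 8222:

```go
package main

import (
	"net/http"
	"os"
)

func main() {
	// js-server-only asks /healthz to verify the JetStream subsystem,
	// not just core server liveness.
	resp, err := http.Get("http://localhost:8222/healthz?js-server-only=true")
	if err != nil || resp.StatusCode != http.StatusOK {
		os.Exit(1) // non-zero exit tells the orchestrator to restart
	}
}
```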
The offline flag in stream info indicates whether the server has been marked as offline by the cluster’s meta-leader — typically after the server has been unreachable for an extended period and removed from the cluster membership. The active (peer-seen) duration is more granular — it shows the exact time since the last Raft communication. A replica can have a stale last-seen time without being marked offline. Peer-seen catches the problem earlier, before the cluster’s built-in offline detection triggers.
Raft uses its own heartbeat and election timeout mechanism. If the leader stops sending heartbeats, replicas will initiate a new leader election after the election timeout (typically 2-4 seconds in NATS). The peer-seen metric is an observability layer on top of Raft internals — it exposes the communication health to operators. It does not directly influence Raft election behavior.
If the last-seen time is stale, the replica is not communicating with the leader. This means it is not receiving new messages from the leader. It may have data up to the point when communication stopped, but nothing after that. The replica’s lag will be growing at the same rate as the stream’s write rate.
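In practice you can distinguish a recovering replica from a stalled one by sampling its lag twice and looking at the trend. A minimal sketch, assuming the imports (fmt, time, nats) from the earlier example; lagTrend is a hypothetical helper, not a NATS API:

```go
// lagTrend samples a replica's lag twice, `wait` apart. A negative
// delta means the replica is catching up; a positive delta while the
// stream is still receiving writes means it has stopped participating.
func lagTrend(js nats.JetStreamContext, stream, replica string, wait time.Duration) (int64, error) {
	sample := func() (uint64, error) {
		info, err := js.StreamInfo(stream)
		if err != nil {
			return 0, err
		}
		if info.Cluster == nil {
			return 0, fmt.Errorf("stream %q is not replicated", stream)
		}
		for _, r := range info.Cluster.Replicas {
			if r.Name == replica {
				return r.Lag, nil
			}
		}
		return 0, fmt.Errorf("replica %q not found", replica)
	}

	before, err := sample()
	if err != nil {
		return 0, err
	}
	time.Sleep(wait)

	after, err := sample()
	if err != nil {
		return 0, err
	}
	return int64(after) - int64(before), nil
}
```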
For most production deployments, 30 seconds to 2 minutes is a good starting range. Set the threshold based on how quickly you need to detect replica failures versus your tolerance for transient alerts. In latency-sensitive environments where rapid failover is critical, use 10-30 seconds. In environments with known network variability (multi-region, hybrid cloud), use 1-2 minutes to avoid noise from transient connectivity blips.
No. R1 (unreplicated) streams have no replicas, so there are no peers to monitor. The peer-seen check only applies to streams with num_replicas >= 2. For R1 streams, use the Stream Quorum Lost check (JETSTREAM_006) to detect the single peer (leader) going offline.