
NATS Raft Sustained Catching Up: What It Means and How to Fix It

Severity: Warning
Category: Health
Applies to: System Improvement
Check ID: OPT_SYS_013
Detection threshold: Raft group member in catching_up state

A Raft sustained catching up condition occurs when a member of a NATS JetStream Raft group is in the catching_up state, meaning it is replaying missed log entries or receiving a snapshot from the leader to rebuild its local state. Brief catch-up after a restart is normal — sustained catching up indicates the follower cannot close the gap with the leader.

Why this matters

When a Raft group member is catching up, it is not a fully functional replica. It cannot vote in leader elections, which reduces the effective quorum safety margin. In an R3 configuration, one member catching up means only two members can vote — any additional failure loses quorum entirely, stalling the stream or consumer. The system is operating without its intended fault tolerance for the entire duration of the catch-up.

The catch-up process itself consumes resources on both the leader and the follower. The leader must read historical data from its storage engine and transmit it to the catching-up follower, competing with live client traffic for disk I/O and network bandwidth. On high-throughput streams, catch-up becomes a race against live ingest: the leader is sending historical data while simultaneously receiving and replicating new messages. If the follower can’t apply the historical data faster than the leader produces new data, the catch-up never completes — the gap stays constant or grows.
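To make that race concrete, here is a minimal sketch with hypothetical rates (illustrative numbers, not measured values): the gap closes only when the follower’s apply rate exceeds the leader’s ingest rate, and the time to close is the gap divided by the difference.

// Go: back-of-the-envelope catch-up math with hypothetical rates
package main

import "fmt"

func main() {
	gapMB := 20_000.0  // backlog: follower is 20 GB behind the leader
	applyMBs := 250.0  // follower's sustained apply rate (often disk-bound)
	ingestMBs := 180.0 // leader's live write rate from publishers

	headroom := applyMBs - ingestMBs // capacity left over to drain the backlog
	if headroom <= 0 {
		fmt.Println("catch-up will never complete: apply rate <= ingest rate")
		return
	}
	fmt.Printf("estimated catch-up time: %.0f seconds\n", gapMB/headroom) // ~286s here
}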

Sustained catching up is a leading indicator of more severe failures. If the underlying cause isn’t addressed, the catching-up member may eventually be marked offline (META_001), the Raft group may lose quorum (JETSTREAM_008), and the stream or consumer becomes unavailable for writes. Catching up that persists for more than a few minutes warrants immediate investigation.

Common causes

  • Large stream requiring full snapshot transfer. When a Raft follower has been offline long enough that the leader’s log no longer contains the entries it missed (they’ve been compacted), the leader must send a full snapshot of the stream’s state. For a 100GB stream, this transfer takes roughly 15 minutes over an uncontended 1Gbps network link, and far longer when the link is shared with live traffic (see the transfer-time sketch after this list); the member remains in catching-up state for the entire transfer.

  • Slow disk on the follower. The catching-up follower must write incoming data to its storage engine. If the follower’s disk is slower than the rate at which the leader sends data — or if the disk is contended by other Raft groups on the same server — the follower falls further behind instead of catching up. Network-attached storage with high write latency is a common bottleneck.

  • Insufficient network bandwidth between leader and follower. Catch-up data competes with live Raft replication traffic and client connections for network bandwidth. If the link between servers is saturated or bandwidth-limited (common in cross-region deployments), catch-up throughput is throttled below what’s needed to close the gap.

  • Leader under heavy write load during catch-up. A stream receiving thousands of messages per second while simultaneously serving catch-up data puts dual I/O pressure on the leader’s storage engine. The leader reads historical data for the follower and writes new data from publishers, and the aggregate I/O can exceed disk throughput, slowing both catch-up and live replication.

  • Repeated restarts preventing catch-up completion. If the catching-up server restarts before catch-up finishes — due to OOM kills, crash loops, or automated rollouts — each restart resets catch-up progress. The follower starts from the beginning of the snapshot each time, never completing the transfer.

  • Too many Raft groups catching up simultaneously. After a server restart, all Raft groups on that server need to catch up. If the server hosts hundreds of stream and consumer replicas, the aggregate catch-up I/O can saturate both network and disk, causing all groups to catch up slowly and none to finish quickly.
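For a rough sense of the snapshot-transfer numbers above, a minimal sketch assuming the network link is the only bottleneck (it ignores protocol overhead and competing traffic, so real transfers run longer):

// Go: best-case snapshot transfer time from size and link speed
package main

import "fmt"

func main() {
	snapshotGB := 100.0 // hypothetical snapshot size
	linkGbps := 1.0     // bandwidth available between leader and follower

	seconds := snapshotGB * 8 / linkGbps // convert GB to gigabits, divide by Gbps
	fmt.Printf("best-case transfer: %.0f s (about %.0f minutes)\n", seconds, seconds/60)
}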

How to diagnose

Check stream replica status

Inspect the per-replica state for specific streams:

Terminal window
nats stream info <stream_name>

Look for replicas reporting a not-current status along with a lag value. A replica stuck in catch-up shows stagnant or growing lag over consecutive checks; a healthy catch-up shows the lag shrinking.

Monitor catch-up progress over time

Run repeated checks to determine if the lag is decreasing:

Terminal window
# Check every 30 seconds, watch for lag trend
watch -n 30 'nats stream info <stream_name> --json | jq ".cluster.replicas[] | {name: .name, current: .current, lag: .lag, active: .active}"'

If the lag count is decreasing, catch-up is progressing and will eventually complete. If the lag is stable or increasing, the follower cannot keep up with the leader’s write rate.
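The same trend check can run programmatically; a minimal Go sketch (the stream name ORDERS and the 30-second interval are assumptions):

// Go: poll replica lag and report whether the gap is shrinking
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, _ := nats.Connect(nats.DefaultURL) // errors elided for brevity
	js, _ := nc.JetStream()

	prev := map[string]uint64{} // last observed lag per replica
	for {
		if info, err := js.StreamInfo("ORDERS"); err == nil && info.Cluster != nil {
			for _, r := range info.Cluster.Replicas {
				if r.Current {
					continue
				}
				trend := "first sample"
				if p, seen := prev[r.Name]; seen {
					switch {
					case r.Lag < p:
						trend = "shrinking" // catch-up is progressing
					case r.Lag > p:
						trend = "growing" // falling further behind
					default:
						trend = "stalled"
					}
				}
				fmt.Printf("%s lag=%d (%s)\n", r.Name, r.Lag, trend)
				prev[r.Name] = r.Lag
			}
		}
		time.Sleep(30 * time.Second)
	}
}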

Check the JetStream meta group

For system-wide visibility into catching-up members:

Terminal window
nats server report jetstream

Servers with Raft group members in catch-up state appear in the report’s Raft group information. Multiple groups catching up on the same server confirms a server-level issue.

Measure network throughput between servers

Verify that bandwidth isn’t the bottleneck:

Terminal window
# Check RTT between servers
nats server list

For catch-up transfers, sustained throughput matters more than latency. Check route connection statistics for data transfer rates:

Terminal window
curl -s http://localhost:8222/routez | jq '.routes[] | {remote_id: .remote_id, in_bytes: .in_bytes, out_bytes: .out_bytes}'
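To turn two /routez samples into a throughput figure, a minimal Go sketch (assumes the monitoring endpoint is reachable at localhost:8222; the 10-second window is an assumption):

// Go: estimate per-route throughput by sampling /routez twice
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type routez struct {
	Routes []struct {
		RemoteID string `json:"remote_id"`
		InBytes  int64  `json:"in_bytes"`
		OutBytes int64  `json:"out_bytes"`
	} `json:"routes"`
}

// sample returns total bytes transferred (in + out) per remote server.
func sample() (map[string]int64, error) {
	resp, err := http.Get("http://localhost:8222/routez")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var rz routez
	if err := json.NewDecoder(resp.Body).Decode(&rz); err != nil {
		return nil, err
	}
	totals := map[string]int64{}
	for _, r := range rz.Routes {
		totals[r.RemoteID] = r.InBytes + r.OutBytes
	}
	return totals, nil
}

func main() {
	before, _ := sample() // errors elided for brevity
	time.Sleep(10 * time.Second)
	after, _ := sample()
	for id, total := range after {
		if base, ok := before[id]; ok {
			mbps := float64(total-base) * 8 / 10 / 1e6 // bits over the 10s window
			fmt.Printf("route %s: %.1f Mbit/s\n", id, mbps)
		}
	}
}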

Check disk I/O on the catching-up server

The follower’s disk write performance is often the limiting factor:

Terminal window
# Linux
iostat -xz 1 5
# Key metrics: await (should be < 5ms on SSD), %util (below 90%)

If the disk is saturated, catch-up data competes with every other Raft group on the server for I/O bandwidth.

Check CPU on the catching-up server

Catch-up requires the follower to apply incoming entries — decompress, validate, and write to the filestore. CPU saturation on the follower stalls apply throughput even when disk and network have headroom:

Terminal window
# Sustained > 90% on the nats-server process suggests CPU is the bottleneck
top -p $(pidof nats-server)

Compression overhead from S2-compressed streams, contention from co-located workloads, and cgroup CPU limits set too low for the catch-up workload are the most common causes of CPU-bound catch-up.

How to fix it

Immediate: accelerate the catch-up

Temporarily reduce publish rate to the affected stream. If the stream is receiving heavy write traffic, reducing the rate gives the catching-up follower more headroom to close the gap. This is a temporary measure — coordinate with publishers if possible:

Terminal window
# Check current stream message rate
nats stream report
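If you control the publishers, throttling can be as simple as pacing synchronous publishes; a minimal Go sketch (the ~100 msg/s cap and the subject orders.new are assumptions, not recommendations):

// Go: temporary client-side publish throttle using a ticker
package main

import (
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, _ := nats.Connect(nats.DefaultURL) // errors elided for brevity
	js, _ := nc.JetStream()

	tick := time.NewTicker(10 * time.Millisecond) // caps publishes at ~100/s
	defer tick.Stop()
	for range tick.C {
		// Synchronous publish: waits for the stream's ack before the next send.
		js.Publish("orders.new", []byte("payload"))
	}
}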

Prioritize the catching-up server’s resources. If other non-critical workloads on the catching-up server are consuming disk I/O or CPU, reduce or pause them until catch-up completes.

Short-term: address the resource bottleneck

Upgrade disk I/O on the follower. NVMe storage with sustained write throughput of 1GB/s+ handles catch-up transfers from even the largest streams without becoming the bottleneck:

# Server configuration — dedicated fast storage
jetstream {
  store_dir: "/data/jetstream"  # NVMe volume
  max_file_store: 500G
}

Ensure adequate network bandwidth between servers. Catch-up for a 100GB stream at 1Gbps takes ~15 minutes under ideal conditions. At 100Mbps, that stretches to well over two hours, during which new data continues accumulating. Size network links for catch-up throughput, not just steady-state replication. Replica health can also be verified programmatically after stream operations:

// Go — verify Raft group health after stream operations
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, _ := nats.Connect(nats.DefaultURL) // errors elided for brevity
	js, _ := nc.JetStream()

	info, _ := js.StreamInfo("ORDERS")
	for _, r := range info.Cluster.Replicas {
		if !r.Current {
			log.Printf("Replica %s is catching up, lag: %d", r.Name, r.Lag)
		}
	}
}

# Python — check replica catch-up status
import asyncio
import nats

async def main():
    nc = await nats.connect()
    js = nc.jetstream()
    info = await js.stream_info("ORDERS")
    for replica in info.cluster.replicas:
        if not replica.current:
            print(f"Replica {replica.name} catching up, lag: {replica.lag}")
    await nc.close()

asyncio.run(main())

Long-term: prevent sustained catch-up scenarios

Right-size Raft groups per server. Fewer Raft groups per server means faster individual catch-up after restarts. Use placement tags to limit Raft group density:

# Tag servers for stream placement budgeting
server_tags: ["jetstream", "region-us-east"]
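Placement can also be set when the stream is created; a minimal nats.go sketch (the stream name, subjects, and tag values are examples and must match the server_tags in your server configs):

// Go: pin a stream's replicas to tagged servers at creation time
package main

import "github.com/nats-io/nats.go"

func main() {
	nc, _ := nats.Connect(nats.DefaultURL) // errors elided for brevity
	js, _ := nc.JetStream()

	js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
		Replicas: 3,
		Placement: &nats.Placement{
			Tags: []string{"jetstream", "region-us-east"}, // only tagged servers host replicas
		},
	})
}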

Implement rolling restart procedures. Instead of restarting all servers simultaneously (which causes every Raft group to catch up everywhere), restart one server at a time and wait for all its Raft groups to become current before proceeding to the next:

Terminal window
# Rolling restart — wait for catch-up between restarts
nats server report jetstream
# Verify all replicas show "current" before restarting the next server
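The wait step can be scripted; a minimal Go sketch that blocks until every replica of every stream reports current (add a timeout and real error handling in practice):

// Go: gate a rolling restart on catch-up completion
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

// allCurrent reports whether every replica of every stream is current.
func allCurrent(js nats.JetStreamContext) bool {
	for info := range js.StreamsInfo() {
		if info.Cluster == nil {
			continue // R1 streams have no replicas to wait on
		}
		for _, r := range info.Cluster.Replicas {
			if !r.Current {
				fmt.Printf("%s: replica %s still catching up\n", info.Config.Name, r.Name)
				return false
			}
		}
	}
	return true
}

func main() {
	nc, _ := nats.Connect(nats.DefaultURL) // errors elided for brevity
	js, _ := nc.JetStream()

	for !allCurrent(js) {
		time.Sleep(10 * time.Second)
	}
	fmt.Println("all replicas current; safe to restart the next server")
}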

Use stream compression for large streams. Compressed streams transfer smaller snapshots during catch-up, reducing both network transfer time and disk write volume:

Terminal window
# Enable S2 compression on the stream
nats stream edit <stream_name> --compression s2
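The same change can be made from code; a minimal nats.go sketch (assumes a client version that exposes StreamConfig.Compression, which pairs with NATS Server 2.10+ file-backed streams):

// Go: enable S2 compression on an existing stream
package main

import "github.com/nats-io/nats.go"

func main() {
	nc, _ := nats.Connect(nats.DefaultURL) // errors elided for brevity
	js, _ := nc.JetStream()

	info, _ := js.StreamInfo("ORDERS")
	cfg := info.Config
	cfg.Compression = nats.S2Compression // smaller snapshots during catch-up
	js.UpdateStream(&cfg)
}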

Monitor catch-up duration as an SLA metric. Track how long Raft groups spend catching up after planned maintenance. If catch-up time exceeds your RTO (recovery time objective), the stream is too large for the available infrastructure or needs more replicas distributed across more servers.
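One way to capture the metric: a minimal Go sketch that timestamps a replica’s transition out of current and prints how long catch-up took (stream name and poll interval are assumptions):

// Go: measure catch-up duration per replica
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, _ := nats.Connect(nats.DefaultURL) // errors elided for brevity
	js, _ := nc.JetStream()

	started := map[string]time.Time{} // when each replica was first seen catching up
	for {
		if info, err := js.StreamInfo("ORDERS"); err == nil && info.Cluster != nil {
			for _, r := range info.Cluster.Replicas {
				if !r.Current && started[r.Name].IsZero() {
					started[r.Name] = time.Now()
				} else if r.Current && !started[r.Name].IsZero() {
					fmt.Printf("%s caught up in %s\n", r.Name, time.Since(started[r.Name]).Round(time.Second))
					delete(started, r.Name)
				}
			}
		}
		time.Sleep(5 * time.Second)
	}
}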

Frequently asked questions

How long should Raft catch-up take?

It depends on the data volume and infrastructure. A stream with 1GB of data on NVMe storage with 10Gbps networking should catch up in seconds. A 500GB stream on network-attached storage with 1Gbps networking can take an hour or more. The catch-up rate is bounded by the slowest of: leader read I/O, network throughput, and follower write I/O. If catch-up takes more than 10 minutes for a moderately sized stream (under 50GB), investigate the bottleneck.

Does a catching-up replica affect stream availability?

Not directly — the stream remains available as long as a quorum of replicas is online and current. In an R3 stream, one catching-up replica leaves two voting members, which is still quorum. However, the safety margin is gone: one more failure loses quorum and stalls the stream. During catch-up, the stream is operating at reduced fault tolerance.

Can I force a Raft group to restart catch-up from scratch?

Yes, by removing the peer and letting the cluster place a replacement. This forces a clean snapshot transfer instead of incremental log replay, which can be faster if the follower’s local state is severely diverged:

Terminal window
nats stream cluster peer-remove <stream_name> <peer_name>
# Wait for the stream to stabilize with R-1 replicas
# the cluster auto-replaces removed peers based on placement (no peer-add subcommand)

Use this as a last resort — removing a peer temporarily reduces replication factor.

Why does catch-up restart from zero after a server restart?

If the server restarts before catch-up completes, the partially received snapshot is discarded. Raft snapshots are atomic — they must be fully received and validated before replacing the follower’s state. Partial snapshots cannot be resumed. This is why preventing unnecessary restarts during catch-up is important. Check for OOM kills, watchdog timers, or orchestration tools that might restart the server prematurely.

Is catching up different from replica lag?

Yes. Replica lag (JETSTREAM_001) measures how far behind a replica’s last sequence number is from the leader — it applies to replicas that are participating in the Raft group normally but processing entries slowly. Catching up is a distinct Raft state where the member is not participating in normal replication at all — it’s receiving a snapshot or replaying historical log entries. A catching-up member transitions to a lagging replica (and eventually a current replica) once catch-up completes.

Proactive monitoring for NATS Raft sustained catching up with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial