NATS R1 Streams in Multi-Node Clusters: Single Points of Failure

Severity: Info
Category: Health
Applies to: JetStream
Check ID: OPT_SYS_021
Detection threshold: Stream has replicas=1 while cluster has 3+ nodes

An R1 stream stores data on a single server with no replicas. In a multi-node NATS cluster, this means the stream is a single point of failure. If the hosting node goes down — planned maintenance, hardware failure, network partition — that stream is completely offline until the node recovers. No reads, no writes, no consumer delivery. For critical data, this is an availability gap that your cluster topology was designed to prevent.

Why this matters

The entire point of running a multi-node NATS cluster is fault tolerance. With R3 (three replicas), a stream survives the loss of any single node. The remaining two replicas maintain quorum, and reads and writes continue without interruption. R1 bypasses this safety net entirely.

The failure mode is binary and immediate. When an R3 stream loses a node, it degrades gracefully — one replica is down, but two maintain quorum and the stream remains fully operational. When an R1 stream loses its node, it goes from fully operational to completely unavailable in an instant. There’s no degraded state, no reduced throughput, just an outage.

During the outage window, publishers to the R1 stream receive errors. Consumers stop receiving messages. If the stream is a WorkQueue, processing halts. If it’s a KV bucket backing configuration data, dependent services lose access to their configuration. If it’s an event stream that other services depend on, the downstream cascade begins.
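To make the publisher side of this concrete, here is a minimal sketch using the legacy nats.go JetStream API (the same API as the examples later on this page). It assumes an established *nats.Conn named nc; the retry policy is purely illustrative:

// Sketch: synchronous publish to a stream whose only replica is offline.
// Assumes an established *nats.Conn nc; the retry policy is illustrative.
js, _ := nc.JetStream()
for attempt := 1; attempt <= 3; attempt++ {
    _, err := js.Publish("orders.created", []byte(`{"id": 42}`))
    if err == nil {
        break // the stream acknowledged the message
    }
    // With the R1 stream's host down, the publish fails with an error
    // such as nats.ErrNoResponders or a request timeout instead of an ack.
    log.Printf("publish attempt %d failed: %v", attempt, err)
    time.Sleep(time.Duration(attempt) * time.Second) // simple backoff
}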

Recovery depends entirely on the failed node coming back. If the node suffered a disk failure, recovery requires restoring from backup (if one exists) or accepting data loss. Even for a clean restart — say, a rolling upgrade — the stream is offline for the full duration of the node’s restart cycle. In large clusters with many streams and consumers, restart can take minutes as the node replays WAL files and rebuilds state.

The risk is amplified by the fact that R1 streams are often created unintentionally. The default replica count in many NATS client libraries and CLI tools is 1. Developers creating streams in development (where R1 is fine) carry that configuration into production without adjusting the replica count.

Common causes

  • Default stream configuration. nats stream add defaults to R1 if --replicas is not specified. Programmatic stream creation often uses the library default, which is also R1 in most SDKs (see the sketch after this list).

  • Development configuration carried to production. Streams created and tested in a single-node dev environment are deployed to production multi-node clusters without updating the replica count.

  • Cost-conscious overuse of R1. Teams intentionally use R1 to reduce storage costs (R3 triples disk usage). This is a valid trade-off for ephemeral or reproducible data, but it’s often applied too broadly, including to streams carrying critical business data.

  • Ephemeral streams that became permanent. A stream was created as a temporary staging area or test stream with R1. Over time, it became load-bearing as other services started depending on it, but the replica count was never updated.

  • KV buckets defaulting to R1. NATS KV buckets are backed by streams, and nats kv add defaults to R1. KV buckets used for configuration, feature flags, or service discovery may carry critical data on a single replica.
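
To illustrate the first cause in Go: with the legacy nats.go JetStream API, leaving Replicas unset sends the zero value and the server falls back to a single replica. A minimal sketch (the stream name is hypothetical; nc is an established *nats.Conn):

// Sketch: omitting Replicas silently yields an R1 stream.
js, _ := nc.JetStream()
info, err := js.AddStream(&nats.StreamConfig{
    Name:     "DEV_EVENTS", // hypothetical
    Subjects: []string{"dev.events.>"},
    // Replicas not set: zero value, so the server defaults to 1
})
if err != nil {
    log.Fatal(err)
}
log.Printf("replicas: %d", info.Config.Replicas) // prints 1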

How to diagnose

List all R1 streams in a multi-node cluster

Terminal window
# Check cluster size first
nats server list
# List streams with their replica count
nats stream list --json | jq '.[] | select(.config.num_replicas == 1) | {name: .config.name, replicas: .config.num_replicas, subjects: .config.subjects}'

Check specific streams

Terminal window
nats stream info MY_STREAM

Look at Replicas in the output. If it shows 1 and your cluster has 3+ nodes, this stream has no redundancy.
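
If you prefer to scan programmatically rather than with jq, a minimal Go sketch using the legacy JetStream API (nc assumed to be an established *nats.Conn):

// Sketch: enumerate all streams and flag single-replica ones.
js, _ := nc.JetStream()
for info := range js.Streams() {
    if info.Config.Replicas == 1 {
        log.Printf("R1 stream: %s (subjects: %v)",
            info.Config.Name, info.Config.Subjects)
    }
}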

Identify R1 KV buckets

Terminal window
nats stream list --json | jq '.[] | select(.config.name | startswith("KV_")) | select(.config.num_replicas == 1) | .config.name'

Assess criticality

Not all R1 streams need to be upgraded. Evaluate each one:

Terminal window
# Check message rate and consumer count
nats stream info MY_STREAM --json | jq '{
  name: .config.name,
  messages: .state.messages,
  bytes: .state.bytes,
  consumers: .state.consumer_count,
  subjects: .config.subjects
}'

Streams with active consumers, high message counts, or subjects that other services depend on are candidates for R3 upgrade.

How to fix it

Upgrade critical streams to R3

You can update the replica count on an existing stream without downtime:

Terminal window
nats stream edit MY_STREAM --replicas 3

The server will begin replicating existing data to two additional nodes. During replication, the stream remains fully available. Monitor the replica catch-up:

Terminal window
nats stream info MY_STREAM

Watch the Replicas section — new replicas will show as catching up until they’re fully synchronized.
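
The same upgrade can be done programmatically. A minimal Go sketch using the legacy JetStream API (nc assumed established): fetch the current config, change only the replica count, and submit the update:

// Sketch: raise an existing stream from R1 to R3 in place.
js, _ := nc.JetStream()
info, err := js.StreamInfo("MY_STREAM")
if err != nil {
    log.Fatal(err)
}
cfg := info.Config
cfg.Replicas = 3 // only change the replica count
if _, err := js.UpdateStream(&cfg); err != nil {
    log.Fatal(err)
}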

Set R3 at stream creation time

Always specify replicas explicitly when creating streams in production:

Terminal window
nats stream add ORDERS --subjects "orders.>" --replicas 3 --retention limits --max-age 7d

In Go:

js, _ := nc.JetStream()
_, err := js.AddStream(&nats.StreamConfig{
    Name:     "ORDERS",
    Subjects: []string{"orders.>"},
    Replicas: 3,
    MaxAge:   7 * 24 * time.Hour,
})

In Python:

import nats
from nats.js.api import StreamConfig

nc = await nats.connect()
js = nc.jetstream()

await js.add_stream(
    StreamConfig(
        name="ORDERS",
        subjects=["orders.>"],
        num_replicas=3,
        max_age=7 * 24 * 3600,  # 7 days in seconds
    )
)

Upgrade R1 KV buckets

KV buckets are backed by streams named KV_<bucket>, so raise the replica count by editing the backing stream:

Terminal window
nats stream edit KV_MY_CONFIG --replicas 3
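
When creating buckets programmatically, you can set the replica count up front. A minimal Go sketch with the legacy JetStream API (nc assumed established):

// Sketch: create a KV bucket with three replicas from the start.
js, _ := nc.JetStream()
_, err := js.CreateKeyValue(&nats.KeyValueConfig{
    Bucket:   "MY_CONFIG",
    Replicas: 3, // backing stream KV_MY_CONFIG is created as R3
})
if err != nil {
    log.Fatal(err)
}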

Know when R1 is acceptable

R1 is a valid choice for specific use cases. Don’t blindly upgrade everything:

  • Ephemeral work queues where messages are consumed within seconds and can be republished on failure
  • Cache-tier data that can be rebuilt from a source of truth
  • Development and testing environments
  • High-volume, low-value telemetry where occasional data loss is acceptable
  • Streams sourced from another R3 stream, where the source provides durability (see the sketch after this list)
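
For that last case, a sourced R1 stream might look like the following Go sketch (legacy JetStream API; stream names are hypothetical and nc is assumed established):

// Sketch: an intentional R1 stream that sources from a replicated stream.
js, _ := nc.JetStream()
_, err := js.AddStream(&nats.StreamConfig{
    Name:     "TELEMETRY_RAW",
    Replicas: 1, // acceptable here: durability comes from the R3 source
    Sources: []*nats.StreamSource{
        {Name: "TELEMETRY_PROCESSED"}, // hypothetical R3 source stream
    },
})
if err != nil {
    log.Fatal(err)
}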

Document R1 decisions explicitly so future operators understand the trade-off:

Terminal window
nats stream edit TELEMETRY_RAW --description "R1 intentional: ephemeral telemetry, source is R3 TELEMETRY_PROCESSED"

Automate replica enforcement

Create a CI check or operational script that flags R1 streams in production clusters:

#!/bin/bash
# Flag R1 streams in clusters with 3+ nodes
NODES=$(nats server list --json | jq length)
if [ "$NODES" -ge 3 ]; then
  R1_STREAMS=$(nats stream list --json | jq '[.[] | select(.config.num_replicas == 1) | .config.name] | length')
  if [ "$R1_STREAMS" -gt 0 ]; then
    echo "WARNING: $R1_STREAMS R1 stream(s) in a $NODES-node cluster"
    nats stream list --json | jq '.[] | select(.config.num_replicas == 1) | .config.name'
  fi
fi

Frequently asked questions

Can I change replica count without downtime?

Yes. Increasing replicas from R1 to R3 is an online operation. The existing leader continues serving reads and writes while the new replicas catch up. Depending on stream size, catch-up may take seconds to hours. Monitor with nats stream info — the new replicas transition from catching up to current when synchronized.

Does R3 triple my storage costs?

Yes, R3 stores three copies of every message across three nodes. For disk-heavy workloads, this can be significant. Mitigate by setting appropriate max_age, max_bytes, or max_msgs limits on streams, and use compression (--compression s2) for large streams. The storage cost of R3 is the price of availability — evaluate it against the cost of the outage R1 exposes you to.
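
A sketch of such a bounded R3 configuration in Go (legacy JetStream API; the limits are illustrative, and the Compression field requires a NATS 2.10+ server):

// Sketch: bound R3 storage with retention limits and compression.
js, _ := nc.JetStream()
_, err := js.AddStream(&nats.StreamConfig{
    Name:        "ORDERS",
    Subjects:    []string{"orders.>"},
    Replicas:    3,
    MaxAge:      7 * 24 * time.Hour,      // discard messages older than 7 days
    MaxBytes:    10 * 1024 * 1024 * 1024, // cap stream size (stored per replica)
    Compression: nats.S2Compression,      // S2 compression, NATS 2.10+
})
if err != nil {
    log.Fatal(err)
}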

What about R5?

NATS supports R5 (five replicas) for deployments requiring survival of two simultaneous node failures. R5 is rarely needed — it increases storage and Raft overhead. R3 is the standard recommendation for production high availability. Use R5 only if your failure domain analysis specifically requires it.

What happens to in-flight messages when an R1 stream’s node goes down?

Unacknowledged messages are unavailable until the node recovers. For WorkQueue retention streams, this means processing halts. For limits-retention streams, consumers can’t read any messages. Publishers receive errors (no responders or timeout). With R3, the remaining two replicas elect a new leader within seconds and processing continues.

Does Insights flag all R1 streams?

Insights flags R1 streams specifically in multi-node clusters (3+ nodes) where redundancy is available but not being used. In single-node deployments, R1 is the only option and is not flagged. The check helps you identify streams that could benefit from the redundancy your cluster topology already provides.

Proactive monitoring for NATS R1 streams in multi-node clusters with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial