Using NATS Gateways with Kubernetes Load Balancers

A common community question is how to expose NATS gateways from Kubernetes or OpenShift when building a supercluster across regions, especially when JetStream readiness reports errors such as:

Healthcheck failed: "JetStream has not established contact with a meta leader"

The short version: keep local cluster formation independent from your external load balancers, and use load balancers only where Kubernetes networking requires them for cross-region gateway reachability.

Gateways are for connecting clusters, not forming a local cluster

A NATS supercluster is made from multiple NATS clusters connected with gateways. Within a single Kubernetes cluster or region, the NATS servers should still form their local NATS cluster using the normal cluster routes.

Do not put an external load balancer in the critical path for the route connections between pods in the same local NATS cluster. In Kubernetes, local clustering is typically handled with internal service discovery, such as the service patterns used by the NATS Helm chart.

Use the gateway port for traffic between NATS clusters in different regions or networks.
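As a rough sketch of that separation, the fragment below keeps route URLs on in-cluster DNS names and reserves the gateway block for the externally reachable address. The pod hostnames, the nats-headless service name, and the nats namespace are assumptions for illustration, not values taken from any particular chart. The ports are the conventional defaults.

```
# Local cluster: routes use in-cluster DNS only.
# Hostnames below are hypothetical headless-service names.
cluster {
  name: us-east
  port: 6222
  routes: [
    nats://nats-0.nats-headless.nats.svc.cluster.local:6222,
    nats://nats-1.nats-headless.nats.svc.cluster.local:6222,
    nats://nats-2.nats-headless.nats.svc.cluster.local:6222
  ]
}

# Gateway: the only listener that should sit behind the external load balancer.
# The gateway name must match the cluster name above.
gateway {
  name: us-east
  port: 7222
  advertise: nats-gw-us-east.example.com:7222
}
```

Nothing in the cluster block refers to the load balancer, so route connections between pods keep working even when the external endpoint is unhealthy or not yet provisioned.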

A practical target architecture is:

1. NATS pods in a region form a healthy local cluster first.
2. JetStream can elect local metadata leadership as expected for that cluster.
3. The region exposes a gateway endpoint reachable by other regions.
4. Other regions connect to that gateway endpoint to form the supercluster.

Prefer one load balancer per regional gateway endpoint

For Kubernetes and OpenShift deployments, an external LoadBalancer service is often a convenient way to make the gateway port reachable from another region.

In most cases, start with one load balancer in front of the NATS servers in a region for gateway traffic, rather than one load balancer per pod. A per-pod load balancer design can work in some environments, but it is usually more expensive and operationally more complex, and it can make bootstrap and health-check behavior harder to reason about.

NATS itself does not require load balancers for gateways. The load balancer is a Kubernetes or cloud networking mechanism that provides an externally reachable address.

Configure gateway advertise addresses carefully

If a NATS server is behind a load balancer, remote clusters need an address they can actually reach. That usually means configuring the gateway advertise address to the external DNS name and port exposed by the load balancer.

A simplified gateway configuration looks like this:

gateway {
  name: us-east
  port: 7222
  advertise: nats-gw-us-east.example.com:7222

  gateways: [
    {
      name: eu-west
      urls: [nats://nats-gw-eu-west.example.com:7222]
    }
  ]
}

Adapt this to your actual deployment, including TLS, authentication, account configuration, and the names of your regions. The important point is that the advertise value and the urls used for remote gateways must be reachable from the other NATS clusters.
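For instance, a TLS-enabled variant of the same gateway block might look like the sketch below. The certificate paths are hypothetical (for example, files mounted from a Kubernetes secret), and the fragment leaves out authentication and account settings entirely.

```
gateway {
  name: us-east
  port: 7222
  advertise: nats-gw-us-east.example.com:7222

  # Hypothetical paths; substitute wherever your certificates are mounted.
  tls {
    cert_file: "/etc/nats-certs/gateway/tls.crt"
    key_file: "/etc/nats-certs/gateway/tls.key"
    ca_file: "/etc/nats-certs/gateway/ca.crt"
  }

  gateways: [
    {
      name: eu-west
      urls: [nats://nats-gw-eu-west.example.com:7222]
    }
  ]
}
```

Two things are worth checking with a setup like this: the certificate should be valid for the external name that other regions dial (here nats-gw-us-east.example.com), and the load balancer should pass TCP through rather than terminate TLS, because NATS negotiates TLS after an initial plaintext exchange.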

Decide whether JetStream spans regions

Whether a cold restart of one region can stall on cross-region connectivity depends on how JetStream is partitioned across the supercluster.

By default, every JetStream-enabled server in a supercluster shares a single JetStream domain and forms one JetStream metadata group across all clusters. In that model, metadata leadership and recovery depend on gateway connectivity between regions. If a region restarts and cannot reach the other regions over gateways, its servers can keep reporting that JetStream has not established contact with a meta leader.

Alternatively, you can give each region its own JetStream domain. Each region then runs an independent metadata group that elects a leader using only its local servers, so a regional restart does not depend on cross-region gateway health. The tradeoff is that JetStream assets no longer span regions transparently: a stream lives in a single domain, and moving or replicating data between domains requires explicit cross-domain configuration, such as sourcing or mirroring through a domain-qualified JetStream API.

Neither option is automatically correct:

Choose a single shared domain when you need streams and consumers to behave as one JetStream system across regions, and you accept that cross-region gateway health is part of JetStream availability.
Choose one domain per region when you want each region’s JetStream to start and operate independently, and you can treat cross-region data movement as a separate, explicit concern.
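If you go with one domain per region, the server-side setting is small. The sketch below assumes a domain name equal to the region name and a hypothetical storage path. A server in the other region would carry the same block with its own domain, for example eu-west.

```
# Region us-east. The domain name and store_dir are illustrative assumptions.
jetstream {
  domain: us-east
  store_dir: "/data/jetstream"
}
```

Treat this as a starting point rather than a complete recipe: confirm on the server version you run that the regions elect metadata leaders independently, and plan how clients and cross-region sources or mirrors will address each domain.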

This is an architectural decision rather than a load balancer setting. The rest of this post applies either way, but the single-domain model is where readiness and health-check mistakes most easily turn into the deadlock described next.

Avoid circular readiness dependencies

The error:

JetStream has not established contact with a meta leader

means the server's JetStream subsystem has started but has not yet heard from a leader of the metadata group, either because no leader has been elected or because the server cannot reach it. During startup, that may be temporary. But in Kubernetes, a readiness or load balancer health check can accidentally turn that temporary state into a deadlock.

A common failure mode looks like this:

1. A regional NATS cluster starts with gateway configuration already enabled.
2. Kubernetes readiness or the external load balancer requires a JetStream condition that is not yet satisfied.
3. The load balancer does not route gateway traffic to the pods because they are not considered healthy.
4. If JetStream metadata leadership spans regions (a single shared JetStream domain), the pods cannot finish JetStream startup, because the cross-region gateway connectivity they need is unavailable.
5. Readiness never becomes healthy.

If disabling readiness makes the cluster start, that is a strong signal that readiness gating or load balancer health checks are involved. It does not mean readiness should simply be removed in production. It means the health checks need to match the bootstrap behavior you want.

Check what the load balancer is actually probing

The NATS monitoring port is commonly exposed on 8222, and health checks often use the /healthz endpoint.

For example, deployments may use a path like:

/healthz?js-server-only=true

The js-server-only=true form is intended to report on the JetStream subsystem of the individual server rather than on metadata leadership. With that parameter, a server can report healthy even when it has not yet established contact with a JetStream metadata leader; this behavior was confirmed for NATS Server 2.11.4. A plain /healthz with no parameters, by contrast, does check JetStream metadata health, so it reports unhealthy while there is no meta leader. If your probe is reporting the meta leader error, that is a hint the probe may not be using the parameters you expect.

Because the exact semantics of /healthz parameters can change between NATS Server versions, verify the behavior for the version you run. Also verify what is actually probed: Kubernetes service annotations, cloud provider load balancer behavior, Helm chart settings, and OpenShift configuration can all change the effective probe.

When troubleshooting, confirm all of the following:

The load balancer is using HTTP, not a mismatched TCP or HTTPS probe.
The probe targets the NATS monitoring port, commonly 8222.
The probe path is exactly what you expect, including query parameters.
The probe is checking the gateway service endpoints you intend it to check.
The Kubernetes pod readiness probe and the cloud load balancer health probe are not using different health semantics.

Do not assume that adding annotations or service options changed the cloud load balancer behavior. Check the effective load balancer configuration in the cloud or OpenShift control plane.

Bootstrap the local cluster first

Before debugging the supercluster, confirm that each regional cluster starts cleanly without gateways enabled.

A useful sequence is:

1. Deploy the NATS cluster in one region without gateway connectivity.
2. Confirm the local NATS cluster forms.
3. Confirm JetStream is healthy locally.
4. Expose the gateway port through the Kubernetes service or load balancer.
5. Add gateway configuration.
6. Roll out the change carefully.
7. Test a full cold restart of a region, not only a rolling update.

A rolling update can hide bootstrap dependencies because at least part of the system remains available while each pod restarts. A full regional restart is a better test of whether readiness checks, load balancer health checks, and gateway advertise settings are safe during cold start.

Practical checklist

When a Kubernetes-hosted NATS supercluster hangs with JetStream metadata readiness errors, check these items first:

Local routes do not depend on an external load balancer.
Each regional NATS cluster can become healthy on its own.
You have decided whether JetStream uses one shared domain across regions or a separate domain per region, and you know whether that makes local startup depend on cross-region gateways.
Gateway traffic uses externally reachable DNS names or addresses.
gateway.advertise points to an address reachable by other regions.
The load balancer fronts the regional gateway service, not the local cluster route mesh.
The load balancer health check does not require the full supercluster to already be healthy.
Pod readiness and load balancer health checks use intentional, compatible health semantics.
A full cold restart of a region succeeds, not just rolling updates.

Summary

For NATS superclusters on Kubernetes, keep the local cluster simple and healthy first. Use Kubernetes load balancers to expose gateway traffic between regions, usually with a single regional gateway endpoint. Make sure gateway.advertise uses a reachable external address, and verify that readiness and load balancer health checks do not create a circular dependency on JetStream metadata leadership during startup. Decide deliberately whether JetStream should span all regions as one domain or run as an independent domain per region, because that choice determines whether cross-region gateway health affects local JetStream startup.
