Why are NATS reconnects slow in Kubernetes?

Short answer: slow NATS reconnects in Kubernetes usually come from one of three places: the client has not detected that the current TCP connection is unusable, DNS is not resolving to reachable servers, or the server is advertising client URLs that the client cannot actually reach.

This FAQ is based on a recurring community question: why are NATS client reconnects in Kubernetes sometimes nearly instant, but other times take many seconds?

Reconnect speed has two separate parts:

Failure detection: how quickly the client decides the current connection is no longer usable.
Reconnect establishment: how quickly the client can connect to another reachable NATS server.

Reconnect wait and jitter settings affect only the second part. They do not speed up failure detection, which matters most when a firewall, network policy, or node issue silently drops packets instead of closing the TCP connection.

Why can NATS reconnects be fast sometimes and slow other times?

Because graceful shutdowns and hard network failures behave differently.

When a NATS server pod exits cleanly, the TCP connection can usually close in a way the client detects promptly. The client can then reconnect quickly.

When traffic is silently dropped, the client may not receive a TCP close. The connection can appear open to the operating system until unanswered NATS protocol pings or TCP-level timeouts prove otherwise. In that case, reconnects can appear to take 30 seconds or more even when reconnect wait settings are very small.

What should I check first?

First, identify the failure mode you are testing.

Is this a planned restart or rolling update?

For planned maintenance and rolling updates, use Lame Duck Mode. In Lame Duck Mode, a server stops accepting new clients and gradually moves existing clients off before it exits, so they can reconnect to other servers.

The official NATS Helm chart is designed to use this pattern during rolling updates.

Use this path when:

You are deleting or restarting NATS pods intentionally.
You are performing a Kubernetes rollout.
You want clients to move away from a server before it exits.

Is this a hard disconnect or packet drop?

A different failure mode is a network interruption where packets are silently dropped. Examples include:

firewall rule changes
Kubernetes network policy changes
node-level network failures
CNI or routing failures
other cases where the client does not receive a TCP close

In this situation, the client must detect that the connection is stale. Reconnect tuning alone does not solve this class of failure.

Do reconnect wait and jitter make NATS detect failures faster?

No. Reconnect wait and jitter control how the client spaces out reconnect attempts after it knows it needs to reconnect.

They do not guarantee that the client will detect a dead connection immediately.

For example, values like these may make retry attempts very aggressive:

reconnect minimum: 0 ms
reconnect maximum: 10 ms
reconnect jitter: 10 ms

Those settings can reduce the delay between reconnect attempts, but they do not solve a blackholed TCP connection. If the current socket still appears open to the operating system, the client first needs a way to conclude that the connection is unhealthy.
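To make the point concrete, here is a rough model in C#. It assumes the outage an application observes is approximately the failure detection time plus one reconnect delay. The function name EstimateOutage is hypothetical and not part of any NATS API, and the 30 second detection time is only the illustrative figure used earlier in this FAQ.

```csharp
using System;

// Hypothetical helper for illustration; not part of the NATS .NET client.
// Rough model: observed outage = time to notice the dead connection
// + delay before the first reconnect attempt (wait plus jitter).
static TimeSpan EstimateOutage(TimeSpan detection, TimeSpan reconnectWait, TimeSpan jitter)
    => detection + reconnectWait + jitter;

var wait = TimeSpan.FromMilliseconds(10);
var jitter = TimeSpan.FromMilliseconds(10);

// Graceful close: the client notices almost immediately.
var graceful = EstimateOutage(TimeSpan.Zero, wait, jitter);

// Blackholed connection: assume detection takes about 30 seconds.
var blackholed = EstimateOutage(TimeSpan.FromSeconds(30), wait, jitter);

Console.WriteLine($"Graceful close: about {graceful.TotalMilliseconds} ms");
Console.WriteLine($"Blackholed:     about {blackholed.TotalMilliseconds} ms");
```

In this model the reconnect settings contribute at most 20 ms in both cases, so the blackholed outage is dominated almost entirely by detection time.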

How should I use Kubernetes DNS for NATS reconnects?

Use Kubernetes DNS deliberately, and make sure the names your clients use resolve to addresses they can reach.

For clients running in the same Kubernetes network, a headless Service can be a good fit because its DNS name resolves to the addresses of the backing pods rather than to a single Service IP. That allows clients to discover NATS server addresses through Kubernetes DNS.

However, the important requirement is not that the Service is headless. The important requirement is that the client is configured with one or more names that resolve to reachable NATS servers.

Common patterns include:

A single DNS name that resolves to multiple NATS server addresses.
Multiple DNS names, one per NATS server.
A Kubernetes Service name, when that fits your network and routing model.

NATS clients can reconnect using their configured server URLs. If Kubernetes keeps the DNS record up to date, the client can resolve it again when reconnecting.
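One way to confirm what a client would see is to resolve the configured name from a pod in the client's network. The following is a minimal sketch that uses only the .NET standard library. The Service name is a hypothetical placeholder, so substitute the name your clients are actually configured with.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Resolve a host name and return every address DNS currently hands out for it.
static async Task<IPAddress[]> ResolveAsync(string hostName)
    => await Dns.GetHostAddressesAsync(hostName);

// Hypothetical headless Service name; pass the real one as the first argument.
var host = args.Length > 0 ? args[0] : "nats-headless.nats.svc.cluster.local";

try
{
    foreach (var address in await ResolveAsync(host))
        Console.WriteLine($"{host} -> {address}");
}
catch (SocketException ex)
{
    // Resolution failed: the name does not exist or DNS is unreachable from here.
    Console.WriteLine($"Could not resolve {host}: {ex.Message}");
}
```

Run it from the same namespace and network as the client workload. Resolving the name from a laptop or a different cluster can return different results.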

Should I enable advertised client URLs in Kubernetes?

Only enable advertised client URLs when each server advertises a host and port that clients can actually connect to.

Server-advertised client URLs can be useful in some deployments, but they can also be harmful if the advertised address is not reachable by the client.

In Kubernetes, this is a common source of confusion. A server may know itself by a pod hostname, pod IP, internal cluster name, or other address that is not the same address clients should use. If clients can only reach NATS through specific Kubernetes DNS names or service routes, advertising another name may not help and may make reconnect behavior worse.

Practical guidance:

If clients already have a reliable DNS name that resolves to reachable NATS servers, advertising client URLs may be unnecessary.
Only enable advertised client URLs after verifying each advertised address is reachable by clients.
An empty discovered URL list in the client logs is not necessarily the root cause. If the configured DNS URL is sufficient for reconnects, the client does not need discovered URLs.
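Before enabling advertised URLs, it can help to check from a client pod that each advertised host and port accepts TCP connections. The sketch below assumes .NET 5 or later and uses only the standard library. The endpoint names are hypothetical, and a successful connect only proves TCP reachability, not that a healthy NATS server is answering.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Returns true if a TCP connection to targetHost:targetPort succeeds within the timeout.
static async Task<bool> CanConnectAsync(string targetHost, int targetPort, TimeSpan timeout)
{
    using var client = new TcpClient();
    using var cts = new CancellationTokenSource(timeout);
    try
    {
        await client.ConnectAsync(targetHost, targetPort, cts.Token);
        return true;
    }
    catch (OperationCanceledException)
    {
        return false; // timed out, which is typical when packets are silently dropped
    }
    catch (SocketException)
    {
        return false; // refused, unreachable, or the name did not resolve
    }
}

// Hypothetical advertised endpoints; replace with what your servers advertise.
var endpoints = new[]
{
    ("nats-0.nats-headless.nats.svc.cluster.local", 4222),
    ("nats-1.nats-headless.nats.svc.cluster.local", 4222),
};

foreach (var (host, port) in endpoints)
{
    var reachable = await CanConnectAsync(host, port, TimeSpan.FromSeconds(2));
    Console.WriteLine($"{host}:{port} reachable = {reachable}");
}
```

Any endpoint that reports false from the client's network is one that should not be advertised to those clients.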

How do I tune NATS clients for hard disconnect detection?

For hard network disconnect testing, look at the client’s ping interval and maximum outstanding pings.

In the NATS .NET client, these options are configured on NatsOpts:

using System;
using NATS.Net;
using NATS.Client.Core;

await using var client = new NatsClient(new NatsOpts
{
    Url = "nats://demo.nats.io:4222",
    // Send a protocol ping every second.
    PingInterval = TimeSpan.FromSeconds(1),
    // Treat the connection as stale after 3 unanswered pings.
    MaxPingOut = 3,
});

With this kind of configuration, the client sends protocol pings periodically and treats the connection as stale after too many unanswered pings. The exact detection time depends on timing and implementation details, but conceptually a PingInterval of 1 second and MaxPingOut of 3 targets detection after a few missed ping intervals rather than waiting for a much longer TCP timeout.
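As a back-of-the-envelope aid, the sketch below estimates the detection window for a given pair of settings. It is a simplified model and not the client's actual algorithm: it assumes one ping per interval and that the connection is declared stale at the first ping tick where MaxPingOut pings are already unanswered. The function name is hypothetical.

```csharp
using System;

// Hypothetical helper for illustration; not part of the NATS .NET client.
static (TimeSpan Earliest, TimeSpan Latest) EstimateDetectionWindow(TimeSpan pingInterval, int maxPingOut)
{
    if (pingInterval <= TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(pingInterval));
    if (maxPingOut < 1) throw new ArgumentOutOfRangeException(nameof(maxPingOut));

    // Best case: the failure happens just before a ping is sent.
    var min = TimeSpan.FromTicks(pingInterval.Ticks * maxPingOut);
    // Worst case: the failure happens just after a ping was answered,
    // so almost one extra interval passes before the first unanswered ping.
    var max = TimeSpan.FromTicks(pingInterval.Ticks * (maxPingOut + 1));
    return (min, max);
}

var (earliest, latest) = EstimateDetectionWindow(TimeSpan.FromSeconds(1), 3);
Console.WriteLine($"Estimated detection window: {earliest.TotalSeconds} to {latest.TotalSeconds} seconds");
```

Under these assumptions, a 1 second interval with a limit of 3 gives a window of roughly 3 to 4 seconds. Treat the numbers as an order-of-magnitude guide and confirm the real behavior with a packet-drop test in your environment.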

There is a tradeoff: very aggressive ping settings can make clients more sensitive to transient latency, CPU pauses, overloaded servers, or short network hiccups. Choose values that match your environment and test them under realistic load and failure scenarios.

What is the recommended Kubernetes troubleshooting checklist?

When NATS reconnects are inconsistent in Kubernetes, work through this checklist:

Test graceful restarts separately from hard disconnects. A kubectl delete pod test and a firewall-drop test exercise different behavior.
Use Lame Duck Mode for planned rollouts. This gives clients a clean path away from a server that is about to exit.
Confirm the client URL resolves to reachable servers. For example, verify that your Kubernetes DNS name returns the expected records from the client network.
Avoid misleading advertised URLs. Do not advertise pod names, hostnames, or addresses that clients cannot reach.
Tune ping detection for blackholed connections. Reconnect wait settings are not a substitute for stale connection detection.
Coordinate with the platform team. Ask how firewall changes, network policies, pod termination, DNS, and node failures behave in your Kubernetes environment.

What should I ask my platform team?

If another team manages the Kubernetes environment, ask questions that separate NATS client behavior from platform networking behavior:

During pod termination, does the platform allow the process to shut down gracefully?
Are firewall or network policy changes applied in a way that may silently drop existing connections?
Are DNS records for headless Services updated promptly when pods are added or removed?
Can client workloads reach pod IPs directly, or must they use a Service path?
Are there node, CNI, or firewall timeouts that affect long-lived TCP connections?
Is there any infrastructure load balancer or proxy between clients and NATS?

These answers help determine whether you are debugging a NATS reconnect policy, stale TCP detection, or Kubernetes networking behavior.

Quick decision guide

For planned NATS pod restarts: use Lame Duck Mode and graceful shutdown behavior.
For DNS-related reconnect issues: verify that the configured URL resolves to reachable NATS servers from the client network.
For incorrect discovered URLs: review advertised client URLs and remove or correct addresses clients cannot reach.
For packet drops or blackholed connections: tune client ping detection; reconnect wait and jitter are not enough.
For inconsistent behavior across environments: involve the platform team and compare pod termination, DNS, network policy, and TCP handling.

Summary

Fast NATS reconnects in Kubernetes depend on both clean server discovery and timely failure detection. For normal rollouts, prefer Lame Duck Mode and graceful shutdown. For hard network failures, tune client ping detection, because reconnect wait and jitter only apply after the client has decided the existing connection is unusable. Use Kubernetes DNS or headless Services when they resolve to reachable addresses, and avoid advertised client URLs unless they are correct from the client’s point of view.
