
NATS Service Down: What It Means and How to Fix It

Severity: Critical
Category: Health
Applies to: Service
Check ID: SERVICE_002
Detection threshold: Zero service instances after previous epoch had instances

A NATS micro service that was running in the previous monitoring epoch is no longer responding. Zero instances of this service are discoverable via the NATS services protocol, meaning all instances have either crashed, lost their NATS connection, or been shut down. Any workload that depends on this service will experience request timeouts and errors.

Why this matters

NATS micro services use the built-in services protocol (defined in ADR-32) to register themselves on the NATS network. They’re discoverable via standard subjects ($SRV.PING, $SRV.INFO, $SRV.STATS) without any external service registry. When a service disappears entirely — going from one or more running instances to zero — every request sent to that service’s endpoints goes unanswered. Clients using request-reply patterns receive timeout errors. Downstream services that depend on it start failing. If the service handles a critical business function (payment processing, authentication, data enrichment), the outage impacts end users directly.
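Discovery is just scatter-gather request-reply over those subjects. As a rough illustration, here is a minimal Go sketch (assuming a local server at nats://localhost:4222 and the nats.go client) that collects a PING reply from every running service instance:

// Go: discover running micro services by scatter-gather on $SRV.PING
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Every running instance replies on the inbox with its
	// name, ID, and version.
	inbox := nats.NewInbox()
	sub, err := nc.SubscribeSync(inbox)
	if err != nil {
		log.Fatal(err)
	}
	if err := nc.PublishRequest("$SRV.PING", inbox, nil); err != nil {
		log.Fatal(err)
	}

	// Stop collecting once no instance has answered for 250ms.
	for {
		msg, err := sub.NextMsg(250 * time.Millisecond)
		if err != nil {
			break
		}
		fmt.Println(string(msg.Data))
	}
}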

Unlike a partial degradation where some instances remain, a complete service outage means there’s no capacity to handle any requests. NATS doesn’t queue requests for services that don’t exist — request messages have no subscribers, so they’re dropped. There’s no backlog building up to be processed when the service returns. Every request that arrives while the service is down is a failed request, and the calling client must handle the timeout.
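This failure mode is visible directly in the client. A minimal Go sketch of handling it (the subject orders.create is illustrative, and nc is assumed to be an established connection):

import (
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// A request to a service with zero instances fails; nothing is queued.
resp, err := nc.Request("orders.create", []byte(`{"id": 42}`), 5*time.Second)
switch {
case errors.Is(err, nats.ErrNoResponders):
	// No instance is subscribed: the server reported it immediately
	// and the request message is gone.
	log.Println("service down: no responders")
case errors.Is(err, nats.ErrTimeout):
	// An instance may exist but did not answer in time.
	log.Println("service slow or wedged: timeout")
case err != nil:
	log.Printf("request failed: %v", err)
default:
	log.Printf("reply: %s", resp.Data)
}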

That this check fired at all means NATS itself is healthy: the service disappeared from a functioning messaging system, discovered through the services protocol. The problem is in the application layer: the service process, its deployment, its dependencies, or its NATS connection. This distinction matters for troubleshooting. Don't look at NATS server health first; look at the service itself.

Common causes

  • Service process crashed. An unhandled exception, panic, segfault, or OOM kill terminated the process. If there’s no process supervisor (systemd, Kubernetes, etc.) or the supervisor has exhausted its restart limit, the service stays down.

  • NATS connection lost. The service is running but lost its connection to NATS. Expired credentials, a network change, or a server-side disconnection (slow consumer eviction, auth failure) severed the connection. Without a NATS connection, the service can’t respond to discovery pings and appears down even though the process is alive.

  • Failed deployment. A new version was deployed but failed to start. The old instances were terminated as part of the rollout, and the new instances crashed on startup due to a configuration error, missing dependency, or incompatible change. This is the most common cause during deployment windows.

  • Container or pod eviction. In Kubernetes, the service pod was evicted due to resource limits (memory, CPU), node pressure, or a preemption event. If the pod can’t be rescheduled (insufficient cluster resources, node affinity constraints, persistent volume binding), the service stays down.

  • Dependency failure. The service has a hard dependency (database, external API, configuration service) that’s unavailable. The service either exits on startup because it can’t connect to the dependency, or it connected to NATS but entered an unhealthy state where it can’t process pings.

  • Credential expiration. NATS credentials (JWT, NKey) used by the service expired. The NATS server rejects the connection, the service can’t reconnect, and it drops off the network. This commonly surfaces when credential rotation processes fail or when short-lived credentials aren’t being refreshed.

How to diagnose

Confirm the service is down

Terminal window
nats micro ping <service-name>

If no instances respond, the service is confirmed down. For a broader view of all services:

Terminal window
nats micro list

This shows all discoverable micro services and their instance counts. The affected service will either be absent or show zero instances.

Check if the service was recently running

Terminal window
nats micro stats <service-name>

If an instance is still connected but unhealthy, stats may still come back, showing when its last requests were processed. This helps narrow the outage window.

Check the service process

On the host(s) where the service should be running:

Terminal window
# systemd-managed service
systemctl status <service-unit>
journalctl -u <service-unit> --since "30 min ago" | tail -50
# Kubernetes
kubectl get pods -l app=<service-name>
kubectl describe pod <pod-name>
kubectl logs <pod-name> --tail=100
# Docker
docker ps -a | grep <service-name>
docker logs <container-id> --tail 100

Look for exit codes, crash messages, and restart counts. A high restart count in Kubernetes (CrashLoopBackOff) indicates the service keeps crashing on startup.

Check NATS connection status

If the service process is running but not appearing on NATS:

Terminal window
# Check all connections for the service name
nats server report connections | grep <service-name>

If the service doesn’t appear in the connection list, it’s not connected to NATS. Check the service logs for connection errors:

[ERR] nats: authorization violation
[ERR] nats: connection closed, reconnecting...
[ERR] nats: max reconnect attempts reached

Test the service endpoint directly

If some instances might be partially alive:

Terminal window
# Try calling the service directly
nats request <service-subject> '{"test": true}' --timeout 5s

A timeout confirms no instances are processing requests. An error response would indicate the service is connected but unhealthy.

Check for recent deployment activity

Terminal window
# Kubernetes
kubectl rollout status deployment/<service-name>
kubectl rollout history deployment/<service-name>
# Check if a recent rollout is stuck
kubectl get replicaset -l app=<service-name>

A stuck rollout (old instances terminated, new ones not starting) is a common cause of complete service outage.

How to fix it

Immediate: restore the service

Restart the service process:

Terminal window
# systemd
systemctl restart <service-unit>
# Kubernetes — if it's in CrashLoopBackOff, fix the root cause first
kubectl rollout restart deployment/<service-name>
# Docker
docker restart <container-id>

If the deployment failed, roll back:

Terminal window
# Kubernetes
kubectl rollout undo deployment/<service-name>
# Verify the rollback succeeded
kubectl rollout status deployment/<service-name>
nats micro ping <service-name>

If credentials expired, refresh them:

Terminal window
# Regenerate NATS credentials
nsc generate creds -a <account> -n <user> -o /path/to/service.creds
# Restart the service with new credentials
systemctl restart <service-unit>

Verify the service is back:

Terminal window
nats micro ping <service-name>
nats micro info <service-name>

Short-term: improve resilience

Run multiple service instances. A single instance is a single point of failure. NATS micro services naturally load-balance across instances using queue subscriptions — all instances of the same service share request load:

// Go: create a NATS micro service with proper error handling
import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/micro"
)

nc, err := nats.Connect("nats://localhost:4222",
	nats.Name("order-service"),
	nats.MaxReconnects(-1), // reconnect indefinitely
	nats.ReconnectWait(2*time.Second),
	nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
		log.Printf("Disconnected: %v, will reconnect", err)
	}),
	nats.ReconnectHandler(func(nc *nats.Conn) {
		log.Printf("Reconnected to %s", nc.ConnectedUrl())
	}),
)
if err != nil {
	log.Fatalf("Failed to connect to NATS: %v", err)
}

svc, err := micro.AddService(nc, micro.Config{
	Name:    "order-service",
	Version: "1.0.0",
})
if err != nil {
	log.Fatalf("Failed to register service: %v", err)
}

// Each endpoint is backed by a queue subscription, so all instances
// with the same service name share the request load.
svc.AddEndpoint("create", micro.HandlerFunc(func(req micro.Request) {
	// Handle order creation
	req.Respond([]byte(`{"status": "created"}`))
}))
# Python: NATS micro service with reconnection
import nats

# nats-py requires coroutine callbacks, not plain lambdas
async def on_disconnect():
    print("Disconnected, reconnecting...")

async def on_reconnect():
    print("Reconnected")

async def run_service():
    nc = await nats.connect(
        servers=["nats://localhost:4222"],
        max_reconnect_attempts=-1,  # reconnect indefinitely
        reconnect_time_wait=2,
        disconnected_cb=on_disconnect,
        reconnected_cb=on_reconnect,
    )

    # Service request handler
    async def handle_request(msg):
        await msg.respond(b'{"status": "ok"}')

    # Subscribe as a queue group for load balancing
    await nc.subscribe("orders.create", cb=handle_request, queue="order-service")

Configure process supervision with restart policies:

# Kubernetes: ensure pods restart and spread across nodes
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: <service-name>

Implement health checks in your orchestrator that verify NATS connectivity, not just process liveness:

Terminal window
# Health check that verifies NATS micro service is responding
nats micro ping <service-name> --count 1 --timeout 5s
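If you'd rather not shell out to the CLI from a probe, the same check can be done against the services protocol directly. A sketch in Go, suitable as an exec probe or sidecar check (the service name order-service and server URL are assumptions):

// Go: exit non-zero when no instance of the service answers a ping
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		fmt.Fprintln(os.Stderr, "connect failed:", err)
		os.Exit(1)
	}
	defer nc.Close()

	// $SRV.PING.<name> is answered by any live instance of that service.
	if _, err := nc.Request("$SRV.PING.order-service", nil, 5*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, "service not responding:", err)
		os.Exit(1)
	}
}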

Long-term: build zero-downtime deployment

Use canary deployments. Don’t replace all instances at once. Deploy the new version as additional instances alongside the old ones. Verify the new instances respond correctly via nats micro stats, then gradually remove old instances.
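A sketch of that verification step in Go, collecting a stats reply from each instance and printing its version and error counts (the service name is illustrative; the field names follow the services protocol's stats response):

// Go: collect $SRV.STATS replies and summarize each instance
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// subset of the stats response defined by the services protocol
type stats struct {
	ID        string `json:"id"`
	Version   string `json:"version"`
	Endpoints []struct {
		Name        string `json:"name"`
		NumRequests int    `json:"num_requests"`
		NumErrors   int    `json:"num_errors"`
	} `json:"endpoints"`
}

func main() {
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	inbox := nats.NewInbox()
	sub, _ := nc.SubscribeSync(inbox)
	nc.PublishRequest("$SRV.STATS.order-service", inbox, nil)

	for {
		msg, err := sub.NextMsg(250 * time.Millisecond)
		if err != nil {
			break // no more instances responding
		}
		var s stats
		if err := json.Unmarshal(msg.Data, &s); err != nil {
			continue
		}
		for _, ep := range s.Endpoints {
			fmt.Printf("%s %s endpoint=%s requests=%d errors=%d\n",
				s.ID, s.Version, ep.Name, ep.NumRequests, ep.NumErrors)
		}
	}
}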

Implement credential rotation automation. Set up automated credential refresh well before expiration. Monitor credential expiry dates and alert when they’re within 30 days:

Terminal window
# Check credential expiry
nsc describe user -a <account> -n <user> | grep Expires
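For automation, the expiry can also be read directly from a creds file with the nats-io/jwt library. A sketch (the file path is a placeholder, matching the earlier example):

// Go: alert when NATS user credentials expire within 30 days
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	jwt "github.com/nats-io/jwt/v2"
)

func main() {
	contents, err := os.ReadFile("/path/to/service.creds")
	if err != nil {
		log.Fatal(err)
	}
	// Extract the user JWT from the decorated creds file.
	token, err := jwt.ParseDecoratedJWT(contents)
	if err != nil {
		log.Fatal(err)
	}
	claims, err := jwt.DecodeUserClaims(token)
	if err != nil {
		log.Fatal(err)
	}
	if claims.Expires == 0 {
		fmt.Println("credentials never expire")
		return
	}
	expiry := time.Unix(claims.Expires, 0)
	if time.Until(expiry) < 30*24*time.Hour {
		fmt.Printf("ALERT: credentials expire %s\n", expiry)
		os.Exit(1)
	}
	fmt.Printf("credentials valid until %s\n", expiry)
}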

Monitor service availability continuously. Don’t rely on deployment-time checks. Continuously verify service health.

Synadia Insights monitors service availability by comparing instance counts between collection epochs. When a service that had instances in the previous epoch drops to zero, a critical alert fires immediately — catching outages even when your deployment system reports success.

Frequently asked questions

Does NATS queue requests while a service is down?

No. NATS micro services use request-reply, which requires an active subscriber to respond. If no service instances are subscribed, the request message has no subscribers and the caller receives a “no responders” error (or a timeout, depending on client configuration). To queue work for later processing, use JetStream streams and pull consumers instead of direct request-reply.
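If work must survive an outage, publish it to a stream instead of requesting directly. A minimal sketch using the nats.go jetstream API, with error handling elided for brevity (stream, subject, and consumer names are illustrative; nc is an established connection):

import (
	"context"

	"github.com/nats-io/nats.go/jetstream"
)

ctx := context.Background()
js, _ := jetstream.New(nc)

// Messages published to orders.> are persisted even when no consumer is running.
stream, _ := js.CreateOrUpdateStream(ctx, jetstream.StreamConfig{
	Name:     "ORDERS",
	Subjects: []string{"orders.>"},
})

// Producer side: this succeeds and is stored even while the worker is down.
js.Publish(ctx, "orders.create", []byte(`{"id": 42}`))

// The service drains the backlog with a durable pull consumer when it returns.
cons, _ := stream.CreateOrUpdateConsumer(ctx, jetstream.ConsumerConfig{
	Durable:   "order-worker",
	AckPolicy: jetstream.AckExplicitPolicy,
})
msgs, _ := cons.Fetch(10)
for msg := range msgs.Messages() {
	// process, then acknowledge so the message isn't redelivered
	msg.Ack()
}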

How do I tell if a service is down vs. just slow?

Use nats micro ping <service-name>. If it returns responses, the service is running but may be slow — check nats micro stats <service-name> for processing time metrics. If ping returns no responses, the service is genuinely down. For intermittent issues, the processing time and error count in stats help distinguish between “down” and “degraded.”

Can I run the same service on multiple NATS clusters?

Yes. Each instance connects to its local NATS cluster and registers via the services protocol. If clusters are connected via gateways (supercluster), service discovery propagates across clusters — a request in cluster A can be served by an instance in cluster B. However, cross-cluster requests add gateway latency. For latency-sensitive services, deploy instances in each cluster where clients exist.

What’s the difference between a NATS micro service and a regular subscriber?

NATS micro services use the services protocol (ADR-32) which adds structured discovery ($SRV.PING), metadata ($SRV.INFO), and statistics ($SRV.STATS). A regular subscriber can handle requests but isn’t discoverable or monitorable through the standard protocol. The services framework also handles queue group setup, endpoint routing, and error responses automatically. Insights can monitor micro services because they’re discoverable — it can’t track arbitrary subscribers.
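The difference shows up directly in code: a plain queue subscriber in Go can serve the same requests, but it registers nothing on the $SRV subjects, so nats micro list never sees it. A sketch, assuming an established nc connection:

// Go: a plain queue subscriber serves requests but is invisible
// to the services protocol: no $SRV.PING/INFO/STATS responses.
nc.QueueSubscribe("orders.create", "order-service", func(msg *nats.Msg) {
	msg.Respond([]byte(`{"status": "created"}`))
})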

How do I set up alerting for service availability?

The most direct approach is periodic nats micro ping from a monitoring system. For Prometheus-based monitoring, expose the service instance count as a metric and alert when it drops below your minimum:

Terminal window
# Simple external check
nats micro ping <service-name> --count 1 --timeout 5s || echo "ALERT: service down"
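A sketch of the Prometheus approach in Go using the prometheus/client_golang library (the metric name, service name, and port are assumptions):

// Go: expose live instance count of a micro service as a Prometheus gauge
package main

import (
	"net/http"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var instances = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "nats_service_instances",
	Help: "Live instances of order-service discovered via $SRV.PING",
})

func main() {
	prometheus.MustRegister(instances)

	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		panic(err)
	}

	// Re-count instances every 30 seconds in the background.
	go func() {
		for {
			instances.Set(float64(countInstances(nc)))
			time.Sleep(30 * time.Second)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}

// countInstances counts replies to $SRV.PING.order-service.
func countInstances(nc *nats.Conn) int {
	inbox := nats.NewInbox()
	sub, err := nc.SubscribeSync(inbox)
	if err != nil {
		return 0
	}
	defer sub.Unsubscribe()
	nc.PublishRequest("$SRV.PING.order-service", inbox, nil)
	n := 0
	for {
		if _, err := sub.NextMsg(250 * time.Millisecond); err != nil {
			break
		}
		n++
	}
	return n
}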

Synadia Insights handles this automatically — it detects when a previously running service stops responding and fires a critical alert within one collection epoch.

Proactive monitoring for NATS service down with Synadia Insights

With 100+ always-on audit Checks from the NATS experts, Insights helps you find and fix problems before they become costly incidents.
No alert rules to write. No dashboards to maintain.

Start a 14-day Insights trial