A NATS micro service that was running in the previous monitoring epoch is no longer responding. Zero instances of this service are discoverable via the NATS services protocol, meaning all instances have either crashed, lost their NATS connection, or been shut down. Any workload that depends on this service will experience request timeouts and errors.
NATS micro services use the built-in services protocol (defined in ADR-32) to register themselves on the NATS network. They’re discoverable via standard subjects ($SRV.PING, $SRV.INFO, $SRV.STATS) without any external service registry. When a service disappears entirely — going from one or more running instances to zero — every request sent to that service’s endpoints goes unanswered. Clients using request-reply patterns receive timeout errors. Downstream services that depend on it start failing. If the service handles a critical business function (payment processing, authentication, data enrichment), the outage impacts end users directly.
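Those discovery subjects follow a fixed pattern: a verb, optionally narrowed by a service name and instance id. As an illustrative sketch of how the ADR-32 subjects are composed (the helper name is ours, not part of any client library):

```python
def discovery_subject(verb, service=None, instance_id=None):
    """Build a services-protocol discovery subject.

    verb is PING, INFO, or STATS. With no further arguments the subject
    addresses every service on the network; adding a service name (and
    optionally an instance id) narrows the scope to one service or one
    instance.
    """
    parts = ["$SRV." + verb]
    if service:
        parts.append(service)
        if instance_id:
            parts.append(instance_id)
    return ".".join(parts)
```

Sending a request to `discovery_subject("PING", "order-service")` with a short timeout and counting the replies is essentially what `nats micro ping` does under the hood.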
Unlike a partial degradation where some instances remain, a complete service outage means there’s no capacity to handle any requests. NATS doesn’t queue requests for services that don’t exist — request messages have no subscribers, so they’re dropped. There’s no backlog building up to be processed when the service returns. Every request that arrives while the service is down is a failed request, and the calling client must handle the timeout.
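Because nothing is queued, every caller needs a policy for failed requests. A minimal sketch of one common approach, retry with exponential backoff; `RequestTimeout` and `send` here are stand-ins for your client library's timeout error and request call, not NATS APIs:

```python
import time


class RequestTimeout(Exception):
    """Stand-in for the client library's timeout / no-responders error."""


def request_with_retry(send, max_attempts=4, base_delay=0.5, cap=8.0):
    """Retry a request with exponential backoff, giving up after max_attempts.

    `send` is any callable that performs one request and raises
    RequestTimeout when no service instance responds in time.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except RequestTimeout:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # double the delay each attempt, capped
            time.sleep(min(cap, base_delay * (2 ** attempt)))
```

Re-raising after the last attempt matters: the caller must eventually see the failure so it can degrade gracefully instead of hanging.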
The services protocol also means NATS itself is healthy — the service disappeared from a functioning messaging system. The problem is in the application layer: the service process, its deployment, its dependencies, or its NATS connection. This distinction matters for troubleshooting. Don’t look at NATS server health first — look at the service itself.
Service process crashed. An unhandled exception, panic, segfault, or OOM kill terminated the process. If there’s no process supervisor (systemd, Kubernetes, etc.) or the supervisor has exhausted its restart limit, the service stays down.
NATS connection lost. The service is running but lost its connection to NATS. Expired credentials, a network change, or a server-side disconnection (slow consumer eviction, auth failure) severed the connection. Without a NATS connection, the service can’t respond to discovery pings and appears down even though the process is alive.
Failed deployment. A new version was deployed but failed to start. The old instances were terminated as part of the rollout, and the new instances crashed on startup due to a configuration error, missing dependency, or incompatible change. This is the most common cause during deployment windows.
Container or pod eviction. In Kubernetes, the service pod was evicted due to resource limits (memory, CPU), node pressure, or a preemption event. If the pod can’t be rescheduled (insufficient cluster resources, node affinity constraints, persistent volume binding), the service stays down.
Dependency failure. The service has a hard dependency (database, external API, configuration service) that’s unavailable. The service either exits on startup because it can’t connect to the dependency, or it connected to NATS but entered an unhealthy state where it can’t process pings.
Credential expiration. NATS credentials (JWT, NKey) used by the service expired. The NATS server rejects the connection, the service can’t reconnect, and it drops off the network. This commonly surfaces when credential rotation processes fail or when short-lived credentials aren’t being refreshed.
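NATS user JWTs are standard JWTs, so the expiration can be read straight out of the token payload. A minimal sketch that reports days until expiry, assuming the standard `exp` claim and skipping signature verification:

```python
import base64
import json
import time


def jwt_days_until_expiry(token, now=None):
    """Decode a JWT's payload (no signature check) and return days until `exp`.

    Returns None if the token carries no expiration claim.
    """
    payload_b64 = token.split(".")[1]
    # restore the base64url padding that JWTs strip
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    exp = claims.get("exp")
    if exp is None:
        return None
    now = time.time() if now is None else now
    return (exp - now) / 86400
```

A scheduled job that runs this against each service's credentials file and alerts below a threshold catches expiring credentials before the server starts rejecting connections.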
```
nats micro ping <service-name>
```

If no instances respond, the service is confirmed down. For a broader view of all services:

```
nats micro list
```

This shows all discoverable micro services and their instance counts. The affected service will either be absent or show zero instances.

```
nats micro stats <service-name>
```

If the service was recently running, cached stats may still be available, showing when the last requests were processed. This helps narrow the outage window.
On the host(s) where the service should be running:
```
# systemd-managed services
systemctl status <service-unit>
journalctl -u <service-unit> --since "30 min ago" | tail -50
```

```
# Kubernetes
kubectl get pods -l app=<service-name>
kubectl describe pod <pod-name>
kubectl logs <pod-name> --tail=100
```

```
# Docker
docker ps -a | grep <service-name>
docker logs <container-id> --tail 100
```

Look for exit codes, crash messages, and restart counts. A high restart count in Kubernetes (CrashLoopBackOff) indicates the service keeps crashing on startup.
If the service process is running but not appearing on NATS:
```
# Check all connections for the service name
nats server report connections | grep <service-name>
```

If the service doesn’t appear in the connection list, it’s not connected to NATS. Check the service logs for connection errors:

```
[ERR] nats: authorization violation
[ERR] nats: connection closed, reconnecting...
[ERR] nats: max reconnect attempts reached
```

If some instances might be partially alive:

```
# Try calling the service directly
nats request <service-subject> '{"test": true}' --timeout 5s
```

A timeout confirms no instances are processing requests. An error response would indicate the service is connected but unhealthy.
```
# Kubernetes
kubectl rollout status deployment/<service-name>
kubectl rollout history deployment/<service-name>

# Check if a recent rollout is stuck
kubectl get replicaset -l app=<service-name>
```

A stuck rollout (old instances terminated, new ones not starting) is a common cause of a complete service outage.
Restart the service process:
```
# systemd
systemctl restart <service-unit>

# Kubernetes — if it's in CrashLoopBackOff, fix the root cause first
kubectl rollout restart deployment/<service-name>

# Docker
docker restart <container-id>
```

If the deployment failed, roll back:

```
# Kubernetes
kubectl rollout undo deployment/<service-name>

# Verify the rollback succeeded
kubectl rollout status deployment/<service-name>
nats micro ping <service-name>
```

If credentials expired, refresh them:

```
# Regenerate NATS credentials
nsc generate creds -a <account> -n <user> -o /path/to/service.creds

# Restart the service with new credentials
systemctl restart <service-unit>
```

Verify the service is back:

```
nats micro ping <service-name>
nats micro info <service-name>
```

Run multiple service instances. A single instance is a single point of failure. NATS micro services naturally load-balance across instances using queue subscriptions — all instances of the same service share request load:
```go
// Go: create a NATS micro service with proper error handling
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/micro"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222",
		nats.Name("order-service"),
		nats.MaxReconnects(-1), // reconnect indefinitely
		nats.ReconnectWait(2*time.Second),
		nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
			log.Printf("Disconnected: %v, will reconnect", err)
		}),
		nats.ReconnectHandler(func(nc *nats.Conn) {
			log.Printf("Reconnected to %s", nc.ConnectedUrl())
		}),
	)
	if err != nil {
		log.Fatalf("connect failed: %v", err)
	}

	svc, err := micro.AddService(nc, micro.Config{
		Name:    "order-service",
		Version: "1.0.0",
	})
	if err != nil {
		log.Fatalf("add service failed: %v", err)
	}

	svc.AddEndpoint("create", micro.HandlerFunc(func(req micro.Request) {
		// Handle order creation
		req.Respond([]byte(`{"status": "created"}`))
	}))

	select {} // block so the service keeps serving requests
}
```

```python
# Python: NATS micro service with reconnection
import nats


async def run_service():
    # nats-py event callbacks must be coroutines, not plain lambdas
    async def on_disconnect():
        print("Disconnected, reconnecting...")

    async def on_reconnect():
        print("Reconnected")

    nc = await nats.connect(
        servers=["nats://localhost:4222"],
        max_reconnect_attempts=-1,  # reconnect indefinitely
        reconnect_time_wait=2,
        disconnected_cb=on_disconnect,
        reconnected_cb=on_reconnect,
    )

    # Service request handler
    async def handle_request(msg):
        await msg.respond(b'{"status": "ok"}')

    # Subscribe as a queue group for load balancing
    await nc.subscribe("orders.create", cb=handle_request, queue="order-service")
```

Configure process supervision with restart policies:
```yaml
# Kubernetes: ensure pods restart and spread across nodes
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
```

Implement health checks in your orchestrator that verify NATS connectivity, not just process liveness:

```
# Health check that verifies the NATS micro service is responding
nats micro ping <service-name> --count 1 --timeout 5s
```

Use canary deployments. Don’t replace all instances at once. Deploy the new version as additional instances alongside the old ones. Verify the new instances respond correctly via `nats micro stats`, then gradually remove old instances.
Implement credential rotation automation. Set up automated credential refresh well before expiration. Monitor credential expiry dates and alert when they’re within 30 days:
```
# Check credential expiry
nsc describe user -a <account> -n <user> | grep Expires
```

Monitor service availability continuously. Don’t rely on deployment-time checks alone; verify service health on an ongoing basis.
Synadia Insights monitors service availability by comparing instance counts between collection epochs. When a service that had instances in the previous epoch drops to zero, a critical alert fires immediately — catching outages even when your deployment system reports success.
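The same epoch-comparison idea is straightforward to replicate in a homegrown poller: record discovered instance counts on each poll and flag services that drop to zero. A minimal sketch of the comparison step (the counts would come from counting `nats micro ping` replies):

```python
def detect_outages(previous, current):
    """Compare instance counts between two polling epochs.

    `previous` and `current` map service name -> discovered instance count.
    Returns the services that had at least one instance before and have
    zero now -- complete outages, as opposed to partial degradations.
    """
    return sorted(
        name
        for name, count in previous.items()
        if count > 0 and current.get(name, 0) == 0
    )
```

A service missing entirely from the current poll counts as zero instances, which is exactly the failure mode this article describes.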
No. NATS micro services use request-reply, which requires an active subscriber to respond. If no service instances are subscribed, the request message has no subscribers and the caller receives a “no responders” error (or a timeout, depending on client configuration). To queue work for later processing, use JetStream streams and pull consumers instead of direct request-reply.
Use `nats micro ping <service-name>`. If it returns responses, the service is running but may be slow — check `nats micro stats <service-name>` for processing time metrics. If ping returns no responses, the service is genuinely down. For intermittent issues, the processing time and error count in stats help distinguish between “down” and “degraded.”
Yes. Each instance connects to its local NATS cluster and registers via the services protocol. If clusters are connected via gateways (supercluster), service discovery propagates across clusters — a request in cluster A can be served by an instance in cluster B. However, cross-cluster requests add gateway latency. For latency-sensitive services, deploy instances in each cluster where clients exist.
NATS micro services use the services protocol (ADR-32) which adds structured discovery ($SRV.PING), metadata ($SRV.INFO), and statistics ($SRV.STATS). A regular subscriber can handle requests but isn’t discoverable or monitorable through the standard protocol. The services framework also handles queue group setup, endpoint routing, and error responses automatically. Insights can monitor micro services because they’re discoverable — it can’t track arbitrary subscribers.
The most direct approach is periodic nats micro ping from a monitoring system. For Prometheus-based monitoring, expose the service instance count as a metric and alert when it drops below your minimum:
```
# Simple external check
nats micro ping <service-name> --count 1 --timeout 5s || echo "ALERT: service down"
```

Synadia Insights handles this automatically — it detects when a previously running service stops responding and fires a critical alert within one collection epoch.
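For the Prometheus route mentioned above, the instance count can be exposed as a gauge in the text exposition format and paired with an alert rule such as `nats_micro_service_instances < 1`. A minimal sketch of rendering that gauge (the metric name is our choice for illustration, not a standard):

```python
def render_instance_metric(counts):
    """Render service instance counts in the Prometheus text exposition format.

    `counts` maps service name -> discovered instance count. Serve the
    resulting text from an HTTP /metrics endpoint and alert when the gauge
    drops below your minimum.
    """
    lines = [
        "# HELP nats_micro_service_instances Discoverable NATS micro service instances",
        "# TYPE nats_micro_service_instances gauge",
    ]
    for name, count in sorted(counts.items()):
        lines.append(f'nats_micro_service_instances{{service="{name}"}} {count}')
    return "\n".join(lines) + "\n"
```

Exporting zero explicitly (rather than omitting the series) matters: an absent series cannot trip a `< 1` threshold, while an explicit `0` can.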