Series: NATS Edge Eventing Architecture

The Edge Talks Back: Building Bidirectional Control Into Your Architecture

The previous eight posts in this series described how data moves from edge to core. That’s the right place to start, but it’s only half the picture. An architecture that can only receive from the edge is an architecture that can observe problems but not respond to them.

I want to be direct about something I’ve noticed in how engineers think about edge-to-core systems, including some engineers who’ve read everything I’ve written on the subject and still walk away with the wrong mental model.

They think of the edge as a source. Data flows out. The core receives, processes, stores, acts. The edge is the place where things happen; the core is the place where those things are understood.

That model is incomplete. And the incompleteness shows up in production in a specific, painful way: your observability tells you something is wrong at the edge, and you have no programmatic path to do anything about it. You can see the bad data. You can alert on it. You can open a ticket and wait for a field team. What you can’t do is reach back in, in real time, and fix it.

An architecture built on NATS doesn’t have this limitation — not because bidirectionality was bolted on as a feature, but because the communication model is fundamentally symmetric. The same fabric that carries telemetry from edge to core carries commands from core to edge. The same subject hierarchy that makes telemetry routing declarative makes command targeting equally precise. The same security model that constrains what data crosses the boundary constrains what commands can be issued.

The edge isn’t just a source. It’s a participant. Build your architecture accordingly.

The pattern most teams miss: request-reply at the edge

When architects think about NATS for edge-to-core systems, they reach for pub/sub and streaming first. Those are the right tools for telemetry. But request-reply — NATS’s native mechanism for synchronous-style communication over an asynchronous backbone — is what makes the edge interactive.

The pattern is simple: a publisher sends a message to a subject with a reply-to address included. Subscribers on that subject process the message and send their response back to the reply-to address. The original publisher receives the response. To the application, it looks like a synchronous call. Under the hood, it’s asynchronous, decoupled, and routed through the same NATS topology as everything else.
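
The mechanics of the reply-to address can be sketched in a few lines. What follows is a toy in-process model using only the Python standard library; it assumes nothing about the real NATS client API. The ToyBus class, the _INBOX naming, and the firmware payload are all illustrative assumptions. It shows the requester creating a private reply subject, the responder answering on it, and the caller seeing what looks like a synchronous call with a timeout.

```python
import asyncio
import itertools


class ToyBus:
    """In-process stand-in for a message bus. Not the NATS client API."""

    def __init__(self):
        self._handlers = {}   # subject -> async handler(payload, reply_to)
        self._pending = {}    # reply subject -> Future waiting for one reply
        self._tasks = set()   # keep references so handler tasks are not collected
        self._ids = itertools.count(1)

    def subscribe(self, subject, handler):
        self._handlers[subject] = handler

    async def publish(self, subject, payload, reply_to=None):
        if subject in self._pending:
            # A reply landing on a requester's private reply subject.
            self._pending.pop(subject).set_result(payload)
        elif subject in self._handlers:
            # Hand off to the subscriber without blocking the publisher.
            task = asyncio.create_task(self._handlers[subject](payload, reply_to))
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)

    async def request(self, subject, payload, timeout=1.0):
        reply_to = f"_INBOX.{next(self._ids)}"  # hypothetical naming
        future = asyncio.get_running_loop().create_future()
        self._pending[reply_to] = future
        await self.publish(subject, payload, reply_to=reply_to)
        try:
            return await asyncio.wait_for(future, timeout)
        finally:
            self._pending.pop(reply_to, None)


async def demo():
    bus = ToyBus()

    async def device_info(payload, reply_to):
        # The responder only needs the reply subject, not the requester's identity.
        await bus.publish(reply_to, b'{"firmware": "1.4.2"}')

    bus.subscribe("device.info", device_info)
    return await bus.request("device.info", b"")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A request to a subject with no subscriber raises asyncio.TimeoutError once the timeout expires. That is the same signal a caller would use to decide that a device is unreachable.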

What makes this powerful at the edge is that the same leaf node topology that carries telemetry upstream carries requests downstream. A core service can issue a request to a subject that resolves to a specific device, a class of devices at a specific site, or a fleet-wide broadcast — and receive structured responses back. No separate protocol. No additional connection. No custom RPC layer. It’s built into the fabric.

This is what was described in a QCon presentation on fleet management with NATS: the ability to say nats request device.info and receive responses from every connected device in real time — live querying of fleet state, without a dedicated device registry, without polling infrastructure, without any additional software. The scatter-gather pattern, where a single request broadcast collects responses from multiple devices within a timeout window, gives you a live census of what’s actually online and what state it’s in.

That’s a fundamentally different operational posture than a system where the only information flow is outbound telemetry.

Responding to bad data at the source

Here is the scenario I designed the bidirectional control model to handle, and the one that makes the value concrete immediately.

Your edge telemetry pipeline is running. An anomaly detector in the core identifies a stream that’s emitting malformed readings: values outside physically plausible ranges, inconsistent timestamps, a signature that doesn’t validate against the expected schema. The alert fires. Now what?

In a passive architecture, the answer is: alert the team, open an incident, wait for someone to investigate, eventually quarantine the data or stop the ingest. In the meantime, corrupted data is propagating through your pipelines. Analytics are being computed on bad inputs. Downstream systems are making decisions based on signals that don’t represent reality.

In an interactive architecture built on NATS, the answer is different. The same system that detected the anomaly can issue a command — to that specific stream, at that specific edge node — to pause publication, purge accumulated messages, or modify stream configuration. The command is targeted at the subject that resolves to that device or stream. The response confirms receipt. The core system logs the intervention and continues monitoring.

This isn’t hypothetical. It’s the direct application of NATS’s request-reply pattern to operational control: send a command, receive a confirmation, take an action if confirmation doesn’t arrive within the timeout. The same subject scoping that constrains what telemetry crosses the boundary constrains what commands can be issued — so the command can only reach what the issuing credential permits, enforced at the topology level, not in application code.

The practical subject design for this looks like:

cmd.<site>.<device-class>.<device-id>.stream.pause

A core operator publishes to this subject. The device subscribed to it receives the command, pauses its stream publication, and responds to the reply-to address with a confirmation. The operator knows within a timeout whether the command was received and acted on — or whether the device is unreachable, in which case the topology’s disconnection handling takes over.
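
A minimal sketch of the operator side, under stated assumptions: the request argument stands for whatever request-reply call your client library provides (it is injected here so the sketch runs without a server), the b"ok" confirmation payload is hypothetical, and the token validation simply rejects characters that would change the meaning of a subject.

```python
import asyncio

# Characters that would split a token or turn it into a wildcard.
FORBIDDEN = set(".*> \t\r\n")


def token(value: str) -> str:
    """Validate one subject token so identifiers cannot widen the target."""
    if not value or any(ch in FORBIDDEN for ch in value):
        raise ValueError(f"invalid subject token: {value!r}")
    return value


def pause_subject(site: str, device_class: str, device_id: str) -> str:
    """Build cmd.<site>.<device-class>.<device-id>.stream.pause."""
    return ".".join(
        ["cmd", token(site), token(device_class), token(device_id), "stream", "pause"]
    )


async def send_command(request, subject: str, timeout: float = 2.0) -> str:
    """Issue a command and classify the outcome.

    `request` is any coroutine function (subject, payload) -> reply bytes;
    it is an assumption standing in for a real client's request call.
    """
    try:
        reply = await asyncio.wait_for(request(subject, b""), timeout)
    except asyncio.TimeoutError:
        return "unreachable"
    return "confirmed" if reply == b"ok" else "rejected"
```

Validating tokens before joining them matters more than it looks: an identifier containing a dot or a wildcard character would otherwise produce a subject that targets something other than the intended device.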

Compared to building this with a separate command channel, a separate credential model, and a separate protocol, building it in NATS is almost trivially simple. The hard thinking is in the subject design and the credential scoping, not in the infrastructure.

Configuration management that survives disconnection

Command-and-response handles real-time interventions. But a large class of operational control at the edge isn’t real-time — it’s configuration state that needs to be consistent, durable, and automatically applied whenever an edge node connects or reconnects.

This is where the NATS Key-Value store changes what’s possible. The KV store is built on JetStream — which means it’s persistent, replicated, and designed to survive connectivity gaps. A configuration value written to the KV store propagates to edge nodes when they’re connected. When a node disconnects and reconnects, it automatically receives the current state of any keys it watches. There’s no polling, no retry logic, no custom configuration sync service to build and maintain.

The operational pattern for a fleet configuration update looks like this:

  1. Operator writes the new configuration value to a KV bucket key — for example, config.fleet.sampling-rate
  2. All connected edge nodes watching that key receive the update immediately
  3. Disconnected edge nodes receive the update when they reconnect — automatically, with JetStream’s built-in catch-up mechanism
  4. The KV store retains a history of revisions for each key, up to the history depth configured on the bucket, giving you an audit trail of what configuration each device was running at any point within that retained window

That last point is more operationally significant than it first appears. When an incident occurs, being able to reconstruct “what configuration was this device running when this happened” is the difference between fast root cause analysis and days of log archaeology.
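
The numbered steps above can be modeled in plain Python to make the semantics concrete. This is an in-memory sketch, not the NATS KV client API. The ConfigBucket class, its method names, and the history depth of 2 are assumptions chosen for illustration. It shows three behaviors: live delivery to watchers, replay of the latest value when a watcher (re)attaches, and a bounded per-key revision history.

```python
from collections import defaultdict, deque


class ConfigBucket:
    """In-memory model of a watched key-value bucket with bounded history."""

    def __init__(self, history=5):
        # Each key keeps at most `history` (revision, value) pairs.
        self._revisions = defaultdict(lambda: deque(maxlen=history))
        self._watchers = defaultdict(list)
        self._seq = 0

    def put(self, key, value):
        """Store a new revision and notify every attached watcher."""
        self._seq += 1
        self._revisions[key].append((self._seq, value))
        for callback in list(self._watchers[key]):
            callback(key, value, self._seq)
        return self._seq

    def watch(self, key, callback):
        """Attach a watcher; replay the latest value first, return a detach function."""
        if self._revisions[key]:
            seq, value = self._revisions[key][-1]
            callback(key, value, seq)
        self._watchers[key].append(callback)
        return lambda: self._watchers[key].remove(callback)

    def history(self, key):
        """Return the retained (revision, value) pairs, oldest first."""
        return list(self._revisions[key])


if __name__ == "__main__":
    key = "config.fleet.sampling-rate"
    bucket = ConfigBucket(history=2)
    seen = []
    detach = bucket.watch(key, lambda k, v, rev: seen.append((rev, v)))
    bucket.put(key, "10s")   # delivered live
    detach()                 # node loses connectivity
    bucket.put(key, "5s")    # missed while offline
    bucket.put(key, "1s")
    bucket.watch(key, lambda k, v, rev: seen.append((rev, v)))  # reconnect
    print(seen, bucket.history(key))
```

Note what the reconnecting node receives in this model: the current value, not every intermediate revision it missed. If intermediate values matter for your audit, the history depth has to be sized for it.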

This is the OTA-style update model without the OTA infrastructure overhead. The same platform that carries telemetry upstream manages configuration state downstream — consistently, durably, and without adding a separate config management system to your edge deployment.

Fleet-wide operations at the scale of the subject hierarchy

The subject hierarchy that makes telemetry routing composable makes command targeting equally composable. And this is where the bidirectional model starts to reveal its full operational power.

Consider a rolling firmware upgrade across a fleet of edge devices. With a passive edge architecture, this is a field operations problem — you deploy firmware updates through a separate mechanism, track rollout status through a separate system, handle failures through a separate process. The eventing infrastructure you built for telemetry has nothing to do with it.

With an interactive architecture, the same fabric handles it:

  • Publish the new firmware artifact reference to a KV bucket key
  • Edge nodes watching that key receive the update and begin the local upgrade process
  • Each device publishes a status update to a subject like status.<site>.<device-id>.firmware.upgrade
  • A core consumer aggregates these responses and tracks rollout progress across the fleet
  • Devices that fail the upgrade respond with failure details; the core can issue a rollback command to those specific devices
  • Disconnected devices receive the firmware reference when they reconnect and begin the upgrade on their own schedule
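
The aggregation step in that list might look something like the sketch below. It assumes each status message carries a JSON body with a "state" field and that a rollback is requested on a cmd.<...>.firmware.rollback subject. Both are hypothetical conventions, not something the NATS platform defines. The function takes already-received (subject, payload) pairs, so it runs without a server.

```python
import json
from collections import Counter


def summarize_rollout(messages):
    """Reduce firmware status messages to per-state counts and rollback targets.

    `messages` is an iterable of (subject, payload_bytes). Later messages
    for the same device override earlier ones.
    """
    latest = {}
    for subject, payload in messages:
        tokens = subject.split(".")
        # Accept status.<...target tokens...>.firmware.upgrade and skip the rest.
        if len(tokens) < 4 or tokens[0] != "status" or tokens[-2:] != ["firmware", "upgrade"]:
            continue
        target = ".".join(tokens[1:-2])  # e.g. "<site>.<device-id>"
        latest[target] = json.loads(payload)["state"]  # assumed payload shape

    counts = Counter(latest.values())
    rollbacks = sorted(
        f"cmd.{target}.firmware.rollback"  # hypothetical command subject
        for target, state in latest.items()
        if state == "failed"
    )
    return counts, rollbacks


if __name__ == "__main__":
    sample = [
        ("status.plant1.dev1.firmware.upgrade", b'{"state": "in_progress"}'),
        ("status.plant1.dev1.firmware.upgrade", b'{"state": "done"}'),
        ("status.plant1.dev2.firmware.upgrade", b'{"state": "failed"}'),
    ]
    print(summarize_rollout(sample))
```

Keeping only the latest state per device means a device that reports in_progress and then done is counted once, which is what a rollout dashboard usually wants.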

The subject hierarchy provides the targeting precision. The KV store provides the durability for disconnected nodes. Request-reply provides the confirmation mechanism. JetStream streaming provides the audit trail. None of these are new capabilities — they’re the same primitives described in the previous eight posts, composed differently.

What changes is the operational posture. Instead of the edge being a system you observe and then separately manage, it becomes a system you can operate from the core in real time, at fleet scale, with the same security and observability guarantees that apply to your telemetry pipelines.

The scatter-gather census: knowing what’s actually out there

One of the persistent operational challenges in edge deployments is device inventory — not the static inventory in your CMDB, but the live inventory of what’s actually connected, what firmware it’s running, and what state it’s in right now.

Traditional approaches to this problem involve heartbeat messages, device registry services, and periodic polling — all of which add infrastructure, add failure surfaces, and still give you a picture that’s seconds or minutes out of date.

The scatter-gather pattern in NATS gives you a live census on demand. Issue a request to a shared subject like device.info. The request itself goes to a concrete subject; wildcards such as device.info.> apply to subscriptions, not to publishing. Every connected device whose subscription matches that subject responds with its current state: firmware version, configuration hash, connectivity metrics, whatever the application defines. Collect responses for a defined timeout window. The result is a snapshot of everything currently online, in real time, with zero additional infrastructure.

Devices that don’t respond within the timeout are presumed offline or unreachable, which is itself useful information, distinguishable from devices that responded with error states. The same request, issued from different credential contexts, reaches different subsets of the fleet, so a regional operator sees only the devices in their region, enforced by subject scoping, not by application-level filtering.

This is the pattern demonstrated in Synadia’s fleet management work and it illustrates something important about what NATS actually is at the edge. It’s not a message queue. It’s a connective fabric that makes the edge queryable, commandable, and configurable — from the core, in real time, at the granularity that operational requirements demand.
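
The collection window is the part of scatter-gather that is easy to get subtly wrong, so here is a minimal sketch of it. An asyncio.Queue stands in for the requester’s reply inbox, the reply shape ({"id": ..., "status": ...}) is an assumption, and the expected-device list is assumed to come from somewhere like the static inventory mentioned earlier. Without some expected list, absence cannot be detected at all.

```python
import asyncio


async def gather_replies(inbox: asyncio.Queue, window: float):
    """Collect every reply that arrives on `inbox` within `window` seconds."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + window
    replies = []
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            replies.append(await asyncio.wait_for(inbox.get(), remaining))
        except asyncio.TimeoutError:
            break
    return replies


def census(replies, expected_ids):
    """Split the fleet into healthy, error, and no-reply groups."""
    online = {reply["id"]: reply for reply in replies}
    return {
        "healthy": sorted(i for i, r in online.items() if r.get("status") == "ok"),
        "error": sorted(i for i, r in online.items() if r.get("status") != "ok"),
        "no_reply": sorted(set(expected_ids) - set(online)),
    }


async def demo():
    inbox = asyncio.Queue()

    async def device(device_id, delay, status):
        # Simulated device: answers after `delay` seconds.
        await asyncio.sleep(delay)
        await inbox.put({"id": device_id, "status": status})

    tasks = [
        asyncio.create_task(device("dev1", 0.01, "ok")),
        asyncio.create_task(device("dev2", 0.02, "sensor_fault")),
        asyncio.create_task(device("dev3", 0.5, "ok")),  # answers after the window
    ]
    replies = await gather_replies(inbox, window=0.2)
    for task in tasks:
        task.cancel()
    return census(replies, ["dev1", "dev2", "dev3", "dev4"])


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The deadline is computed once and each wait uses the time remaining, so a steady trickle of replies cannot stretch the window. A device that answers late lands in the same group as one that never answers, which is why the window length is an operational decision rather than an implementation detail.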

The security model is already there

The natural concern with bidirectional control is security: if core systems can send commands to edge devices, what prevents a compromised core credential from issuing commands it shouldn’t? What prevents a malformed command from reaching the wrong device?

The answer is that the security model described in post 3 of this series (scoped credentials, subject-level boundary constraints) constrains command traffic exactly as it constrains telemetry.

A credential issued to a core operations service can be scoped to publish only to subjects under cmd.<site>.>, where the trailing > covers every token after the site, so a full command subject such as cmd.<site>.<device-class>.<device-id>.stream.pause is in scope. That credential cannot issue commands outside its permitted subject namespace, regardless of what the application code tries to do. The topology enforces it. The Synadia platform provides the control plane for managing these credential scopes across environments at scale.

This is the architectural payoff of building security into the topology rather than layering it on top: the constraints that protect your telemetry pipelines protect your command channels automatically. You don’t need a separate access control model for bidirectional communication. The subject hierarchy is the access control model, for data flowing in both directions.
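
Because the subject hierarchy carries the access control, it helps to be precise about wildcard semantics: * matches exactly one token, and > matches one or more trailing tokens and must come last. The matcher below is a small re-implementation of those rules for reasoning and testing. It is a sketch, not the server’s code, and the site and device names are made up.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if `subject` matches `pattern` under NATS-style wildcards."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the final pattern token and needs at least one
            # subject token left to consume.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    # Without '>', token counts must agree exactly.
    return len(p_tokens) == len(s_tokens)


def may_publish(allowed_patterns, subject: str) -> bool:
    """Check a subject against a credential's publish allow-list."""
    return any(subject_matches(p, subject) for p in allowed_patterns)


if __name__ == "__main__":
    command = "cmd.plant1.chiller.dev42.stream.pause"
    print(may_publish(["cmd.plant1.>"], command))      # same site
    print(may_publish(["cmd.plant2.>"], command))      # different site
    print(may_publish(["cmd.plant1.*.*.*"], command))  # one token short
```

The last case is the one worth internalizing: a pattern built from single-token wildcards has to match the token count of the subject exactly, so scopes written with * break silently when the hierarchy grows another level.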

What changes when you design for this from the start

Bidirectional control isn’t a feature you add later. It’s a design choice that shows up in how you structure your subjects, how you scope your credentials, and what events your edge devices are written to respond to.

The subject design for a device that can receive commands looks like this, alongside the telemetry subjects it publishes to:

# Telemetry (device publishes)
telemetry.<site>.<line>.<device-id>.temperature
telemetry.<site>.<line>.<device-id>.pressure

# Commands (device subscribes)
cmd.<site>.<line>.<device-id>.stream.pause
cmd.<site>.<line>.<device-id>.stream.resume
cmd.<site>.<line>.<device-id>.config.reload

# Status (device publishes in response to commands)
status.<site>.<line>.<device-id>.stream.paused
status.<site>.<line>.<device-id>.config.applied

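The credential issued to the device is permitted to publish on telemetry.<site>.<line>.<device-id>.* and status.<site>.<line>.<device-id>.>, and to subscribe on cmd.<site>.<line>.<device-id>.>. The status and command subjects carry two tokens after the device ID, so they need the multi-token > wildcard rather than the single-token *. To answer requests, the device also needs permission to publish to the requester’s reply subject. Nothing else crosses its boundary. A compromised device credential can’t issue commands. A compromised core credential can’t publish telemetry that impersonates a device.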

This is what I mean when I say the edge is a participant rather than a source. It has a defined interface — subjects it publishes to, subjects it subscribes to, subjects it responds on. That interface is enforced by the topology. And because the interface is defined, it can be tested, monitored, and reasoned about with the same rigor as any other system boundary.
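
One way to make that interface testable is to derive it from the device’s identity and assert against it. The sketch below generates the allow-lists for one device and checks example subjects against them. For simplicity it scopes all three prefixes with a trailing >, which is slightly broader for telemetry than a single-token wildcard would be. The function names and the plant1/line3/dev42 identifiers are hypothetical, and reply-subject permissions are left out.

```python
def device_scope(site: str, line: str, device_id: str) -> dict:
    """Derive publish/subscribe allow-lists for one device."""
    base = f"{site}.{line}.{device_id}"
    return {
        "publish": [f"telemetry.{base}.>", f"status.{base}.>"],
        "subscribe": [f"cmd.{base}.>"],
    }


def allows(patterns, subject: str) -> bool:
    """True if `subject` falls under any pattern ending in '.>'.

    Only handles the '>'-terminated, wildcard-free prefixes produced by
    device_scope; it is not a general subject matcher.
    """
    for pattern in patterns:
        prefix = pattern[:-1]  # "cmd.a.b.c.>" -> "cmd.a.b.c."
        if subject.startswith(prefix) and len(subject) > len(prefix):
            return True
    return False


if __name__ == "__main__":
    scope = device_scope("plant1", "line3", "dev42")
    # The device may report status, but may not issue commands.
    print(allows(scope["publish"], "status.plant1.line3.dev42.stream.paused"))
    print(allows(scope["publish"], "cmd.plant1.line3.dev42.stream.pause"))
```

Checks like these can sit in a unit test alongside the credential definitions, so a change to the subject hierarchy that would lock a device out of its own status subjects fails in CI rather than in the field.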

The full picture

Looking back across the nine posts in this series, the architecture that emerges is one where:

  • The edge is a separate operational realm — not a remote extension of the core
  • Data moves from edge to core durably, with store-and-forward resilience and declarative flow control
  • The security boundary is enforced at the topology level, not in application code
  • A single hybrid platform handles pub/sub, streaming, KV state, and request-reply — without platform sprawl
  • Clustering is full-mesh, cost-effective, and scales linearly
  • Observability covers both infrastructure health and end-to-end event traceability
  • Streaming patterns are chosen deliberately — consume, mirror, or merge — based on consumer requirements
  • And the edge can receive commands, apply configuration, respond to queries, and confirm operations — making it an interactive, controllable participant, not a passive sensor array

This is the system I had in mind when I wrote Living on the Edge: Eventing for a New Dimension. Not a pipeline. A fabric. One that carries information in both directions, enforces constraints at the topology level, and gives operations teams the ability to act on what they observe — in real time, at fleet scale, without additional infrastructure.

If you’re building toward this architecture and want to understand how the Synadia platform makes it operationally manageable across environments, the platform page is the right place to start. If you want to go deeper on the NATS primitives — request-reply, KV store, subject design — the Synadia education resources and NATS documentation cover the technical depth. And if you’re designing for a specific operational scenario, talking to the team is how you get from patterns to production.

Build things that respond, not just things that report.


This post is the ninth in the Living on the Edge series, based on Synadia’s white paper Living on the Edge: Eventing for a New Dimension. The full series: the edge as an operating reality · why retry logic fails at the edge · why edge security is a topology problem · why flow control is a day-one architecture decision · why platform consolidation matters · why your clustering model is your cost model · why approximate observability costs you · mirror, merge, or consume · and this post: the edge talks back.
