In this blog, I’ll review how NATS makes operating distributed AI applications more efficient.
AI inference at the edge is increasingly possible thanks to GPUs on edge devices. Those GPUs may be a significant investment, but, paired with NATS.io, they offer serious efficiencies and ROI.
NATS.io is increasingly used as critical infrastructure by companies doing AI inference in the cloud at scale, including the operators of the world’s largest distributed GPU cluster, spread across 73 countries (https://x.com/inference_net/status/1883293387719409884), who rely on NATS to intelligently route inference requests, with a direct impact on both performance and cost.
In these deployments, NATS.io is used to transmit and intelligently distribute inference requests from the external-facing API servers to the hosts with (expensive) GPUs doing the actual inference computations. Two key NATS features are central here:
Distributed request/reply for low-latency, interactive inference.
CQRS-like distributed processing with Streams, used as dynamic work queues (see the sketch after this list).
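To make the second point concrete, here is a minimal sketch, in Go with the nats.go JetStream API, of creating a work-queue stream that buffers inference requests until a worker consumes them. The server URL, the INFERENCE_JOBS stream name, and the inference.jobs.> subject space are illustrative choices, not part of any standard setup.

```go
package main

import (
	"context"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	// Connect to the regional NATS cluster (URL is illustrative).
	nc, err := nats.Connect("nats://nats.example.com:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Work-queue retention: each inference request is kept until exactly one
	// worker consumes and acknowledges it, then it is removed from the stream.
	_, err = js.CreateStream(ctx, jetstream.StreamConfig{
		Name:      "INFERENCE_JOBS",             // hypothetical stream name
		Subjects:  []string{"inference.jobs.>"}, // hypothetical subject space
		Retention: jetstream.WorkQueuePolicy,
	})
	if err != nil {
		panic(err)
	}
}
```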
Service latency, especially “Time To First Token”, is critical and a primary measure of service quality. To keep this latency in check when offering a global AI inference service, providers deploy customer-facing points of presence in multiple regions of the world (and sometimes across different hyperscalers), so that customer requests (i.e. prompts) are directed to the closest regional point of presence, for example through DNS geo-routing services such as Route53.
The API gateways in a region will then send those prompts to the GPUs located in the same region, for example by leveraging NATS.io’s request/reply functionality.
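As an illustration, here is a minimal Go sketch of that request/reply pattern, with the gateway and a GPU worker shown in one process for brevity. The server URL, the inference.us-east subject, the gpu-workers queue group, and the runInference helper are all hypothetical.

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

// runInference stands in for the actual GPU-backed model invocation.
func runInference(prompt []byte) []byte {
	return append([]byte("completion for: "), prompt...)
}

func main() {
	// Connect to the regional NATS cluster (URL is illustrative).
	nc, err := nats.Connect("nats://nats.us-east.example.com:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	// GPU worker side: a queue subscription on a per-region subject, so each
	// prompt is handled by exactly one worker in the group.
	_, err = nc.QueueSubscribe("inference.us-east", "gpu-workers", func(msg *nats.Msg) {
		msg.Respond(runInference(msg.Data))
	})
	if err != nil {
		panic(err)
	}

	// API gateway side: send the prompt and wait for the response.
	resp, err := nc.Request("inference.us-east", []byte("Hello, world"), 30*time.Second)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(resp.Data))
}
```

In a real deployment the gateway and the workers would be separate processes; the per-region subject plus a queue group is one way to keep requests in-region by default.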
So far so good: this gives you AI inference distribution to your GPUs with geo-affinity and fault tolerance (if all the GPUs in a region go down, requests are transparently routed to other regions over the NATS.io super-cluster). However, you cannot control the usage patterns of your customers: at any given point you may have a burst of requests from customers in one region and very few from customers in another. This means that customers in one region see increased latency while you have GPUs sitting idle in another region.
Now, thanks to the new ‘priority groups’ feature introduced in version 2.11, NATS.io can go further and have a direct impact on the bottom line.
This feature means that you can go beyond simple distributed request/reply with geo-affinity: requests piling up in the work-queue stream of one region can be automatically and transparently consumed by workers in other regions (see the diagram below).
This ensures higher GPU utilization across the fleet while keeping latency (including tail latencies) in check.
Higher utilization of the expensive compute infrastructure you rent, combined with better customer satisfaction from keeping Time To First Token latencies (especially tail latencies) in check, has a direct impact on your infrastructure costs: you may not need to rent as many GPUs per region to meet your SLAs (you don’t need to ‘size for the peaks’ as much as you would otherwise), which also gives you a competitive advantage.
Diagram showing overflow consumption from a stream used as a work queue between two regions (can be expanded to more than two regions).
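As a rough sketch of how the overflow side could be configured, the snippet below creates a pull consumer on the work-queue stream from the earlier example and declares a priority group with the overflow policy. The consumer and group names and the URL are illustrative, and the PriorityGroups/PriorityPolicy fields assume a recent nats.go release with support for server 2.11 priority groups (ADR-42); older clients may not expose them.

```go
package main

import (
	"context"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect("nats://nats.example.com:4222") // illustrative URL
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// A shared pull consumer over the work-queue stream. Declaring a priority
	// group with the overflow policy lets each pull request state the
	// conditions under which it should receive messages.
	_, err = js.CreateOrUpdateConsumer(ctx, "INFERENCE_JOBS", jetstream.ConsumerConfig{
		Durable:        "gpu-workers",    // hypothetical consumer name
		AckPolicy:      jetstream.AckExplicitPolicy,
		PriorityGroups: []string{"jobs"}, // hypothetical group name
		PriorityPolicy: jetstream.PriorityPolicyOverflow,
	})
	if err != nil {
		panic(err)
	}

	// Workers in the stream's home region pull from group "jobs" with no
	// threshold, so they always receive work. Workers in other regions pull
	// from the same group with a minimum-pending threshold, so they only
	// start taking messages once a backlog builds up in the home region.
}
```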
Cloud inference brings its own challenges: expensive GPUs, unpredictable customer demand, and strict latency expectations. NATS helps organizations turn those challenges into opportunities by enabling intelligent request routing, workload overflow handling, and efficient utilization of global GPU fleets.
For providers, this translates to:
Higher utilization of the GPU fleet across all regions.
Lower infrastructure costs, since each region no longer has to be sized for peak demand on its own.
Time To First Token latencies (including tail latencies) kept within SLAs.
A competitive advantage over providers that cannot shift load between regions.
In short, NATS helps inference providers balance performance, cost, and scalability—turning unpredictable workloads into opportunities.