Factory Floor Intelligence: MachineMetrics, Edge Computing, and the “freedom” of NATS
With thousands of edge devices at hundreds of customer sites, MachineMetrics provides production intelligence for smart manufacturing with NATS and Synadia.
MachineMetrics ingests data from devices, sensors, and PLCs on factory floors, contextualizes it with data from other systems, and then analyzes it, accelerating real-time insights into production processes such as tool monitoring. With NATS as the data backbone in its edge application, MachineMetrics helps its customers catch and avoid catastrophic incidents on their machinery, saving serious time and money.
Watch the recording to hear Jochen Rau and Tyler Schoppe of MachineMetrics break down their entire NATS architecture.
"The philosophy behind NATS was very attractive as it allows us to basically decouple where data is consumed and where we have our compute. We don’t have to care about where the data is produced and where it is consumed...NATS is the only technology in the space that really decouples the addressing of data from the access point of data. That separation is an absolutely unique and powerful feature in NATS that is underappreciated. Once you use it, it leads to a very 'free' architecture and high resiliency."
— Jochen Rau, Data Platform Engineering Manager, MachineMetrics
Go Deeper
Full Transcript
Andrew Connolly:
Okay. Hello, folks. I'm Andrew Connolly, a member of the marketing team here at Synadia. In this next session, we're talking to longtime NATS users and Synadia customers, MachineMetrics.
MachineMetrics provides an industrial IoT solution, giving manufacturers real-time insights from their machine data. But I'm not gonna steal any more of their thunder. So with me today from MachineMetrics are Jochen Rau and Tyler Schoppe, and I'm gonna let them introduce themselves here briefly.
Jochen Rau:
Yeah. Hi. Thanks for having us. My name is Jochen Rau. I'm the engineering manager of the data platform team at MachineMetrics.
It's one of three teams. And, as you said, we're an IoT company, and we're connecting to thousands of machines at hundreds of customer sites and factory floors. And NATS came in very handy, which we'll talk about today.
Tyler Schoppe:
Yeah. And I'm Tyler Schoppe. I'm a staff engineer on Jochen's platform team, and also our in-house NATS evangelist.
Jochen Rau:
So, yeah, our tech platform was basically a traditional AWS-based platform. We used, and are still using, a lot of AWS products like Kinesis for streaming, RDS databases, S3 buckets, and so on. And we have a lot of deployed edge devices, thousands of edge devices that the machines connect to. Right? Those little green boxes.
I have one here. And on those, we couldn't really run a good pub/sub solution. So we were searching for something that could work like Kinesis but on the edge device. We looked around, and NATS was very attractive because of the size of the binary and how few resources it uses. And the philosophy behind NATS was very attractive, as it allows us to basically decouple where data is consumed and where we have our compute.
So we can move compute around very freely, and data is ubiquitous. Right? We don't have to care about where the data is produced and where it is consumed. We have this freedom. And that was the initial point: we looked at different solutions, different vendors, but NATS, from our perspective, had the right philosophy.
Right?
Andrew Connolly:
Very nice. I love that you have the edge device right there with you to show off. Makes for a perfect segue. So I think I described MachineMetrics as industrial IoT. Can you give us a little more background or color on what that looks like in practice?
What are the couple of core use cases that you serve for your customers?
Jochen Rau:
Yeah. Our typical customer is a mid-sized shop with a factory floor. They have dozens of machines. Usually, they have individual operators on the machines, who might walk around the machines to operate them, provide material, and set them up for a new product, a new part that they produce.
But in general, how they traditionally work is with spreadsheets and paper to coordinate their work, to know what the machine is doing, if the machine is active, how well they actually utilize their machines. The machines are very expensive. Right? So they wanna utilize them well to reduce costs. And that's a challenge with very different vendors of machines, very different machine controls, and so on.
And that's where we traditionally come in. We can connect to a lot of PLCs, the controls of the machine, and gather data. We have sensors that we can put in to, for example, count parts if that's not delivered by the control.
So we collect a bunch of data from different machines, and we ingest that data. That's the traditional way: we ingest the data into the cloud, and then we contextualize it. We provide connections to ERP systems, the planning systems that factory floors use at a higher level.
Right? So, what are the parts that we need to produce, the delivery deadlines, and so on. We connect to those systems, contextualize the machine data, and provide analytics on top of it. Essentially, our customers log in to a dashboard and see a lot of analytical data.
They can do production planning and steer their production, and also be alerted if something goes wrong, and so on. So that's what we essentially deliver. On top of that, we also have solutions for tool monitoring. Tool breaks are very expensive. Right?
We can predict, before a tool breaks, that it will break in the next few cycles, so they can change it without having a catastrophic incident on the machine that costs a lot of money and time. Right? So we have that, and that runs first and foremost on the edge device as compute.
Andrew Connolly:
Okay. So it sounds like the big theme is these machines on factory floors throwing off valuable data of all kinds, but it was not always easy to get access to that data, to get that data to the right place. So, Tyler, it sounds like NATS was in place to some degree when you got to MachineMetrics. Can you talk about how you guys have grown with NATS? Maybe how it was first used in the solution, and how it's grown to do more and different things?
Tyler Schoppe:
Yeah. Certainly. I mean, like we talked about, we have so many edge devices out in the field, and that works really nicely with the leaf node model that NATS provides, that hub-and-spoke type of architecture. So that was really one of the initial use cases. Right?
So, having all of our edge devices have the ability to communicate from where they are back into our cloud. We've started to extend and really build upon a lot of the different features within NATS, since it can do so many different things. And as we've started to build more distributed systems, it's become that really core mesh layer for service-to-service communication. Being able to do so many different building patterns using one familiar tool just makes it easy: once you get the basics of what you're doing with NATS, you can continue to move quickly and handle really any kind of situation. So a lot of it started out as simple service development with core NATS, and that's certainly expanded into some pretty heavy JetStream usage: work queues, lots of different processing, especially with the consumer model. It makes it so easy to scale services up and down as you need, really with just one piece of code that can expand and shrink.
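The consumer model Tyler describes, one piece of code scaled up and down across workers, maps to NATS queue-group semantics: every plain subscriber sees each message, but only one member of a queue group does. Here is a toy in-process sketch of that delivery rule; no real broker is involved, and `TinyBroker` and all names are invented for illustration:

```python
import itertools
from collections import defaultdict

class TinyBroker:
    """Toy model of core NATS queue-group delivery: plain subscribers
    each get every message, but within a named queue group only one
    member receives it (round-robin here for determinism)."""

    def __init__(self):
        self.subs = []                            # (subject, queue, callback)
        self._rr = defaultdict(itertools.count)   # round-robin counter per (subject, queue)

    def subscribe(self, subject, cb, queue=None):
        self.subs.append((subject, queue, cb))

    def publish(self, subject, msg):
        # Group matching subscriptions by queue name.
        groups = defaultdict(list)
        for subj, queue, cb in self.subs:
            if subj == subject:
                groups[queue].append(cb)
        for queue, cbs in groups.items():
            if queue is None:
                for cb in cbs:                    # fan-out to plain subscribers
                    cb(msg)
            else:                                 # exactly one member per queue group
                i = next(self._rr[(subject, queue)]) % len(cbs)
                cbs[i](msg)

# Three workers share the "workers" queue; each event is handled once,
# spread across them. Scaling up is just adding another subscriber.
broker = TinyBroker()
handled = defaultdict(list)
for name in ("w1", "w2", "w3"):
    broker.subscribe("parts.count", lambda m, n=name: handled[n].append(m), queue="workers")
for i in range(6):
    broker.publish("parts.count", i)
```

Six publishes across three queue-group workers land two messages on each worker, with every message delivered exactly once; in real NATS you get the same behavior by starting more instances of the same subscriber code.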
We've really started to leverage key-value as well. I mean, so many of the tools that Jochen mentioned on the AWS stack, like Kinesis and ElastiCache, are things we do continue to use based on their prevalence within our architecture. But new development really is all on NATS, using JetStream and key-value.
Andrew Connolly:
Yeah. I think that's an interesting thread to pull on. How do you guys think about decisions around replacing existing tools with new tooling like NATS? Is it generally that new development gets new tooling, or are there scenarios where you will rip and replace to get some benefit?
Tyler Schoppe:
Yeah. I think we've seen both in practice, right, where we need to be able to move quickly as a small team. So the challenge of saying, oh, let's just replace Kinesis with NATS for all our data ingestion, that's a massive project. It's something we're hoping to get to, but is it always worth the time right off the bat? For a lot of new development, though, it's pretty much just NATS.
And then we've seen in some cases, too, when we wanna have higher levels of guarantees, that it might make sense to use NATS over, say, something like Redis, just to get better pub/sub guarantees with stuff we have on internal message brokers. And when we need exactly-once guarantees, that's the tool we wanna use for the job.
Andrew Connolly:
Mhmm. And how about interoperability? When you have a legacy stack and you're using NATS to add on new capabilities, do you find that things play well together? Has that been easy? Have you had to come up with strategies to manage that?
Tyler Schoppe:
Yeah. That's a pretty good question. I mean, we've certainly used NATS in a few different stacks. We currently have NATS in production in JavaScript and TypeScript environments, as well as some Rust and Python services. We're a pretty heavy JavaScript shop, and we have spent some time as a platform team putting together some intermediate tooling on top of NATS that abstracts away a lot of the interactions with NATS.
So it can be more along the lines of: we want access to this data, or we want to produce data, and you never really need to deal with setting up a NATS connection or even understanding the specific subject you wanna publish on. We have tools where you say: this is the data I'm working with, here is where I wanna put it, here is where I wanna consume from. That makes it even easier for developers to work with, and it helps us maintain standards. Because one of the big lessons we've learned with using NATS is that the shape of the data, whether that be subjects or payloads, is very, very important, and you need to be very intentional about how you set that up.
So we're trying to make that as easy as possible for our development teams to take it and run with it, and not have to spend much time in some of those trickier areas.
Jochen Rau:
Another example of where we interface with existing architecture, with existing infrastructure, was the Kinesis streams that we have, like our processing stream. We basically take the end of that Kinesis stream and copy it into NATS. Right? So the whole stream is mainly consumed through NATS now, and that actually enables us to produce machine data directly into NATS on the leaf node, bypassing the whole Kinesis pipeline. And now we have a migration path.
We can say more and more machine data, more and more specific data, is on NATS then. It gives us this nice way to gradually switch over. So we now really have more freedom, and one of our initiatives in the future will be to move our processing that is totally cloud-based, the contextualization, more onto the edge device. That was the goal, and it's what NATS allows us to do.
Andrew Connolly:
Very nice. Can you say anything more about the motivation for wanting to do more processing at the edge, and maybe what's changed that is making it more of a reality to do it at the edge?
Jochen Rau:
Yeah. So, for example, tool monitoring. Usually, what we ingest is data at about one hertz, one data point per second per metric, roughly, depending on the machine and the control. But for tool monitoring, you need about a kilohertz of data.
And given the bandwidth that factory floors have now, you usually can't move that into the cloud. And it's not necessary. Right? Because if you have the compute on the edge device, it can look at that data, process it in real time with low latency, and then just send the alert up into the cloud. That's plenty. Right?
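The gap between those two sample rates is easy to put numbers on. A back-of-envelope sketch, assuming a round 100 bytes per data point (that payload size is my assumption, not a MachineMetrics figure):

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_POINT = 100  # assumed: timestamp + value + message envelope

def daily_megabytes(sample_rate_hz: float) -> float:
    """Raw data volume for one metric on one machine over 24 hours, in MB."""
    return sample_rate_hz * BYTES_PER_POINT * SECONDS_PER_DAY / 1_000_000

cloud_ingest = daily_megabytes(1)      # 1 Hz: ~8.6 MB/day, fine to ship upstream
edge_ingest = daily_megabytes(1_000)   # 1 kHz: ~8.6 GB/day, process locally
```

At one hertz a metric costs megabytes per day; at a kilohertz it is gigabytes per day, per metric, per machine, which is why only the alert, not the raw stream, goes up to the cloud.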
And so we moved that. That was a requirement more on the data side, on the data velocity side. Right? But it's also resource usage. There's a lot of untapped resource on the edge devices.
They're not high-powered edge devices, but we can still move a lot of the computing that we'd usually have in a service, in one pod running on Kubernetes. Right? We can distribute that onto the edge devices at no extra cost. So it's also about using the compute resources that are out in the field more, and reducing cost in the cloud.
That's another capability. And then it's also having individual, customer-specific deployments that can be deployed to an edge device more easily. So the important part is not that we wanna move a lot of compute to the edge. It's the freedom that we don't have to: if we develop a service, we don't have to think first and foremost about where it's gonna end up, on the edge or in the cloud.
It's just constrained by resources.
Andrew Connolly:
Do you guys struggle or face challenges around connectivity at all? Are there situations where edge devices are intermittently connected or disconnected, and how does that work with your architecture?
Tyler Schoppe:
Yeah. I mean, we see basically every extreme, right, from customers who have great network setups to those who have very poor network setups where the bandwidth is super low, in troubleshooting scenarios or with local environmental things. Right? Like we've seen with some of the fires in California or flooding in the Carolinas.
Shops where, all of a sudden, boom, the power goes out, and then when it comes back on, who knows if the network is available? And that's really starting to drive a big push. We currently don't use JetStream on our edge devices, but that's something we want to, and we're planning the work right now to get that going, so that we have that persistence layer and the edge device can continue to capture data and keep going, business as usual. And that, again, is a great use case for compute on the edge: in the case that you don't have that connection back up to the cloud, some features can still be driven just within your local area.
Jochen Rau:
Yeah. And, traditionally, we have SQLite running on the edge device that buffers data. Right? But that also introduces a technology break.
You put the data in there, you take it out. You have to serialize it, deserialize it, and then put it up into the cloud. So it's just a buffering thing. With JetStream on the edge device, it's the same technology.
You just write it to NATS, and it's gonna be persisted until the device is reconnected. Right? And then it's gonna be synchronized up into the cloud. So there's no technology break anymore, and that reduces complexity. Right?
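The store-and-forward buffer Jochen describes can be sketched like this; the schema and names are invented, but it shows the serialize-in, deserialize-out round trip that makes a SQLite buffer a separate technology from the transport:

```python
import json
import sqlite3

class EdgeBuffer:
    """Illustrative store-and-forward buffer like the SQLite layer
    described above: readings are serialized on the way in,
    deserialized on the way out, and removed once drained for
    upload. (Class name and schema are made up for this sketch.)"""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS buf (id INTEGER PRIMARY KEY, payload TEXT)"
        )

    def store(self, reading: dict):
        # Serialize: the "technology break" on the way in.
        self.db.execute("INSERT INTO buf (payload) VALUES (?)",
                        (json.dumps(reading),))
        self.db.commit()

    def drain(self):
        # Deserialize and clear: the break on the way back out.
        rows = self.db.execute("SELECT id, payload FROM buf ORDER BY id").fetchall()
        readings = [json.loads(p) for _, p in rows]
        self.db.executemany("DELETE FROM buf WHERE id = ?", [(i,) for i, _ in rows])
        self.db.commit()
        return readings

buf = EdgeBuffer()
buf.store({"metric": "spindle_load", "value": 0.42})
buf.store({"metric": "spindle_load", "value": 0.43})
recovered = buf.drain()  # what would be pushed to the cloud on reconnect
```

With JetStream on the edge, the same write that serves local consumers is also the durable buffer, and the stream is replayed upstream on reconnect, so this translation layer disappears.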
Andrew Connolly:
Mhmm. Yeah. I was gonna ask a question around operator experience and developer experience with regard to NATS. How has NATS changed those experiences? Is there something that stands out as the biggest quality-of-life improvement for you and your teams?
Something that has made your day-to-day easier or different?
Jochen Rau:
I'll let Tyler speak to that, but I'd say it actually made it possible to pull off some of the things we do, with the ease of use and operation of NATS, with a small team. I worked with Kafka in the past, and you have to have a lot more resources to actually operate that system, just on the operational side. With NATS, it was very easy to get into. Vanilla NATS, especially.
Right? You basically just start it, the pod runs, you can connect to it, and it works. Right? So the operational overhead is so minimal for vanilla NATS. For JetStream it's a little bit different, depending on how complex the setup is.
But vanilla NATS is so easy to set up and operate. We couldn't have done it with a different technology.
Tyler Schoppe:
Yeah. You definitely hit the nail on the head with operating NATS. I mean, it's so easy to get up and going when you run core NATS; it probably couldn't be much simpler. Getting into JetStream, there's a little bit more that you need to think about, of course.
But when you compare it to some of those other systems, like you mentioned with Kafka, it's considerably easier to manage from that standpoint. In terms of developer quality of life, once you learn the tools and understand some of the development patterns, it just unlocks so much potential. And, again, using that same tool to do everything feels so familiar. It's second nature: oh, I know exactly how I'm gonna build this.
You just pull up a stream, have a shared consumer to handle a scaled workload, add in some key-values to distribute configuration or do some short-term caching. And the architectural possibilities that we now have have really changed. Our traditional stack is very directional: from the edge to the cloud through some Kinesis streams, different layers of processing, but it's all data headed in one direction.
Whereas now we really do have this ubiquitous data plane: where you produce, where you consume, any type of compute can live really wherever. Which, like you mentioned with tool monitoring, opened up a new possibility for something we could do. But that doesn't mean the compute that's done on the edge can't be done in the cloud. Right? We can put it where we need it, depending on the scenario.
Andrew Connolly:
Okay. Sounds good. Maybe a different topic, just to scratch some new surface area. Are you guys using NATS at all for anything front end: dashboarding, WebSockets, anything like that?
Tyler Schoppe:
Yeah. That's actually part of some of the work we're doing with the developer libraries we're putting out, to make that stuff more accessible for the client side. We have a little bit of work to do on our end with our account structure to enable JetStream access. But, yes, we issue short-term NATS credentials through our APIs so that clients, as well as edge devices, can have location-specific credentials to consume their data. That's certainly a huge use case that our internal teams are really clamoring for as we start to build some smaller, micro front ends outside of our typical UI stack. There's so much data that we have on NATS, especially utilizing key-values, where they don't have to go through our traditional stack of a React front end going back to our Node back end through an API layer, your kind of traditional MVC architectural pattern.
We can skip that middleman and have access to so much data, so we can just show customers things and build features that really live powered by NATS. So it's definitely a huge use case that we're looking into now.
Andrew Connolly:
So, just so I'm following: is that front ends for customers consuming MachineMetrics data and analytics, or internal teams wanting to consume NATS data?
Tyler Schoppe:
Internal teams that wanna build things for customers is what we ideally want. And as an extension of that, we have some customers who are power users and want access to their data, where we've started to provide them with data via MQTT. But there's really no reason we can't just issue a user NATS credentials for their account and say: the sky's the limit. This is your data.
Do whatever you wanna build on top of it. Because we do see that so many of our customers wanna build something specific just for them, and they have development resources, and there's no reason we should tell them no. Really, we wanna be able to say: here's all your data, build what you want with it.
Jochen Rau:
We have a handful of customers that consume MQTT data directly from NATS, basically. And that's also a really nice feature, because most of the IIoT space is dominated by MQTT. Right? So having that basically out of the box, for free, is incredible. Right?
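NATS gets this MQTT support by mapping topics onto subjects. Roughly: the `/` level separator becomes the `.` token separator, the `+` single-level wildcard becomes `*`, and the `#` multi-level wildcard becomes `>`. A simplified sketch of that mapping for subscription filters (real NATS also handles edge cases, such as topic levels that themselves contain `.`; the factory topic names below are made up):

```python
def mqtt_filter_to_subject(topic_filter: str) -> str:
    """Sketch of how NATS's built-in MQTT support maps an MQTT
    subscription filter onto a NATS subject: '/' -> '.', '+' -> '*',
    '#' -> '>'. Simplified for illustration."""
    tokens = []
    for level in topic_filter.split("/"):
        if level == "+":
            tokens.append("*")    # single-level wildcard
        elif level == "#":
            tokens.append(">")    # multi-level (terminal) wildcard
        else:
            tokens.append(level)
    return ".".join(tokens)

# An MQTT client asking for every machine's metrics ends up reading
# the same subject space the NATS side publishes on.
subject = mqtt_filter_to_subject("factory/machine7/+/metrics")
```

This is why the feature comes "for free": an MQTT subscription is just another view onto subjects that NATS publishers are already using.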
Andrew Connolly:
Yeah. That's an overlooked and perhaps under-marketed feature that we're working to say more about: NATS as an MQTT broker. What has NATS unlocked for your product or platform roadmap, something coming in the future that's only possible because of NATS?
Tyler Schoppe:
When we talk about distributing compute, and just what's opened up using NATS as a transport layer, it's not just data. Take, say, a WASM file. That's something that can be sent over the network. And now we're talking about pieces of code that can just be brought down to an edge device, something small. So rather than, say, an OCI pull from AWS ECR to bring down a massive Docker image, you have a piece of code that could live in a NATS object store and then be distributed everywhere automatically, down to edge devices. So you need this piece of compute there, say for tool monitoring, and boom.
It's right there. Something that's doable even in a low-network scenario, because we're talking kilobytes, not a hundred megabytes like a Docker image. So, yeah, it's just so cool.
Andrew Connolly:
Is there anything in particular about NATS that you see as the standout capability? Something that defines NATS for you, or that you couldn't live without, that gives it its unique magic?
Jochen Rau:
I have something a little bit obscure, which is that NATS, I think, is the only technology in that space that really decouples the addressing of data from the access point of data. If you think about a classic RESTful API endpoint, the identifier, where you say this is the data I want, is the same thing as the access point. Right? There needs to be a server behind the URL, and the URL itself is a URI. Right?
It's an identifier for the data that you want. Right? And that has led to all kinds of conflicts, I would say, or suboptimal architectures. Right? With NATS, you have that separated.
You give it a set of server connection URLs, and it can round-robin between them; it can gossip which servers you can connect to. So that's very resilient. And then the subjects are actually addressing the subset of data that you wanna look at, right, that you wanna consume. And that separation is an absolutely, I think, unique and powerful feature in NATS that is underappreciated.
And once you use it, it leads to a very free architecture and high resiliency. Right?
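Concretely, the decoupling lives in subject matching: a consumer names the data it wants with a pattern, independent of which server it happens to be connected to. A minimal sketch of the NATS subject-matching rules, where `*` matches exactly one token and `>` matches one or more trailing tokens (the machine subjects below are made up):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS subject matching: tokens are '.'-separated, '*' matches
    exactly one token, and '>' matches one or more trailing tokens."""
    pt, st = pattern.split("."), subject.split(".")
    for i, tok in enumerate(pt):
        if tok == ">":
            return i < len(st)        # '>' needs at least one remaining token
        if i >= len(st):
            return False              # subject ran out of tokens
        if tok != "*" and tok != st[i]:
            return False              # literal token mismatch
    return len(pt) == len(st)         # no wildcard left; lengths must agree

# Any consumer, anywhere in the topology, can ask for "telemetry from
# any machine" without knowing which edge device produces it.
assert subject_matches("machines.*.telemetry", "machines.m7.telemetry")
assert subject_matches("machines.>", "machines.m7.telemetry.spindle")
```

The pattern is the identifier for the data; the connection URLs are only the access point, which is exactly the separation Jochen is pointing at.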
Tyler Schoppe:
Yeah. Something along those lines probably would have been what I would pick. I mentioned it a little earlier, just pushing data in and pulling it out from somewhere. But I think, alternatively, the decentralized auth model has provided us with just a whole mess of benefits. It's certainly a topic to spend some time making sure that you understand thoroughly, but it really just unlocks so much.
I mean, for us to be able to have that separation of customers' data, so an account per customer, but then be able to map things back into an account that we own that has access to all the data; how easy it is to manage, with scoped signing keys to then issue users. You wanna revoke something? It's really just as easy as making an administrative change, and you see things happen downstream. It's a really, really powerful tool to have at our fingertips.
Andrew Connolly:
Very cool.
Jochen Rau:
Mhmm.
Andrew Connolly:
My final question. You highlighted decentralized auth as a perhaps tricky and carefully configured bit of NATS. How have you found the NATS community and ecosystem for getting into these deeper topics, and how has partnering with Synadia been beneficial in getting NATS to prod and getting NATS to scale faster?
Tyler Schoppe:
Yeah. I mean, within the NATS community there are a lot of great resources. The Slack channel can be a good place to ask some quick questions. And the Rethink Connectivity YouTube series is a place that I have referred to many times, and still do, just for quick refreshers, especially on decentralized auth. The video there is fantastic for helping you get started.
And that's only gotten better with our partnership with Synadia. The team of solution architects is, of course, unbelievably knowledgeable, and they're just great people to work with, super quick to respond to a Slack message with some great insights. And they've already been very helpful for us, as reworking some of our auth model is a big project that we're working on right now.
Andrew Connolly:
Fantastic. Well, Tyler, Jochen, thanks so much for your time today. I really enjoyed learning about NATS at MachineMetrics. It sounds like you guys are using NATS really to the fullest, and I look forward to checking in in the future and seeing how JetStream at the edge is going for MachineMetrics. Thanks so much.
Jochen Rau:
Yeah. Thanks for having us.