
Day Two Cloud 173: Istio Ambient Mesh Minimizes Sidecar Proxies


Today on Day Two Cloud we examine Istio Ambient Mesh, a new option for building service meshes in a microservices environment. Istio Ambient Mesh essentially brings the concept of a load balancer to a cluster of containers. Rather than run a sidecar proxy for each pod or container, you can run Ambient Mesh per node. Our guest and guide to this open source project is Christian Posta, Global Field CTO at Solo.io.

We discuss:

  • Differences between Istio and Istio Ambient Mesh
  • Drawbacks of the sidecar proxy model
  • The architecture of Istio Ambient Mesh
  • Security and network processing
  • Concerns such as noisy neighbors and latency
  • Hybrid deployments with and without sidecars
  • More

Sponsor: Kolide

Kolide is an endpoint security solution that helps your end users solve security problems themselves. They get smarter about security and you get more compliant computing. Find out more at kolide.com/daytwocloud.

Show Links:

Introducing Istio Ambient Mesh – solo.io

Introducing Ambient Mesh – istio.io

Get Started with Istio Ambient Mesh – istio.io

Ambient Mesh Security Deep Dive – istio.io

Blog.christianposta.com – Christian’s blog

@christianposta – Christian on Twitter

Transcript:

 

[00:00:00] Ethan: Sponsor Kolide is an endpoint security solution that helps your end users solve their security problems themselves. They get smarter about security and you get more compliant computing. Find out more at kolide.com/daytwocloud. That's kolide.com/daytwocloud.

[00:00:25] Ned: Welcome to Day Two Cloud. Today we are delving back into the world of service mesh, and this time we are exploring Istio and its ambient mesh. What does that mean? Well, good thing we have somebody very awesome to help guide us through the process. It's Christian Posta, the Global Field CTO from Solo.io. Ethan, what jumped out at you about the conversation?

[00:00:50] Ethan: That although we get our heads all wrapped around terminology, what is a sidecar, how do proxies work in the Kubernetes world, at the end of the day, folks that have been around the industry for a while know how load balancers work. If you start with that as your architectural grounding, you can map that knowledge onto this world. And even more so with ambient, because we're kind of killing the sidecar, Ned.

[00:01:12] Ned: Yeah, it might be that its death is nigh, and that's what we're leaning into in this episode. So enjoy our episode with Christian Posta. Christian, welcome to the show. We're excited to have you here to chat about Istio Ambient Mesh, which I have to admit, first, props, that's a pretty good name. I don't know how much input you had in it, but I know naming is hard, man, and you nailed it with ambient mesh. Can you give us the 10,000 foot view of what Istio Ambient Mesh is all about?

[00:01:45] Christian: Yeah, absolutely. And first of all, thank you for having me. Happy to talk about this. I've been working on service mesh for quite a while, and this is sort of the next phase or next evolution, especially what we're doing in the Istio community. So ambient mesh is an optional new data plane for Istio that doesn't require using a sidecar deployment to get the benefits and the functionality of the Istio service mesh. Now, there's a few reasons for that, and we can go into those reasons. But it is interesting, because I've been a part of Istio since the very beginning and we've seen its adoption and use in various organizations, cloud native modernization efforts and so on. And even so, I feel, and we're seeing at least on the Solo side, Istio is probably the most widely deployed service mesh out there, especially at scale, in large deployments. There's still more to go, there's still a lot of opportunity for people to take advantage of this technology. And that's a big reason why we built ambient: there's still a lot more opportunity for people to use a mesh.

[00:03:16] Ned: Forgive me, because I don't know the complete history of Istio; I kind of jumped on a little bit later. Was it always a sidecar deployment from the beginning? Was that the original model?

[00:03:28] Christian: Yes, from the very beginning it was a Kubernetes first type of deployment. So it started with the assumptions that you see in Kubernetes. As a sidecar, it would deploy an Envoy based proxy, which is basically just another container that complements the main workload container. In a Kubernetes world you can deploy a pod, which is a way of scheduling multiple containers together. So you can schedule your workload and always schedule this sidecar proxy next to it, which can then enrich the networking on behalf of the application workload.
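For readers picturing the sidecar model: after injection, a pod spec ends up with two containers, the workload and the Envoy proxy. A minimal sketch of what the injected result roughly looks like; the names and image tags here are illustrative, not taken from the episode:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                 # hypothetical workload name
  labels:
    app: my-app
spec:
  containers:
    - name: my-app             # the main workload container
      image: example/my-app:1.0          # illustrative image
      ports:
        - containerPort: 8080
    - name: istio-proxy        # the Envoy sidecar Istio injects alongside it
      image: docker.io/istio/proxyv2:1.15.0   # version is illustrative
      # init containers and iptables rules that redirect the pod's traffic
      # through this proxy are also injected (omitted here for brevity)
```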

[00:04:23] Ned: Okay, so breaking things apart a little bit, you talked about the data plane and the control plane. The Envoy sidecars, those are the data plane. They're sitting inline, the data is passing through them. And then there's the Istio control plane that's kind of sitting over it, making sure those sidecars are provisioned and that they have the right rule sets. I don't know if I'm using the right terminology, but something like that.

[00:04:44] Christian: Yeah, exactly. The sidecar is the data plane. That's where the networking, the traffic, the requests from the applications are flowing. When an application talks to another application, it first goes through its local sidecar proxy, or data plane. And the data plane might do things like implement a request timeout, so this request goes out over the wire, it shouldn't take more than this number of milliseconds. Or it will do things like request level load balancing, so it'll talk to backend endpoints and load balance accordingly. It does what we call application networking on behalf of the application, so the application doesn't have to do it. And then the control plane is taking what the end users, either developers or platform owners, people who want to describe what the networking policy should be, what the behavior for these applications on the network should be, express in some sort of declarative configuration. In Istio, that's YAML, blobs of YAML. The control plane will convert that into configuration for the data plane. Specifically, it will convert it into Envoy proxy configuration and deliver that dynamically to each of the sidecar Envoy proxies that run in the mesh.
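As a concrete example of those blobs of YAML, a request timeout and retry policy like the ones Christian describes might look roughly like this; the `reviews` service name is invented for illustration:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews              # hypothetical service
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
      timeout: 500ms         # "shouldn't take more than this number of milliseconds"
      retries:
        attempts: 3
        retryOn: 5xx         # retry the request on server errors
```

The control plane translates this declarative intent into Envoy proxy configuration and pushes it to the data plane dynamically.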

[00:06:20] Ethan: So, trying to distinguish between ambient mesh and traditional service mesh, there's a few points that you made. One, you're trying to encourage adoption by a wider audience, is kind of how it sounds. There are a lot of people out there that could still benefit from a service mesh sort of architecture, and ambient is going to help them do that. So what are we saying? We're not going to do sidecars with the pod model? Are we stepping back from Kubernetes as a deployment, but still doing some fancy way of running proxies? What does the architecture look like now?

[00:06:53] Christian: Let's look at it from the Kubernetes standpoint first. If you want to deploy your application, you deploy your application. You don't have to worry about, all right, well, here's my application, I have to wrap it in all this other stuff, this other YAML or setup, because Kubernetes can do automatic mutation of a deployment. You deploy something and it can add stuff into it. In Istio, we have a sidecar injector: you deploy an application, and it will automatically inject the sidecar, which then changes your application. That application went through CI/CD, and now when it gets deployed into the runtime environment, it changes a little bit. It might not be that much, but it does change. And when you make changes, you're introducing some risk. So can we minimize that change? Can we get the capabilities of the service mesh, the load balancing, the mutual TLS, the telemetry collection, et cetera, without having to muck around with the applications themselves? And that's the ambient part. It's just kind of floating out there somewhere. You can still get those benefits, but without perturbing or messing with the applications themselves.

[00:08:27] Christian: So that’s kind of the foundation for that.

[00:08:30] Ethan: So we've still got a sidecar conceptually? Well, what do we have? Because we're saying it's not a sidecar.

[00:08:36] Christian: The proxies still need to exist somewhere, right? And they still provide the functionality and the capabilities of the mesh. Because if you look at Istio, Envoy proxy is the core engine; whatever happens on the network comes from Envoy. So that still lives somewhere. But from an application standpoint, you don't see it. You don't need to know about it. You don't have to change your deployment and deploy these platform components along with your application to get this capability. You just deploy your application.

[00:09:17] Ethan: Well, it feels like an old school application load balancer architecture then, where you'd build an ALB, you'd have virtual IPs with a bunch of functionality baked into them, and then behind them would be pools of application servers that would serve up the data.

[00:09:38] Christian: Yes. To the application, it looks like that.

[00:09:49] Ned: Yeah. And just to clarify a little bit, because I think this is sort of a nuanced point you're making: when you're deploying in the sidecar model, that sidecar is part of your deployment manifest and the way that it's configured. So you've got to get that right as the deployer, whoever you are, maybe you're part of the application team. Instead, this decouples things a little bit and says, no, I just want to be able to deploy my application and then have the service mesh transparently take care of whatever the additional functionality is. And if I need to make a change to the service mesh, I'm not altering my application deployment, I'm altering service mesh configuration.

[00:10:32] Christian: Exactly.

[00:10:34] Ned: That's nuanced, and it's something I wouldn't have thought of, but I'm assuming, especially for large customers that are running at scale, that invasiveness of having the two tightly coupled together becomes a real problem.

[00:10:50] Christian: It becomes painful, yeah. Especially as you continue to add more to the mesh. And it's that day two, right? That lifecycle. How do you patch the mesh when there are CVEs found? I mean, generally Istio releases once every quarter, and that's actually quite quick for a lot of these organizations to adopt changes. There's a lot of that: we've deployed the mesh, we've gotten it running, now we've got to keep it running, and how do we alleviate that pain? And so that's also an avenue of making it possible for people to run the service mesh, because the sidecar model is certainly painful at scale. It doesn't have to be that way, I guess, is a different way of saying it. We want these capabilities, but up until now there's been the sidecar. I heard somebody, who was it, maybe it was Matt Klein, say it was an inconvenient thing that we happened to do it this way, but it doesn't have to be that way. Ambient mesh shows that it doesn't have to be that way.

[00:12:07] Ned: Right. And if I want, like you said, if I have to upgrade my version of Envoy and all these sidecars, I have to redeploy each pod.

[00:12:15] Christian: Exactly.

[00:12:16] Ned: The pods are immutable in a sense. So yes, that's a lot of rolling changes happening. Not because something changed in my application, but because something changed about the mesh.

[00:12:29] Christian: Yeah.

[00:12:30] Ned: Got you. Okay, so I think now we're ready to get down into a little more of the meat of it. What does ambient mesh do that's different from the sidecars, from a deployment perspective?

[00:12:43] Christian: The first thing I want to point out is that we're at an early point in ambient mesh's life. We just announced it a few weeks ago, and there are still some known gaps between what ambient does at this point in time and what Istio does with the sidecar. Solo and Google worked on this together and announced it and opened it up to the rest of the community a few weeks ago, like I said. We worked on this very closely, so we know what those gaps are. We are working with some early adopter customers of ours who will also continue to push us to close those gaps. Where we are today is not where we'll be in a few months, let's say. On the capabilities of the mesh, we are shooting for parity. We are continuing to harden and optimize the different parts of the deployment as it looks today, and that's going to continue to happen.

[00:14:00] Ned: Right. It's constantly evolving, and like you said, it was introduced a few weeks ago as of this recording. So when this goes out, it might have been out for a few months, but still, that's very young in the product lifecycle.

[00:14:15] Christian: Yes, it's young, but since we're working on it and I can see what's happening, I'm very confident that we'll get this to a good spot pretty soon.

[00:14:31] Ethan: We're taking a short break from the podcast to tell you about sponsor Kolide. Kolide is an endpoint security solution, and they use a resource that most of us in IT would never really think about: the end users. Because end users are where problems start, right, not solutions. Well, Kolide challenges that thinking, because if you can leverage your end users to mitigate the security issues that they are carrying around in their backpacks, that is a huge win. Now, let's say you're doing your device management the traditional way with an MDM. Well, you know the joy of loading agents onto employee devices. Agents impact performance, and they can be a privacy horror show, privacy being a thing all your users know about now. So Kolide does things differently. Instead of forcing changes on your users, Kolide notifies folks via Slack when their devices are insecure, and then provides step by step instructions on how to solve the problem. Using this approach, the interaction feels more friendly, more educational, more inclusive, and less intrusive, because now IT isn't doing something to your device. Instead, you're working with IT to help keep the company secure. It's the whole attitude of we're all in this together. And as IT, you still get the views you need into the managed device fleet.

[00:15:46] Ethan: Kolide provides a single dashboard that lets you monitor the security of everything, whether the endpoints are running Mac, Windows, or Linux, so you can easily demonstrate compliance to your auditors, customers, and the C suite. Give Kolide a shot to meet your compliance goals by putting users first. Visit kolide.com/daytwocloud to find out how, and if you visit kolide.com/daytwocloud, they're going to send you a goodie bag, including a T-shirt, just for activating a free trial. That is kolide.com/daytwocloud. And now back to today's episode.

[00:16:23] Ethan: Tell us about the architecture of the ambient mesh, because it sounds like, if we don't have feature parity today, you didn't just pluck it out of the sidecar model and redeploy it in a different model and it was easy. It sounds like there's more to it than that, so talk us through the architecture.

[00:16:36] Christian: So we wrote a blog about this back in December 2021 that discussed some of the trade offs that you have to make if you pull the sidecar out: what can you gain from that, what might you lose from that? And so ambient represents what we think is the right balance for what service mesh users are looking for. And a lot of what they're looking for starts with, hey, I want to be able to get mutual TLS and apply some network level policy about what services can or cannot communicate with each other. That's looking at it from a security standpoint, or from compliance fulfillment: how do we implement some of the requirements around compliance that a service mesh can fulfill? And then from there it's traffic routing, observability. These are all really interesting use cases that people might adopt, but for the most part it seems the security aspects are top of mind. And so the first thing we looked at with the data plane when we pulled the sidecars out: we've got this proxy that, in the sidecar mode, could do everything. It can do mutual TLS, it can do header based traffic routing, splitting between multiple versions of services, load balancing, and a lot more.

[00:18:11] Christian: The sidecar in the mesh was responsible for everything that a data plane can do. The first thing we did when we pulled that out is we said, we don't really want to share a single one of these do-everything proxies across all of the workloads. I think you said you had William Morgan on a previous show, the Linkerd folks. The original Linkerd, I think it was 1.x, was exactly this type of proxy. It was a shared, do everything for everyone type of proxy that ran one per host, or one per node. And doing that, you run into all kinds of issues: when you try to share layer seven policy across all these applications, you start to see collisions, noisy neighbor problems, security boundary problems, and so on. So we knew we didn't want to do that. What we did instead was split up the capabilities of the proxy into multiple layers. We said, instead of having a proxy that can do everything, why don't we have a layer that, it ends up being layer four, but really it's just focused on implementing mutual TLS and network level authorization policies, without everything else.

[00:19:47] Christian: Let's start there. Start at that foundational layer, and then we can build the rest of the stuff on top of it.

[00:19:54] Ethan: So this would be a proxy instance running that is handling just this, and if you need more functionality, then we could chain over to another proxy?

[00:20:05] Christian: Exactly. So ambient does share, I'll call it a proxy for now, but it doesn't have to be a proxy, and I'll explain that in a second. There is a proxy, an agent, I would call it a secure overlay agent, that runs one per node, and all it is responsible for is opening connections and assigning certificates to those connections so that you can get mutual TLS. And then from there you can write policies about what workloads are allowed to talk to what other workloads based on that cryptographic identity, which is what the sidecar does today. But we move that into its own layer, which is deployed as an agent that runs colocated, one per node. That's sort of the foundational layer of ambient mesh.
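The identity-based policies this layer enforces are expressed declaratively. A sketch using Istio's standard PeerAuthentication and AuthorizationPolicy APIs; the namespace, labels, and service account names are invented for illustration:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: prod              # hypothetical namespace
spec:
  mtls:
    mode: STRICT               # require mutual TLS for all workloads here
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-frontend
  namespace: prod
spec:
  selector:
    matchLabels:
      app: backend             # policy applies to the backend workloads
  action: ALLOW
  rules:
    - from:
        - source:
            # the cryptographic (SPIFFE) identity of the allowed caller
            principals: ["cluster.local/ns/prod/sa/frontend"]
```

In ambient, the per-node secure overlay agents enforce this at layer four, where sidecars enforce it today.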

[00:21:09] Ned: Okay, so that's strictly, like you said, layer four type traffic. It's making decisions regarding that, and I guess integrating with whatever container networking interface you're choosing to use.

[00:21:21] Christian: Yeah, exactly. So that's where things like the CNI and what this agent is doing start to blur the lines a little bit.

[00:21:32] Ned: Yeah, I was going to say, you mentioned you can proxy without a proxy. And it seems like, depending on what type of CNI you're using, maybe that could fulfill that functionality.

[00:21:44] Christian: Maybe the CNI can help out. For example, today, at the time of recording and at the time of release of ambient mesh, this agent component does happen to be Envoy, but we only use the layer four capabilities of Envoy. It doesn't have to be Envoy, though, and it probably, or maybe, won't be Envoy going forward. We are experimenting, we are looking at a few different options. For example, in the Istio community, we're looking at maybe a purpose built component that does just these three things: mutual TLS, telemetry collection, and some basic layer four load balancing. Just those three things. Could that be done in a dedicated component that's built in Rust? Could that be done with a dedicated implementation in eBPF? At least here at Solo, we're very interested in the Cilium project. Could we have Cilium take care of a lot of those pieces? So it happens to be Envoy today, but that may or may not be the best solution going forward.

[00:23:00] Ned: Okay, in terms of benefits, moving the traffic down to being processed at the node instead of the pod, what are some of the benefits that you're seeing over doing it in a sidecar?

[00:23:14] Christian: Well, the first being you don't have to inject a sidecar. The applications don't even know that it's there. If that's the case, then if I need to upgrade any of the components in the mesh, the application doesn't know about it. Some of the secondary benefits are, if you're not running all these sidecars, you don't have to pre provision or account for the memory and CPU resource overhead that the sidecars might end up using, and so you save on some of those costs ahead of time. A third thing is security. Hopefully your applications aren't compromised, hopefully you have continuous scanning and all that stuff, but wherever there's a lot of code, and complex code, there's potential for vulnerabilities and so on. In the sidecar model, if your application somehow is compromised, the sidecar will be exposed to that. So potentially you could take over the cryptographic material that's in the sidecar, or the tokens that are used to get that material, and then do things, let's say, from there. In the ambient mode, if you compromise the application, you don't get access to the service mesh proxies; they're not running in the same pod.

[00:25:03] Christian: So we pulled out any of the platform pieces that may have sensitive material, key material, certificates, whatever, and separated that out from the applications.
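To make the resource overhead point concrete: in the sidecar model, each pod typically reserves CPU and memory for its proxy, via mesh-wide defaults or per-pod annotations, multiplied across every replica. A sketch using Istio's per-pod sidecar resource annotations; the workload name, image, and values are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical workload
spec:
  replicas: 100
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        sidecar.istio.io/proxyCPU: 100m      # reserved per sidecar...
        sidecar.istio.io/proxyMemory: 128Mi  # ...times 100 replicas
    spec:
      containers:
        - name: my-app
          image: example/my-app:1.0          # illustrative image
```

With ambient's shared per-node agents, that per-replica reservation goes away.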

[00:25:19] Ned: Okay, that does sound like a lot of benefits. And I remember one concern that I heard previously about implementing Envoy at the node layer: it was never intended for this sort of multitenant configuration. So there were concerns about dealing with noisy neighbors or cross traffic inspection or something along those lines, just because of the way that it was originally conceived and constructed. So how are you dealing with those security concerns at the node level?

[00:25:55] Christian: So the concern around something like Envoy, the shared node architecture and multitenancy, noisy neighbors, all that stuff, really stems from the layer seven capabilities it could potentially be doing, and the fact that each application will want to configure its layer seven processing and handling differently. Trying to understand, and reason about, and even assign resources to a component that can be so divergent and wildly different is hard. Maybe somebody wants to inject, because Envoy can do this, a WebAssembly plugin, which is a custom plugin that operates on the request, and let's say that plugin kills Envoy or does something that will affect everything else running on that proxy. And so from a tenancy standpoint, at layer seven, that's just a problem waiting to happen.

[00:27:05] Ned: Okay, that makes sense. What about the noisy neighbor issue? If you have pods in a particular namespace that are just sending a ridiculous amount of network traffic, can you throttle them by namespace or something like that, to make sure they don't overwhelm everything else that's running on that node?

[00:27:23] Christian: Well, the concept of the namespace and the node are kind of separate. But let's say you have 100 pods running on a single node and those pods are sending out a bunch of traffic. If the question is, how will that affect the proxy or this agent running locally on that host, the answer is it would. But it wouldn't be much different from any of the other layer three, layer four components on the data path for that node, right? You also have Linux there, and file descriptors, and sockets, and connection tracking, all this other stuff that happens at layer four. If you take those into account and size them appropriately, then we should be able to do the same thing with the layer four proxies that run on that node. Because then it just comes down to, what's the number of connections, what's the traffic, what's the number of bytes flowing through? Instead of, well, I need to parse this stream, I need to save this in memory, I need to set retries to this, and all of this complicated stuff. Like I said, that can add cycles, that can add differences in behavior, and it's a lot more difficult to reason about that layer seven stuff versus, hey, we're opening a socket and pushing bytes through.

[00:28:51] Ethan: Stupid basic architectural question here, Christian, but it does feel like we're still talking about all this in the context of a Kubernetes cluster. It's just that we're taking these components and running them differently; the sidecar is no longer in the same pod, it's separate. Okay, so I'm right, we're still in the cluster then. You made a comment earlier that we don't have to worry about resources and noisy neighbors and all that stuff. But we do still, right? There's still a concern. We're still placing load on the cluster, and we're still handling thousands upon thousands of connections and state tracking and all of that stuff. So what's the difference here, as far as resource management goes, that we're getting a win?

[00:29:35] Christian: I mentioned that in the sidecar mode, you not only have the connections that the applications are making, but you also have the configuration that each of the proxies needs from the control plane. There's traffic going back and forth between the sidecars and the control plane, right? Now, if you have a fewer total number of proxies, you have fewer of those connections. You also have the memory that each of the sidecars needs to hold its configuration, the load balancing configuration; the Envoy config can be pretty verbose. All of that can be tuned, but you need to provision for it ahead of time to be able to run the proxy correctly.

[00:30:36] Ethan: But you're getting that back, because you're not running nearly as many instances of the proxy. You're getting back all of that baseline that you would have had to allocate per instance, because we're running, I assume, far fewer. It sounds like we're talking one proxy per layer, or are we talking multiple proxies per layer, depending on load?

[00:30:54] Christian: The layer that we've been talking about so far, up until this point, has been the layer four part, which in the Istio ambient parlance is called the secure overlay layer. That's the layer we've been talking about. And we can talk in a second about what layer seven looks like, because that will involve the layer seven proxies. But still, if you think, well, I have 100 instances of service A, I'm going to need to make sure that when I schedule that in Kubernetes, or deploy it anywhere, I have the resources for 100 sidecars to be able to run also. Whereas if you just need mutual TLS and you just deploy the ambient secure overlay layer, which is made up of these layer four agents, then you'll have one agent per node, not one proxy per instance deployed across multiple nodes.

[00:31:50] Ned: Okay, that could be a pretty significant win, because I know usually when you're specking out how big the sidecar has to be, you have to plan for the worst case scenario.

[00:31:58] Christian: Exactly. Yeah, exactly.

[00:32:01] Ethan: Well, you said we were going to talk about layer seven. I think the time is now. Christian, take us up to layer seven processing and explain how that works.

[00:32:09] Christian: Yeah, in the ambient data plane for layer seven, what we do is we deploy what are called waypoint proxies, and a waypoint proxy is deployed, or can be deployed, one per service account. In Istio, in Kubernetes, we use service accounts to specify identity for workload applications, and we can specify one waypoint proxy deployment per service account. This means that if I have services A, B, and C, each one of those would have its own service account, and each one of those services would have its own waypoint proxy. So now, let's say A wants to talk to B. Traffic will go from A, and it'll first go to the secure overlay layer, which will handle mutual TLS. And then, since it's talking to B, the traffic would go to a waypoint proxy that represents B. Now, B's waypoint proxy would terminate the connection, do the mutual TLS handshake, and apply layer seven load balancing, so request level load balancing. It would do things like parse HTTP headers and make traffic routing decisions based on those headers. It would implement things like request level retries or request level timeouts, and then eventually pick a backing instance of service B, which lives on one of the nodes somewhere, and then it'll open up another connection, which ends up being mutual TLS, to that backend instance of service B.

[00:34:14] Christian: So I think you made a comment earlier that this starts to look like an ALB or load balancer type architecture, and it does at layer seven. Applications don't know anything about that, and they can get the benefits of mutual TLS, authorization policies, these types of things.

[00:34:33] Ethan: So how are we doing the hopping between layers then? Because we go from layer four and get that work done, and then we move it up to layer seven if we need to. Are we doing IP level chaining, or is it actually more like we're within the proxy processing and we're handing it off to a different process?

[00:34:49] Christian: No, it will go over the network.

[00:34:53] Ethan: Because we’ve got a different proxy instantiated for each layer. Okay, so there’s some fancy header rewriting and state tracking that’s got to be going on then.

[00:35:03] Christian: Yeah, so the traffic will go from the layer four component to the waypoint proxy using an HTTP/2 tunnel, and we can attach metadata as we tunnel the traffic over to the waypoint proxy.

[00:35:23] Ethan: Oh boy. Okay.

[00:35:26] Christian: So that's all stuff under the covers, yeah, exactly. We're moving how we track connections and apply policies and implement layer seven, et cetera, to a place that applications don't know or care about, right? They just care about getting those things implemented.

[00:35:47] Ethan: Right. And it sounds like ops people, the people that are operating the mesh, don't necessarily need to know about it either. We could assign a policy and say, we're going to go from here, and we need layer seven, so we're going to go over there. But we don't have to configure a transport mechanism to get it between layers. That just happens as a function of Istio ambient.

[00:36:04] Christian: The control plane does that. Control plane does all that.

[00:36:08] Ned: It reminds me a little bit of the deployment architecture I've seen in AWS, where they have the network load balancers, the NLBs, that run at layer four, super fast, efficient, right? But then oftentimes you'll see that being the top level, and then below that you'll have application load balancers that can do web application firewalling and all the advanced layer seven stuff, with one of those per application that's being hosted. So you can do the super fast layer four stuff at the network load balancer and then hand down to the ALB. It's not a perfect analogy, but it's similar, for anybody who's worked with AWS and seen that pattern before.

[00:36:46] Christian: Yes, that’s a good way to conceptualize it. And you get the same thing in the ambient mesh world. If you just need layer four, then you bypass any of the layer seven processing, and layer seven can be expensive. If you don’t need it, there’s no reason to parse the request stream. Just stay in the layer four world and you get super fast mutual TLS and authorization policies, a big chunk of what people come to the service mesh for in the beginning. And then if you need layer seven, which you likely will at some point, then you opt into it and it becomes a separate layer.
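
That split shows up in how you write policy, too. A layer four only rule can key purely on the mutual TLS workload identity, with no HTTP fields, so nothing has to parse the request stream. A hedged sketch of what such an Istio AuthorizationPolicy might look like (the workload and namespace names are illustrative):

```yaml
# Sketch: a layer-4-only policy. It matches only on mTLS identity
# (the SPIFFE principal), not on HTTP methods or paths, so no
# layer 7 parsing is needed. Names here are illustrative.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-frontend
  namespace: default
spec:
  selector:
    matchLabels:
      app: backend                 # hypothetical workload
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/default/sa/frontend"]
```

Adding an HTTP match (a method or path under `to.operation`) is what would pull the request up into layer seven processing.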

[00:37:24] Ned: Okay. And so that waypoint deployment becomes part of my manifest for deploying my application.

[00:37:33] Christian: No. Right now in Istio ambient, the way you deploy the waypoint proxy is there’s a YAML file, or you could just deploy it using a Kubernetes deployment. But there is a little helper in istio out of the box where you apply this little deploy YAML that will then deploy the waypoint proxies. Now at Solo, we’re working on taking a little bit better control over what that lifecycle looks like, and we’ll bring some of that stuff to open source as well, because there’s going to be a lot of work in that area. But we have customers that we know want to be able to control exactly where those waypoint proxies run, either for locality reasons or because they want those proxies to run on dedicated nodes for a little bit more deterministic performance. So we know that there’s going to be some control over where those proxies are. Otherwise those proxies are just Kubernetes deployments, and they would float to any node. But getting more granular control over that is definitely something we’re working on.
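
For context, the early ambient getting-started docs had you declare a waypoint with a small Gateway-style manifest (istioctl also shipped a helper that generated it). A hedged sketch; field names and the gateway class vary by Istio release, and the names below are illustrative:

```yaml
# Sketch of the "little deploy YAML" for a waypoint proxy, per the
# early ambient docs. Check istio.io for the current form — the
# annotation and gatewayClassName have changed across releases.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: my-app-waypoint            # hypothetical name
  namespace: default
  annotations:
    istio.io/service-account: my-app   # the service account this waypoint serves
spec:
  gatewayClassName: istio-mesh
```

The control plane watches for this resource and stands up the actual waypoint deployment, which is why, as discussed above, those proxies otherwise float like any other Kubernetes workload.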

[00:38:46] Ned: Got you. I think what Ethan was mentioning is, in terms of who’s actually going to be deploying these proxies, do you envision it being application teams as part of their application planning, or do you think of this as more the ops folks deploying the waypoint proxies in front of an application, just getting some of the requirements from the app team ahead of time?

[00:39:11] Christian: And this is what we’re kind of trying to work toward. Ideally the application team would say, I want two retries when I call this service, or I want request level traffic splitting because I’m going to do a canary release, and that’s all I care about. I’m just advocating, I want this. And then the underlying infrastructure, the underlying platform, will go take care of making sure that the layer seven proxies are all deployed in the right spot. In that way the developers, SREs, whoever it is, would drive the intent of what they want from the system, and then the platform would deploy those proxies.

[00:40:04] Ned: Got you. And in terms of deployment, you mentioned that each waypoint is linked to a service account. And I’m probably going to ask a stupid question here because I don’t remember, but can you have more than one waypoint proxy per namespace if you have more than one application deployed in namespace with different service accounts?

[00:40:26] Christian: Yes, exactly. That’s why I said it very specifically like that, because I think what I’ve seen to be common, maybe not best practice, but common, is people that are not using a service mesh, they’re just deploying their applications into a namespace, and they’re not always defining separate service accounts per application or per workload type. Okay, so if they do that, then they share one service account, the default service account for that namespace, for all of the applications in that namespace. Now if that were the case and you deploy in ambient, you would have one waypoint proxy that represented the default service account, which also happens to be where all of your applications are running in that namespace. You effectively have one waypoint proxy per namespace. But it’s considered good practice, especially if you want more fine grained control over how services communicate with each other, to assign service accounts per workload.
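
The per-workload service account practice Christian describes is plain Kubernetes, nothing mesh-specific. A minimal sketch, with illustrative names:

```yaml
# Sketch: give each workload its own identity instead of sharing the
# namespace's default service account. The mesh keys its identity
# (and thus the waypoint mapping) off this. Names are illustrative.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders
  namespace: shop
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
  namespace: shop
spec:
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      serviceAccountName: orders       # per-workload identity
      containers:
      - name: app
        image: example/orders:latest   # hypothetical image
```

With one of these per workload, each application can get its own waypoint proxy rather than all sharing the one attached to the namespace default.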

[00:41:47] Ned: Okay, I got you. Now, when I was reading some of the documentation, it got a little confusing between namespaces and service accounts. You’ve cleared that up for me. I get it, like, okay, this is the way I would probably want to set it up, obviously by service account, with a service account tied to a single application that needs that layer seven processing, whatever the processing is for it. Okay.

[00:42:09] Christian: Exactly. Yeah. I think if you saw the blogs or the documentation somewhere, they kind of use namespace and service account interchangeably. But it’s definitely service account. Depending on how people have deployed their service accounts, though, it might end up being per namespace, because there’s only one service account.

[00:42:26] Ned: Got you.

[00:42:27] Ethan: Christian, talk to us about network latency, because if I’m moving layer to layer and there are network calls involved and tunneling and so on, how much overhead am I getting into? Do I have to be concerned about that?

[00:42:37] Christian: Yeah. So when we first started proving this out, we evaluated the latency of processing the request, the measurements that we took from running the sidecar versus introducing this extra hop. If you look at the sidecar, you’re processing it on both sides, the origin and the termination of that request or connection, and you’re doing layer seven on both sides. Typically, if istio can detect it’s HTTP, or you’ve labeled your ports HTTP, it will parse the HTTP stream. And so what we saw in those early experiments is that if we eliminate that layer seven processing in at least one of those places and just move it to the waypoint proxy, we basically trade it for some layer four processing. Two layer four hops and one layer seven was cheaper in our tests than forcing everything through layer seven, always, all the time, on both sides. Now, in our first release here of ambient, we’re still not at that performance threshold. We’re not there. But we know what we need to do and what we need to optimize to get to what we saw in our initial proving out of the solution. There is latency. You will take some minimal amount of network latency because of those hops, but it’s extremely minimal.

[00:44:29] Christian: But by eliminating some of the layer seven processing when we don’t need it, then we gain that back or better.

[00:44:36] Ethan: Does eBPF factor into this architecture at all at some point?

[00:44:41] Christian: Absolutely it does. And actually, that was the initial thing. That’s what we were working on at Solo that started to get us to go down this path, which was, how do we optimize? Because eBPF has a place in this architecture for optimizing the network paths and pulling some of the telemetry that we’ll need about the traffic flowing through the kernel. And that is an area, so when we start to look at, I think Ned pointed out how that secure overlay layer, that layer four component, starts to really become part of the CNI, or complement it. There are some synergies there. That’s the area where we feel we can offload some of that layer four behavior to eBPF and continue to optimize the data path here.

[00:45:44] Ned: Okay, one important thing that I’m not sure we touched on, and I just want to clarify: can I run ambient mesh sort of in a hybrid mode and still use sidecars where it’s required, or where there’s not feature parity yet?

[00:46:00] Christian: Yes.

[00:46:01] Ethan: Architects will build the most complex thing ever. Jeez, they can.

[00:46:06] Christian: Well, the reality is, especially for the people who are currently using istio, there will need to be some of that transition period. What I believe is the goal, what we think will happen here, is that when ambient gets to a mature state, that’s where most people will start, with ambient. And then they would bring in a sidecar deployment if they really need that extra something, like an application that needs its own sidecar and needs to allocate the resources because it needs to behave a certain way. I’m sure we’re going to see optimizations where we need that, but I think most people are going to start with ambient once it’s mature. But the people who have sidecars today, if they migrate over, they’re going to need to support traffic flowing through the mesh, whether it’s a sidecar or ambient. And so when we released the initial version of ambient, we do support that interop between sidecars and the ambient workloads, and we’ll continue to do that. Sidecar will continue to be a first class deployment option in istio, and we’ll need to support the interop between those two.

[00:47:28] Ethan: Everything in me just screams, built out another cluster, then move with the new architecture, not do the simultaneous hybrid model.

[00:47:35] Christian: If you can, 100%. Unfortunately, dealing with the enterprise constraints that we have to deal with, it might not be that straightforward. In an ideal world, yes.

[00:47:49] Ned: Right. If folks want to kick the tires with this, they want to give it a try. Is there a specific version of istio they have to have installed, and do they just then roll out a helm chart? What’s sort of the deployment model if they want to try this?

[00:48:04] Christian: Yeah. Again, at time of recording here, there is a very specific version, but this is very quickly merging upstream. So it’s not in a release right now, but it will merge into the main line where we can cut releases here pretty soon. 1.16, I think, is what we’re trying to be ready for. It might not be, but it very soon should be in an actual release. Until then, I would say there’s a blog on istio.io that shows you how to get started. And at Solo, we built a workshop on Instruqt, which is an awesome environment, by the way, to give you a self paced, hands on introduction to ambient mesh. Cool.

[00:48:56] Ned: And if people want to find that, can they just go to solo.io?

[00:49:00] Christian: Yeah, go to solo.io, under Solo Academy.

[00:49:03] Ned: Okay. All right.

[00:49:04] Christian: Awesome.

[00:49:04] Ned: Yeah, I’ve used the Instruqt platform, and I really like it. It just gives you a sandbox that has some time on it, and you can just mess around, do your thing, follow the lab. Much more hands on than the standard just click and watch situation, right?

[00:49:20] Christian: Yeah.

[00:49:22] Ned: Cool. Well, Christian Posta, thank you so much for being a guest today on Day Two Cloud. We really appreciate you taking the time and giving us some knowledge on Istio ambient mesh.

[00:49:33] Christian: Yeah. Thank you all. Thanks for having me. Happy to do this.

[00:49:36] Ned: Absolutely. And hey, virtual high fives to you out there for tuning in. If you have suggestions for future shows, we would love to hear them. You can hit either of us up on Twitter. We both monitor the account at Day Two Cloud Show, or if that’s not your thing, you can head over to daytwocloud.io and fill out the nice request form that Ethan so helpfully set up for me. Thank you again, Ethan. Did you know that Packet Pushers has a weekly newsletter? It’s true. It’s called Human Infrastructure Magazine. It is loaded with the best stuff we found on the internet, plus our own feature articles and commentary. It’s free and it does not suck. You can get the next issue via packetpushers.net/newsletter. Until next time, just remember: cloud is what happens while IT is making other plans.
