
Day Two Cloud 082: You Don’t Need A Service Mesh

You don’t need a service mesh. You probably don’t even need microservices. That’s the message from the person who created Envoy, the open-source proxy that serves as a networking abstraction layer in microservices architectures. Our guest is Matt Klein, Digital Plumber at Lyft.

This episode gets into big ideas about application design; challenges around networking, load balancing, and security in dynamic systems; service meshes; and a lot more.

We discuss:

  • Major networking issues with containers, Kubernetes, and highly dynamic environments
  • Accepting that things fail and finding ways to deal with those failures
  • Ingress and egress control, API gateways
  • How to tell when you need a service mesh–and when you don’t
  • Will we be stuck managing monoliths alongside microservices?
  • Whether a service mesh makes sense for serverless

Sponsor: BMC

Is your business on its A-Game? It’s when systems are intelligent, automation is effortless, and when technology and people work as one. The A-Game is your business at its best. BMC calls this the Autonomous Digital Enterprise. Find out more at bmc.com/agame.

Show Links:

MattKlein123.dev – Matt’s Web site

Matt Klein’s Office Hours

@mattklein123 – Matt on Twitter

Envoyproxy.io

Transcript:

[00:00:00.960] – Ned
[Ad] Is your business on its A-Game? It’s when systems are intelligent, automation is effortless, and when technology and people work as one. The A-Game is your business at its best. BMC calls this the Autonomous Digital Enterprise. You can find out more at BMC.com/agame.

[00:00:26.680] Welcome to Day Two Cloud and boy listener, are you in for a treat? This is going to be a roller coaster of an episode for you. We are talking to Matt Klein from Lyft, the guy who started the Envoy project.

[00:00:39.640] And he has a message for you: you don’t need a service mesh, you don’t even need microservices. And I’m going to stop there because there’s a lot more to talk about. But I’m going to throw this to you. Ethan, what stuck out to you in the conversation?

[00:00:52.150] – Ethan
He even got to the point of, you know, if a virtual machine works for you, use that. I’m going to shut up now because Matt’s great. Let’s get into that conversation as quick as we can.

[00:00:59.750] – Ned
All right. Absolutely. Let’s get started with Matt Klein. Matt Klein, you are a digital plumber at Lyft. That title is interesting to me. Absolutely happy to have you here. Why don’t you tell the folks a little bit about yourself and what you do as a digital plumber?

[00:01:18.110] – Matt
Sure. I’m a digital plumber slash software engineer at Lyft, where I’ve been now, amazingly, for over five and a half years or so, quite some time at Lyft. I do two things. I spend about 50 to 60 percent of my time doing infrastructure leadership. So I help lead our teams that work on networking and service mesh and API gateway and compute and deploys and all of those types of things. And then I spend 40 to 50 percent of my time working with the open source folks.

[00:01:49.570] I started the Envoy proxy project at Lyft, started it internally about five and a half years ago also. And since then it’s become open source. It’s become, much to my amazement, an extremely popular project. So now my time is split pretty much evenly between the internal leadership as well as the external project leadership.

[00:02:10.570] – Ned
Wow. Yeah, I have certainly encountered Envoy in my day-to-day because I’ve been working with another company, Solo.io, on their API gateway, which of course uses Envoy. So that was my introduction, and suddenly I was thrown into the world of xDS and other DSes of different kinds. So I have you to blame, is what I’m hearing here.

[00:02:31.630] – Matt
I am probably the root of all blame here. Obviously now the community and the usership have grown so much that I don’t think I can take all the blame. There are just so many people working on the system now. But yes, I am the root of blame, as it were.

[00:02:51.520] – Ned
Well, I actually really enjoy it. Thank you for what you’ve done. Now, what we wanted to talk to you about on the show today is service mesh, but we can’t dive straight into service mesh, right? Let’s start by zooming out a little bit to the world of networking and containers. Networking seems pretty challenging. I mean, with Kubernetes, it’s very dynamic. There might be one or two or three layers of networks underneath Kubernetes.

[00:03:21.490] So what sort of issues do you see people struggling with when it comes to networking and Kubernetes?

[00:03:27.910] – Matt
Yeah, so, you know, it’s not just Kubernetes. I think it’s important to step back. And I think what I would say is that computer networking is complicated, period, just in the sense that it’s a domain where things do fail.

[00:03:45.940] I think lots of times for software engineers, it is relatively easy to pretend that things don’t fail. And I don’t know that people make their software and systems robust enough to the failure that actually does happen. But in the domain of computer networking, things fail so often that it’s almost impossible to pretend that those types of failures don’t occur. So I think stepping back, it’s really that even in, quote unquote, simple computer networking where you have fixed hosts and things like fixed IP addresses, you’re still seeing failures, you’re still seeing packet drops and hosts going offline and all of those types of things.

[00:04:27.700] I think the issue with Kubernetes and microservice deployments, and I would say highly dynamic environments generally, so again, it’s not really Kubernetes, you know, you could have a functional system like AWS Lambda or Google Cloud Run, is just the idea that we have resources that are coming and going, whether they’re auto scaling, whether they’re functions. And in these types of highly dynamic environments, the types of networking failures that one sees with things coming, things going, things failing, the rate of those occurrences is so much higher, probably an order of magnitude or more higher.

[00:05:07.600] So all of the edge cases that might have been, oh, I can kind of brush those under the rug, it’s really impossible to brush those things under the rug anymore. They just happen so often.

[00:05:20.140] – Ethan
Well, we would have a higher error rate. So do you mean higher in just the fact that there are so many more transactions going on, that there are just proportionately more transaction failures?

[00:05:28.990] – Matt
Well, it’s not even about the transaction failures. It’s more about the fact that resources themselves are actually dynamic. So when it comes to auto scaling environments or auto scaling container environments, it’s the rate of change. Meaning, let’s say that I scale up and scale down: technologies like Kubernetes or other functional platforms are going to allow me to scale up and scale down more often, in a finer response to load, and with that more dynamic auto scaling comes the assumption that things can just come and go.

[00:06:07.750] So take a Kubernetes or a container deployment environment in which maybe I have two classes of jobs running. Maybe I have what I would call real time jobs, and then I have what I would call batch jobs, things like machine learning workloads or things that can run in the background. You may actually design your system from a cost perspective to preempt or terminate the batch workloads if the real time workloads have to run, right?

[00:06:36.190] And so really, not only are the resources not fixed, but you’re designing in failure. You’re designing in the fact that your service topology may change at any particular time. So it’s not just raw transaction failure. It’s the fact that my entities on the network, even if the network, as we said before, is multiple layers of abstraction, layer upon layer upon layer, it’s the fact that these things can just come and go so often that routing problems happen, networking problems happen, and a host might drop offline in the middle of a network transaction.

[00:07:14.890] You might have health checking failures. You might have all manner of failures. And your service discovery system, the system that tells you the topology, that allows you to figure out, OK, what host can I route to, what host can I load balance to? That is also changing. So just the rate of change causes more pressure on the system. It’s just a more complicated environment.
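
To make that churn concrete, here is a minimal Go sketch of a client keeping an eventually consistent endpoint table that a discovery watch keeps rewriting. The names, addresses, and polling loop are illustrative assumptions, not how Envoy or Lyft’s service discovery actually works; the point is that whatever snapshot the client reads may already be stale, so callers still need timeouts, retries, and health checking.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// EndpointTable is the client's eventually consistent view of which hosts
// can serve a service. A discovery watcher replaces the set as containers,
// functions, or VMs come and go; the picker reads whatever snapshot it has.
type EndpointTable struct {
	mu    sync.RWMutex
	hosts []string
}

// Update swaps in a new snapshot from the service discovery system.
func (t *EndpointTable) Update(hosts []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.hosts = append([]string(nil), hosts...)
}

// Pick chooses a host at random. Because the snapshot lags reality, the
// chosen host may already be gone by the time a request reaches it.
func (t *EndpointTable) Pick() (string, bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	if len(t.hosts) == 0 {
		return "", false
	}
	return t.hosts[rand.Intn(len(t.hosts))], true
}

func main() {
	table := &EndpointTable{}

	// Simulated discovery watch: in a real system this would be an xDS
	// stream, a DNS poll, or a Kubernetes Endpoints watch.
	go func() {
		for i := 0; ; i++ {
			table.Update([]string{
				fmt.Sprintf("10.0.0.%d:8080", i%250+1),
				fmt.Sprintf("10.0.0.%d:8080", (i+1)%250+1),
			})
			time.Sleep(2 * time.Second) // endpoints churn constantly
		}
	}()

	for i := 0; i < 5; i++ {
		if host, ok := table.Pick(); ok {
			fmt.Println("routing request to", host)
		} else {
			fmt.Println("no healthy endpoints known yet")
		}
		time.Sleep(time.Second)
	}
}
```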

[00:07:38.770] – Ethan
You happen to be talking to a bunch of network engineers that are probably listening to this show as well as cloud people, Matt. And when you say networking, you’re talking a little higher up the stack at the service layer, not like top of rack switches and cabling and that kind of stuff.

[00:07:52.570] – Matt
Correct. Right. I’m talking about application networking. So things that would typically sit on top, really at the L3/L4 layer and above.

[00:08:05.650] So you’re talking about an application that might come up at a particular IP address, on a particular port, and maybe it’s only up for 30 seconds and then it’s down. Right. So we have this eventually consistent system that is always converging. And it’s not that at the L2/L3 layer things don’t converge, using things like ECMP and BGP, et cetera. But those convergence events are typically not the norm; they’re in response to a particular failure. And what I’m saying is that in these highly dynamic, highly scalable application environments, this type of convergence is actually the norm.

[00:08:47.080] It’s not just in response to failure. So whereas in a lot of the lower layer networking systems, some of the failures happen relatively infrequently and maybe you can suffer a multi-second downtime. It happens so frequently in these application architectures that the bar to having it be lossless is higher.

[00:09:10.690] – Ned
Right. So you necessarily have to introduce new constructs to deal with all these additional failures.

[00:09:18.070] – Matt
And it’s not just failures. It’s also making some of the things that would be failures graceful. So, for example, let’s just take something, and again, the fact that it’s using Kubernetes doesn’t actually matter. Let’s just say that I have a highly dynamic environment in which I’m spawning containers, and to take this example the furthest, maybe they’re running single functions and maybe they’re only running for tens of milliseconds or hundreds of milliseconds.

[00:09:53.020] So it’s possible that I’m bringing something up, I’m bringing it into my service discovery system, and I’m bringing it down. Now, in order to appropriately route traffic to these entities that may not live for a very long period of time, and have that be done gracefully, you have to be very intentional and careful about how you do service discovery, how you do health checking, how you do failure detection and how you do graceful draining. So say I decide to take a host or an entity out of rotation.

[00:10:25.240] I typically don’t want to just turn it off, right? I have to notify all of the people that might send traffic to it to say, stop sending traffic, and then I’m going to turn it off. So there are just a lot of things that have to come together to make a system with that rate of change work gracefully. That’s where the challenge comes from.
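
As a rough illustration of that draining sequence, here is a minimal Go sketch: fail the health check first so load balancers and service discovery stop sending new traffic, wait for that change to propagate, then let in-flight requests finish before exiting. The endpoints and timings are made-up values for illustration, not a prescribed implementation.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

// draining flips to true once we decide to take this instance out of
// rotation. The health check endpoint starts failing so callers stop
// sending new traffic, while in-flight requests are allowed to finish.
var draining atomic.Bool

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() {
			http.Error(w, "draining", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Pretend a deploy or scale-down event arrives after a while.
	time.Sleep(30 * time.Second)

	// Step 1: start failing health checks / deregister from discovery.
	draining.Store(true)

	// Step 2: wait long enough for callers to observe the change.
	time.Sleep(10 * time.Second)

	// Step 3: stop accepting new connections and let in-flight requests
	// complete before shutting down.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}
```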

[00:10:49.660] – Ethan
All the problems we’ve ever had with load balancing architectures for decades, but at a scale where the rate of change is happening very, very quickly.

[00:10:59.470] – Matt
Exactly right. And it’s really the rate of change that actually makes it the most difficult. It’s not that the problems have significantly changed. Meaning, as you say, we’ve been doing load balancing, and yes, we’ve moved from hardware devices to software, but most of the techniques are very much the same as the ones that we’ve been using for the last few decades. It’s the rate of change. It’s the fact that when I’m talking about a physical rack or physical infrastructure, you install those things.

And yes, of course, they fail. And the protocols have to deal with failure. But you’re not going around plugging and unplugging top of rack switches every three seconds. That just doesn’t happen. Whereas with these application frameworks, you might actually be doing the equivalent of that. And that gets very challenging.

[00:11:48.650] – Ned
That’s a lot to think about, and it sounds like until you reach that rate of change, previous systems will work just fine. It’s when you hit that rate of change that things suddenly break, and I’m sure it’s a compounding issue as well.

[00:12:02.810] – Matt
Yeah. So to give you a very concrete example, as I said, we built Envoy at Lyft. We started that project at Lyft over five and a half years ago when I joined, and without getting too much into the weeds of history, just to give a very brief history of the project: Envoy started at Lyft not as a service mesh, it started actually as an API gateway.

So we first rolled out Envoy as an API gateway to get better observability. And again, Lyft at the time was facing most of the problems that people face when they migrate from monolithic architectures to microservice architectures. Just lots of problems around networking that are hard to understand. Lots of problems around lack of observability. And again, I won’t go into the huge, long history unless you want to talk about it. But over the next six to nine months, we eventually deployed Envoy almost everywhere, as both a client-side service load balancing proxy as well as an API gateway.

And we got a lot of the benefits around abstracting the network, abstracting a lot of the types of failures that we’re talking about here, so the application developers can focus more on their underlying business logic. But at the time that Envoy was built for Lyft, Lyft was not using Kubernetes. Right. We were doing this five and a half years ago; Kubernetes was just really starting back then. So Lyft at the time, like many companies of its era, was running on a bespoke virtual machine based infrastructure with a bespoke deployment system.

And again, think about the rate of change of that system. Of course, it did auto scaling, meaning we would auto scale up in response to load, then auto scale down in response to load. And obviously EC2 virtual machines would fail, but it would take several minutes to bring up a new machine, or a machine wouldn’t scale down for several minutes or something like that. So the Envoy based deployment was built up around that granularity of scaling, meaning things were scaling on the order of five to ten minutes.

Right. Not five to 10 seconds. And what we found as Lyft started its Kubernetes migration, and that is almost done now, but has been ongoing for several years, is that Kubernetes, by its very nature, increases the rate of change of most of these events, meaning I scale up on the order of seconds or I scale down on the order of seconds. So the rate of change on the system is an order of magnitude more than what we had before.

And we saw that break in all kinds of places. It puts stress on the service discovery system. It puts stress on Envoy to deal with the rate of change that we’re dealing with. It puts stress on the health checking system. It puts stress on basically every component. And we had to make quite a few changes to make the system scale for that greater order of magnitude of change. Now, we can have a separate conversation as to whether that work and migration was actually worth it.

But that’s a totally separate conversation. If we take it as a given that we had to do that migration and that that rate of change was appropriate and what we had to deal with, it was just a much more complicated problem. It’s a higher level of scaling requirement than what we had before.

[00:15:27.980] – Ned
And my follow-up question to that would be: is there a progression of things you need to have in place as that dynamic environment evolves? Do you start out with a load balancer, and then you need an API gateway, and then you need some sort of egress control and a service mesh? Is there a nice progression that way, or is it that I hit the tipping point and so I need all of these things and I need them yesterday?

[00:15:50.720] – Matt
I don’t think there’s a typical progression, per se, and this gets into more around who needs what. And this is a very popular topic of conversation these days. And it’s very easy to go on Twitter and to learn about all the different types of technology that one might need. And I personally take a very pragmatic view on these things, which is I always tell people you don’t need any new technology until proven otherwise. And so to me, the question really is what are the things that are going to prove it otherwise?

And that’s why when I’m talking to organizations, I always tell them, stick with your monolithic architecture, build on the cloud, use the cloud load balancer. Right. You can get a long way using a cloud load balancer with your monolithic application and some cloud database. Many companies worth billions of dollars have been started in that simple way. And I always chuckle because sometimes these days you see these very small companies or small products that have five or seven developers and they have these incredibly complicated architectures with microservices and service meshes and all these things.

And honestly, I just chuckle because it’s not a good use of time. These things are not providing business value. So what I always caution people is you always want to pick the simplest technology possible until there’s a clear business reason that you can’t do that anymore. And to answer your question specifically, the reason that it usually happens is a human reason. It’s not a technical reason. The human reason being that your organization scales in number of people to a point at which a monolithic, simple architecture with a single database and a single application and a single load balancer typically doesn’t scale.

And I don’t mean technically scale, I mean human scale, meaning we simply have too many people that are trying to get changes into the same thing and you’re having problems with deploys. And so when I look at the modern microservice architectures or the modern service oriented architectures, they’re really a technical solution to a human problem. But many things are, right? I mean, that’s just the way that the world works. So what I usually tell people is that bringing on these types of technologies is going to yield a lot of pain.

[00:18:27.660] I mean, they’re just very complicated and they have to be replacing enough. Like it has to be net pain negative, right? Meaning like it has to be replacing more pain than it’s adding and I think a lot of times people don’t think about it that way. They view these types of technologies or they look at very successful companies that have hundreds of thousands of engineers who’ve adopted particular types of deployments. And they say, well, X company has done this, therefore, I must need this.

And they’re not accounting for the fact that those companies have people maintaining those systems, or they’ve done a ton of engineering to make those systems work, meaning there’s pain that’s involved in those systems, but they’re replacing enough pain that it’s worth it. I think people don’t think about it that way. And that causes the adoption of a lot of technologies that people probably shouldn’t look at.

[00:19:26.940] – Ethan
Matt, I look at a lot of this stuff, and you just said, you know, the complexity and all of that. I look at a lot of these things and I’m reading the release notes. Like, we’re recording this in mid-December 2020, and Kubernetes 1.20 just came out. You read the release notes for that, and it looks like a train wreck just happened.

We put all these new features in. There’s all this new alpha stuff. It’s off the wall. The rate of innovation is amazing. And I’m looking at that and going, OK, so what enterprise really wants to go down that road and adopt these things when they don’t have an army of engineers? They’re looking for something that’s a more mature and stable kind of product. And in many cases, I don’t feel like we’re there yet, for a lot of the reasons you just said. Are we there? Is there some level of maturity with container networking and Kubernetes and service meshes?

[00:20:08.880] – Matt
In some sense? So, no. But I think it’s a bit more nuanced conversation. And by nuanced, I mean that on one hand we can look at it and say, well, I’m not sure when Heroku started, 10 years ago, 15 years ago, I’m actually not sure, it would be an interesting point, but we’re still not back to an open source solution that works as well as Heroku worked 10-plus years ago.

Right. And so in some sense that’s a pretty damning fact for us as an infrastructure engineering community right now. On the flip side of that, lots of things have changed. There’s been a big move towards open source; people want open source solutions that they can modify and build on. There’s been a big move towards public cloud. So there’s lots of innovation happening. And I think that if we look at it from a public cloud perspective, it’s pretty clear, meaning I think the writing is on the wall in terms of the way the architectures are going to go. If you look at the investment that AWS is putting into systems like Lambda, or Google is putting into Cloud Run, and Azure is putting into their functional system, to me it’s absolutely clear that in the next five to 10 to 15 years, people won’t care about Kubernetes.

They won’t care about Envoy. Right. They just want to write their code. They want to have their code auto scale and run and load balance and expose some endpoints to the Internet. And they want to have databases, they want some caches, and they want some Pub/Sub systems. That’s what people want. Now we’re getting there on multiple angles, right? Angle one is the cloud provider perspective, where they’re building these systems, but they’re going to try to lock people into their particular cloud platform.

And then we’re getting there on the open source side, where people are implementing a bunch of these things in open source. And many of the enterprises, again, we can have a longer conversation about why this is, but I think many enterprises view open source as an insurance policy, in the sense that they may never edit the source code. Right. They may never understand the complexity of Kubernetes or some of those things, but they want the insurance policy to know that if they ever had to pay someone to edit those files or fix some bug, they’re not at the whim of a proprietary vendor.

My larger point is that we have a bunch of competing concerns here. And though I think over time we’re actually trying to get back to a place where people can scalably build their applications like they used to on something like Heroku, it’s taking a bunch of time across different vendors, across different open source solutions. So to get back to your specific question about what the enterprises want, you’re right. Enterprises want something that works and that satisfies their goals.

But it’s a lot more complicated than that, because enterprises in particular are not greenfield. They’re not making new applications. It’s a mess of decades of existing applications that may include mainframes. I mean, you see these environments that go all the way back for decades and decades.

So again, I have a very pragmatic approach. And I think what you see is a lot of vendors out there who are targeting the, quote unquote, Kubernetes wave, where they’re making the assumption that the application is entirely in Kubernetes, so they can have Kubernetes-based solutions. And that’s fine. But the real world is much messier than that. And it requires the merging of lots of different types of solutions. And there really isn’t one size fits all.

And I think that we’re going to see a mix of migration types and different platforms, and I think depending on the sophistication of the enterprise, they’ll pick and choose different building blocks. But again, I’m a firm believer that the only way to do this sanely is to have layers of abstraction, and the enterprise in question has to make their own choice as to what layer they want to hook in at. Right. If they’re most sophisticated, maybe they hook in at the Kubernetes and Envoy layer.

[00:24:31.330] If they’re less sophisticated, maybe they pop in at the functional layer or they use one of the cloud provider hybrid solutions or they go full on prem or something like that.

[00:24:43.240] – Ned
[Ad] BMC wants to know: is your business on its A-Game? It’s when systems are intelligent by learning from markets, where automation is paramount yet effortless, and when technology and people work as one in an enterprise. The A-Game is your business at its absolute best. BMC calls this the autonomous digital enterprise. You can find out more at BMC.com/agame.

[00:25:12.810] – Matt
There’s just no silver bullet here. We’re dealing with a very high rate of change of systems. Things are very rough. I call it the cloud native Wild West. And I like to hope that things will be better and a bit more standardized in the next five, 10, 15 years. But we’re not there now. I mean, at this point, there’s a lot going on.

[00:25:36.540] – Ned
You look at the CNCF landscape and it’s just this complete mess of companies and projects and standards and they’re all competing with each other. And I always think about that xkcd cartoon where it’s like, there are 14 competing standards; there should just be one; I’m going to invent it. And now there are 15 competing standards. So is it just a matter of time for this stuff to shake out so we will have fewer competing standards and projects?

[00:26:03.600] – Matt
I don’t know. So, yes, I think time will always help. And again, this is the area where I think you’re going to talk to ten people and you’re going to get ten different opinions on what I’m about to say. I think some of the trends are clear. Who the winners are going to be is less clear, because this is a hugely competitive landscape where, frankly, there are millions, hundreds of millions, billions of dollars on the table in terms of potential revenues.

And so we’re not talking about something where the technically best solution is necessarily going to win. I mean, this is a very complicated area with lots of players. Now by trends, we’re coming back to what I was saying before, which is to me, the trends are super clear, which is that no one cares about Kubernetes. And by no one, I mean the end user. The people that are writing the applications that actually make companies money don’t care about Kubernetes.

They don’t care about Envoy. They don’t care about most of the plumbing that we’re gravitating towards to make some of these things work. Meaning, and again, I keep coming back to the cloud provider functional offerings, which as a product are the most mature. If I look at Heroku today, or I look at Lambda or any of the cloud providers’ functional offerings, I don’t care what load balancer they use, I don’t care what orchestration system they use.

It doesn’t matter to me. I just want some code. Again, I want it to be exposed to the Internet or not. I want some ACLs or not. I want a database or not. I just want it to work. So to me, the trends are really obvious, and in practice, under the hood, we can discuss whether it’s good or not. But I think that Kubernetes is becoming a standard that a lot of technologies are going to be built on.

Do the end users need to know that they’re using Kubernetes? Probably not. Similarly with Envoy. Again, lots of people have opinions on whether this is good or not. I’m biased. I think Envoy is becoming the de facto standard that people are using to build a lot of the networking components of these systems. Do most people need to know that it’s Envoy? No, they don’t. They just want the features. Right.

So I think the trends are clear. The trends are that we’re moving towards more abstracted, functional platforms. We are moving towards technologies like Kubernetes and Envoy probably becoming an implementation detail that a lot of these systems are built on top of. But what are the systems on top, and who wins that battle? That I can’t say. And I think that the next five to 10 to 15 years are going to see a high rate of change in terms of the cloud vendors and what they’re doing, their approach to hybrid.

By hybrid, I mean on-prem/cloud, and the purely open source solutions that are trying to build you this Heroku-like functional stack on top of purely open source software. I think we’re going to see a lot of evolution in the solutions that people are using, but it’s very hard for me to foresee who the winners are at this point because the competition is so fierce right now. And I think that a lot of the solutions that we have today are super rough.

They require a lot of technical expertise to piece together the insane landscape and figure out what portions of the system we’re going to use. And it’s just too complicated today, for sure.

[00:29:42.270] – Ned
That was sort of the promise of the cloud: we’ll manage that for you. You know, you just deploy your application. You just do your thing. And it feels like some of the earliest cloud services, that’s all they were. It was literally, load your code into here and we’re going to run it. And then as things grew, people wanted more control over what was running underneath. So you got things like IaaS available, where you run your own virtual machines. And that model, that whole idea, has taken a long time to come down on-prem because it’s such a mess on-prem.

[00:30:13.170] – Matt
Yeah. So by any measure, cloud has been hugely successful. Right. I mean, if you look at the types of applications that people can build now in a literal fraction of the time that they could build them previously, it’s been an incredible success. That success has come with its own downside, and that downside is that it is so much easier now to build solutions, to build standards, to build technologies, that we see a proliferation of them. Again, as you say, there are 14 standards, let’s just make something new. And I mean, I’m certainly guilty of that also; I built Envoy when there were existing solutions. So we like to think that we’re doing this for good reasons, and sometimes there are. In that sense I’m a technical capitalist, and by that I mean I think the market generally sorts itself out. People gravitate towards solutions that ultimately work for them.

But I think that because the rate of change is so high, we just have so many solutions that are being built. It’s a bit of a scrum right now. Right. There are a lot of people that are throwing things over the wall. There’s a lot of money around, both in terms of cloud provider capital and venture capital. And there’s so much money pouring in, there are so many solutions being built, there are so many people that want to undergo their, quote unquote, cloud native migration, that we’re just seeing the marketplace be bombarded with different solutions, some of them competing, some not.

And I do think that over time some of the trends are going to become a bit more stable. And I think there will be winners and losers, but it’s going to take some time for all of that to sort itself out. And in the meantime, it’s messy for each organization. And again, we’re bringing this conversation full circle, but because of how messy things are right now, the biggest piece of advice that I give people is focus on the business need first.

Right. Focus on what the problems are for you, the business, for your company. What are you trying to solve, and what is the simplest and cheapest way of doing that? And there is not going to be one answer. The answer is different for every organization. And too often these days we see people do technology herding, meaning they see what certain companies do, companies that are very vocal about what has worked for them, and then people assume that because company X has done Y, they must also do Y.

[00:32:58.170] And I think right now when we just have this proliferation of solutions, that’s not the right approach to take.

[00:33:05.160] – Ned
Let’s say I’m an operator, I’m an admin, I am down in the trenches and I’m hearing all the things you’re saying. And I’m like, well, forget it, I should just quit and go grow hay or something because this is just completely insane. What’s your advice to that person who looks at this insane landscape and everything constantly shifting and goes, well, can I just run a virtual machine? Why do I need all this other stuff?

[00:33:31.110] – Matt
I would say to them, you can. And that’s what I tell people. So first of all, I am actually a late adopter now, which you might find odd, given that I’m one of the poster children for the cloud native movement. But I’m a relatively late technology adopter. I mean, look, I wrote Envoy in C++, right? I tend to be on the trailing end of technology adoption just because I typically have found that to be a better business outcome.

I tend to look for business problems that are not solved by existing things on the market and then try to figure out the right pragmatic way to fix those issues. So to the person that you’re talking about who says, well, why wouldn’t I just use a virtual machine, it works fine for me, I would say just use a virtual machine. If it works fine for you, I would use the simplest technology that will get the job done, because those technologies tend to have the least amount of dependencies. Like there was the big AWS Kinesis outage a couple of weeks ago.

And again, without getting into the details of that outage, what was interesting is to see, when Kinesis failed, how many other AWS services also failed. And some of those were things like Lambda. Right. And in general, things that are older and more stable will tend to have fewer dependencies on other systems just by their very nature. So from a risk avoidance perspective, to talk to your sysadmin or someone who’s saying, can I just use a virtual machine: if a virtual machine works for you, it’s very simple and it’s unlikely to fail in almost any scenario.

[00:35:20.280] So you know what? Just go and use it.

[00:35:22.650] – Ethan
Let’s turn this on its head, though, Matt. What if I’m used to using a virtual machine, and a really good operational process is built around that? And so therefore, I don’t know what I don’t know. And maybe microservices, which are all the rage, maybe that’s better for me. But how do I know? How do I know that I shouldn’t go that way? It seems like the right thing to do. So many people are blogging about it.

[00:35:44.600] – Matt
Because to me, the way that you just phrased it is exactly the wrong way to go about any change. Right. Because to me, you only look for new solutions if you have problems with your existing solutions. This comes back to what I was saying before, which is the way that you have more bugs is you write new software, right? New software yields new bugs.

So by changing your system, you’re going to break something. It’s just the way of the world. So I always counsel people, it’s fine to read blogs and to go to conferences and to look at talks and try to understand what maybe we would call the state of the art. But I would never counsel change for change’s sake. I would only counsel change if the current situation is not working. And then the key is, let’s figure out why is it not working and what is the simplest change that I can make that will make it work.

[00:36:47.750] – Ned
I want to play devil’s advocate a little bit here, because you said the way that you introduce more bugs into your code is by writing new code, which is absolutely true. But the other thing you get when you write new code is new services and features that your consumers and customers might want or benefit from. It’s that whole thing: if you asked a buggy maker what they wanted, they’d want a faster horse. But what they really need is a car. How do you know you’re not missing out on the car?

[00:37:15.170] – Matt
Yeah, that’s a great question. I think there are two aspects to that. Aspect one is that, up to now, we’ve mostly been talking about infrastructure software, right, infrastructure software in service of some application, some business logic. And again, this is a two part answer. So part one is, if I’m providing infrastructure for a business process or some actual end user need, my customer is the end user.

Either I’m satisfying them or not. Right. Are they able to run the business and deliver the business features that they need? So to me, the most important thing is to look at the customer base and say: is my customer, as an infrastructure provider, able to yield the business features and the business outcomes that they desire? If they can yield that, given the existing systems, then there’s no need to make any changes.

So it’s a bit of a constant iteration cycle to figure out if I’m satisfying my customer. Now, part two of this is, as an infrastructure provider, whether I’m an internal infrastructure provider or a vendor, whether that be an on-prem vendor or a public cloud, certainly there’s an element here which is interesting, because there’s leading the customer and there’s also following the customer. And I’m a big believer that you have to do both.

Right. You can be so far ahead of your customer that you’re not giving them the technologies that they need to solve their present day problems. But if you only follow the customer and you don’t lead them to where you want to go, you’re probably never going to make the incremental improvements that will allow them to unlock more business value. So I think it’s a constant back and forth. The thing is, I think in many cases, unless one is a vendor or a public cloud provider doing, I would say, more foundational development,

I think if we come back to the standard enterprises, typically there’s less leading the customer and a bit more following the customer. And I would always encourage people to really listen to the business needs of what people need to do and make sure that that is being satisfied, versus making a statement like, all the cool people are doing microservices and service meshes, therefore we must have a microservice architecture and a service mesh or whatever, and it’s ridiculous if we don’t. To me, that’s just not the way that we should approach technology.

[00:39:57.950] – Ethan
Matt, the premise of this show when we first hit you up to come on Day Two Cloud was you don’t need a service mesh, but you’re actually making a broader argument, which is it doesn’t matter what you’re evaluating, all the stuff that’s going on in the CNCF, all the Kubernetes stuff and the service mesh, whether you bolt that onto the mix or not, you’re saying take not just a step back, but a step BACK, and really understand the problems you’re trying to solve and why you would consider any of these technologies.

They’re not fully baked. It’s going to hurt a lot to get them implemented. And going back to your pain analogy, if it doesn’t fix problems you’ve got, what are you doing this for?

[00:40:33.940] – Matt
Right. So it’s funny, on Twitter I am so close to muting the phrase service mesh at this point, because it just kills me. I mean, the service mesh conversation has just gone completely off the rails because of vendor fighting and a whole bunch of stuff that I honestly don’t really want to go into. It’s just a big mess.

And here’s the thing, though, and this is why the conversation, particularly around service mesh, has gotten so insane: really the question to me is, do you have a microservice architecture or not? Right. And then to your point, we can backtrack further and say, should you adopt a microservice architecture in the first place? But where the conversation around service mesh has gone completely off the rails is that if you have a microservice architecture, and again, let’s try not to go too far backwards.

And let’s at least start with the assumption that as an organization, I have decided that for human scalability reasons, I’m going to adopt a microservice architecture. If I have made that assumption, there are problems that, in our short history of computing, there has never been a company that has not faced. Every company will face problems around networking, around observability. They just will; that is a fact of the matter.

Now, where service mesh has gone into a vendor mess of fighting is that it’s glossing over the fact that these problems will occur. So it’s not a question about should you or should you not have a service mesh. That’s the wrong conversation. The conversation is: you are going to have to deal with these problems. You can deal with them with a sidecar proxy. You can deal with them with a library. You can deal with them with some other set of proxies. There are different implementation details of how you can deal with them, but you will have to deal with them or your microservice architecture will not work. So the problem with the service mesh conversation is that it has gone into this crazy FUD fest, and it’s mostly related to sidecar proxies and fighting about them and bla bla bla. The conversation should more be about: we have a set of problems, and now we have to solve them. So let’s have a conversation about the different implementations that we can solve them with. But what you’ll quickly realize is that all of them are complicated.
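
Whether that logic lives in a sidecar proxy or in a library linked into the application, it looks roughly the same. Here is a minimal Go sketch of the library flavor, with a per-attempt timeout, bounded retries, and exponential backoff; the URL and limits are illustrative assumptions, and a production system would also want retry budgets and circuit breaking so retries don’t amplify an outage.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// callWithRetries shows, in library form, the kind of resilience logic a
// sidecar proxy would otherwise supply: a per-attempt timeout, a bounded
// number of retries, and exponential backoff between attempts.
func callWithRetries(url string, attempts int) (*http.Response, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	backoff := 100 * time.Millisecond

	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success or a non-retriable client error
		}
		if err == nil {
			resp.Body.Close()
			lastErr = fmt.Errorf("server error: %s", resp.Status)
		} else {
			lastErr = err
		}
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff before the next attempt
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	resp, err := callWithRetries("http://example.com/", 3)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```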

There’s no easy solution here, but this comes back to: should you be in that situation in the first place? And to me, that’s the more interesting conversation. And many organizations should not be in that situation in the first place, because they should not be using microservices. Right. Because microservices themselves are a giant mess of problems that range from networking to deployment to code management to mono-repo or poly-repo. I mean, we can go on and on and on.

[00:43:38.510] So again, yes, you’re right that you brought me on to talk about service mesh, but I’m having a more high level conversation, which is more about being pragmatic, about technology adoption.

[00:43:49.490] – Ethan
But one of the things that we’ve talked about on the show is complexity, though, because it does seem to never end. It’s funny how observability keeps coming up, the thing you need to solve the problems in your complex infrastructure. It’s like this chicken-and-egg problem. You don’t actually have it as well as you should, the observability aspect of it; you can’t even see what’s going on. And in your complicated infrastructure, it’s really complicated to solve the observability problem. That’s just one example of a key piece that seems to be missing here right now.

[00:44:16.220] – Matt
And complexity begets complexity. Yeah. You try to get your observability in place, but depending on how scalable your architecture is, the observability component also becomes very complicated. And then you have too much data, and then you lose your observability. And this is just crazy.

[00:44:35.480] – Ned
Turtles all the way down.

[00:44:36.590] – Ethan
Maybe there’s a way we can kind of compartmentalize this. Are we ending up in a world in IT where there’s just going to be more than one way to build your application?

[00:44:46.340] – Matt
I think so, yeah. So I think for right now, and again, if we fast forward 10 years from now, I believe that the vast majority of applications will be written on some type of functional platform. Again, we’ve talked about this during this podcast, but I just think that’s a no brainer, because as an application developer, if I ask myself, what do I want from an application platform?

I want to just have code and I want it to run and scale. I don’t care how it runs. Now, there’s always going to be some workload where maybe that functional platform doesn’t scale as well as it could, although over time I’m sure it will. So, no, I just do not think there’s going to be a one size fits all. And the part of it that makes it really complicated, as we’ve talked about also, is that you can’t let go of legacy.

And part of what I find, and again we come back to the service mesh conversation, is that we have lots of different service mesh implementations. If I make certain assumptions, if I assume that my application is only on Kubernetes and that there’s nothing else, that it’s a Kubernetes-only application, it is vastly simpler to make a service mesh implementation that makes a bunch of assumptions and does a bunch of things, and that’s great. And that may be a fantastic solution.

I’m not saying there’s anything wrong with that. Right. You’ve simplified the problem and you’ve made it much simpler and more opinionated, and therefore you can have a solution, by being more opinionated, that is more streamlined and simpler. And that’s great. And I applaud those solutions. You have a bunch of other solutions out there which have said, you know what, the world is a messy place. They call it brownfield. Right. You have 20 years of legacy that ranges from mainframes to virtual machines to Kubernetes.

And it’s a mess. And we have to have technology, particularly on the networking side, that’s going to bridge that. And by definition, that is insanely complicated. And there’s never going to be a product that has zero knobs that just magically figures out how to bridge that insane architecture. So, no, I just don’t think there is ever going to be one solution. I think that over time, more and more applications will move up the stack. We moved from racking; right now basically no one racks new servers, it’s just not done anymore. So we’ve moved beyond racking, we moved into virtual machines, and now we’re slowly moving into containers. So over time, and as old applications age out, I think inevitably certain technologies will be relegated to the providers. But that takes a long time. And I think that there’s probably a multi decade overlap in which we have different types of technologies that are running at once. So there will always be people that are bleeding edge.

The people that wrote their application for Lambda, and that works great for them? Fantastic. And then there are people that are trying to migrate two or three decades of old technology onto something new and have them talk to each other. And that’s where it gets hard, the talking to each other part. So the networking part, the observability part, that typically gets the ugliest, because I can have mainframes over here and I can have Kubernetes over here. And if they don’t have to talk to each other?

Sure. But the minute that I bring in networking and security and policy and all of those things, it gets really messy. And no, I don’t think there’s going to be one thing that will solve everyone’s problems. So it’s going to be a range of technologies. And again, this just comes back to: if an organization can do things with a simple technology or a single piece of technology, or be as opinionated as possible, they should do that.

That is what I recommend. But it’s hard to make any definitive statements about what people should and shouldn’t do. For every organization, really, my guiding principle is: what is the oldest and simplest way that I can get something done? Because typically most of the bugs will have been sorted out. Not that there won’t be other bugs, but many of the bugs will have been sorted out. And simple means fewer moving parts that I have to maintain, more commodity, more ways of outsourcing.

And so I would always encourage people to choose the simplest and most mature technologies until they decide that that does not work for them.

[00:49:22.610] – Ned
That’s a fantastic point, and I couldn’t agree more based on my experience with technology so far. Well, Matt, I think we have sadly run out of time. But this has been an amazing conversation and given me a lot to chew on personally. If folks want to know more about you, where can they follow you? Is there anything you’d like to plug?

[00:49:45.800] – Matt
I’ve got a website, mattklein123.dev, where I list my various communication channels. I do public office hours, so I’m happy to chat with people about these topics or anything else. I’m active on Twitter. Feel free to reach out on any of those venues.

[00:50:02.390] – Ned
Awesome Matt. Thank you so much for being a guest today on Day Two Cloud.

[00:50:06.110] – Matt
Thank you for having me.

[00:50:07.340] – Ned
Absolutely. And hey, listener virtual high fives to you for tuning in. If you have suggestions for future shows, we’d love to hear them. You can hit either of us up on Twitter at ecbanks or ned1313. Or you can fill out the form on my fancy website, nedinthecloud.com.

If you’ve got a cool cloud product or something that makes service mesh more palatable, why don’t you share it with our audience of IT professionals by becoming a Day Two Cloud sponsor? You’ll reach several thousand listeners, all of whom have problems to solve. Hey, maybe they’ve done the cost/pain analysis and determined that your product could fix their problems. But, you know, they’ll never know unless you tell them about your amazing solution. You can find out more at packetpushers.net/sponsorship. Until next time, just remember, the cloud is what happens while IT is making other plans.

Episode 82