
Day Two Cloud 174: Building Kubernetes Clusters

Episode 174


On today’s Day Two Cloud podcast we walk through how to build a Kubernetes cluster to support a container-based application. We cover issues such as what constitutes a minimum viable cluster, rolling your own vs. Kubernetes-as-a-service, managing multiple clusters, pros and cons of bare metal vs. running clusters in VMs, design recommendations and gotchas using a cloud service, and more.

Our guest is Michael Levan, an infrastructure engineer, consultant, content creator, and Pluralsight author. Michael is also the host of Kubernetes Unpacked, a new podcast in the Packet Pushers network.

Sponsor: Dell Livestream

Join the Packet Pushers and sponsor Dell Technologies on December 13th for a Livestream event on DPUs and the future of distributed infrastructure. We’ll have six short, informative sessions on topics including what network engineers need to know about DPUs, accelerating distributed workloads on DPUs, how VMware’s Project Monterey will affect infrastructure, and more. Sign up for this live, free event at packetpushers.net/livestream.

Show Links:

Kubernetes Unpacked – Packet Pushers

@TheNJDevOpsGuy – Michael Levan on Twitter

Michael Levan on LinkedIn

Kubernetes For Network Engineers – YouTube

Service Mesh And Ingress For Kubernetes – YouTube

Michael’s blog posts


[00:00:01.210] – Ethan
Join the Packet Pushers on December 13 for a livestream event on the future of DPUs and infrastructure. Sponsored by Dell Technologies, we’ll talk about how DPUs accelerate workloads, what network engineers need to know about DPUs, operational and business benefits, and more. Sign up now for this free livestream at packetpushers.net/livestream.

[00:00:28.990] – Ethan
Welcome to Day Two Cloud. We’ve got part one of a two-part series on deploying Kubernetes for you. Our guest today is Michael Levan. He’s a leader in Kubernetes and containerization. You can find out all about him at michaellevan.net. We had a long conversation about the building of clusters, and it felt like we just scratched the surface.

[00:00:53.320] – Ned
Oh, absolutely. There’s a reason that there’s a whole other podcast on the Packet Pushers network dedicated to Kubernetes: this is a broad and wide topic. But we tried to do just a general overview of what goes into building a cluster, both on premises and in the cloud, because they are slightly different.

[00:01:12.110] – Ethan
They are slightly different. And enjoy the wisdom of Michael Levan as he explains it to you. Michael Levan, welcome to the show. And I think this might be your first time on Day Two Cloud, so would you tell the nice people listening who you are and what you do?

[00:01:26.310] – Michael
Absolutely. Yeah. So I do everything in the Kubernetes and containerization space right now. And with that space, you get into Terraform and CI/CD and all the different clouds and all that good stuff. So anything from consulting to content creation, podcasting, writing books, speaking at conferences, and everything in between.

[00:01:47.060] – Ethan
And podcasting, yeah, I know. One of the reasons we wanted to have you on was so you could tell people about your new podcast that’s on the Packet Pushers network now, Kubernetes Unpacked. So tell us about it.

[00:01:57.070] – Michael
Absolutely. Yeah. So one of the things that I try to do with my content in general to differentiate it is I want it to be something that people can use in production. Instead of having a whiteboard with your architecture diagram, I want you to be able to slap one of my blog posts up there and be like, oh yeah, this is what we need to do. Very similar for the podcast. The people that come on the podcast are engineers, CTOs, VPs, everybody in between, essentially people that are actually utilizing Kubernetes in production, whether it’s in the cloud or on-prem, with the different tools that they’re using. So the whole idea of the podcast is to take an episode and dig into a problem that you have in production, or something that you’re trying to implement in production.

[00:02:40.990] – Ethan
All right. Lots and lots and lots of stuff going on there that you are sharing with Kubernetes unpacked. And they’re not long episodes. They’re not super long. You’ve been keeping them to roughly half an hour or so, I’ve noticed, right?

[00:02:54.220] – Michael
Yeah. And I think that’s kind of the whole idea. I don’t want it to be a super long episode because at the end of the day, there’s always going to be bite size problems that we’re all trying to figure out if we’re working on something in production. Chances are the thing that we’re trying to figure out is a five minute problem, but it takes us 20 hours to figure it out because that’s just the way that it goes in engineering. That’s kind of the goal there. The goal is to save you the 20 hours and pop it into 30 minutes for you.

[00:03:22.740] – Ethan
Perfect. Yeah, I’ve been enjoying that show a lot. It’s one of my must-listens in my podcast player, which is Overcast. Overcast percolated up new shows from Michael one after another, and I’m usually behind because you keep cranking them out, man. Well, this show, and actually the part two that will be coming up next week, is about deploying Kubernetes. And this is aimed at infrastructure engineers, those of you listening that tend to be builders. You put systems together, platforms that applications run on top of. This episode is for you, focused on Kubernetes, and we’re going to talk in this episode about building clusters. So, Michael, I think we need to start at the beginning here for people that maybe don’t know Kubernetes all that well. Would you define a Kubernetes cluster?

[00:04:07.840] – Michael
Yeah, so I think there are kind of two definitions in terms of what Kubernetes is. From a holistic perspective, it’s simply an orchestrator that takes a container and schedules it for you, puts it in a certain location, and then your containerized application runs. Kubernetes itself, from a clustering standpoint, primarily has two components. You have the control plane, which is where your API server runs, and where different types of containers can run depending on whether you’re running on-prem or in the cloud. You have your proxies and your schedulers and etcd, which is your database, and all of that good stuff. And then you have the worker nodes, which is where the actual containerized applications run. So, for example, if you have a pod running, it’s running on a worker node, not on the control plane.
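As a sketch of that split, on a kubeadm-style cluster you can see both halves directly with kubectl. Node names and versions here are made up, and the exact output varies by setup:

```shell
# Nodes and their roles: control plane vs. worker
kubectl get nodes
# NAME       STATUS   ROLES           VERSION
# cp-1       Ready    control-plane   v1.25.x
# worker-1   Ready    <none>          v1.25.x

# On kubeadm clusters, the control plane components (API server,
# scheduler, controller manager, etcd) run as pods in kube-system:
kubectl get pods -n kube-system
```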

[00:04:56.890] – Ned
Okay, so those worker nodes are members of the cluster. And would those worker nodes also be running portions of that control plane, or do you typically reserve special nodes for the control plane components?

[00:05:11.110] – Michael
Yeah, so the worker nodes never run the control plane components, but at the same time, the control plane components can be split up. To try to make this as least confusing as possible: when you’re running in the cloud, for example with a managed service like EKS on AWS, you’re not managing the control plane; it’s managed by the cloud for you. But let’s say you’re running on-prem. You can have etcd actually running as a pod on your control plane, or you can have an entirely separate server that’s running etcd. So there are a bunch of different ways: you can have your API server split up, you can have etcd split up, you can have all these components split up, or you can put them in one location on the control plane. It kind of depends on the architecture and ultimately how large the cluster is.

[00:06:06.410] – Ethan
Well, for people that just got overwhelmed by thinking about all the different architecture decisions that they might need to make, let’s go strip it down to the simplest. What would a minimum viable cluster look like?

[00:06:17.660] – Michael
So there are two answers: from a cloud perspective or from an on-prem perspective. In the cloud, luckily, there’s really nothing you have to do from a control plane perspective. It’s all abstracted away from you, all managed by the managed service that you’re using, whether it’s on GCP, Azure, AWS, wherever. They handle it for you. From an on-prem perspective, the easiest way to get up and running, in my opinion, is to use something like kubeadm, which is a bootstrapper; a lot of managed services like AKS use kubeadm in the background to bootstrap Kubernetes clusters and stuff like that. But when you do that, what happens is all of the control plane components go on the control plane. So etcd is there, the scheduler is there, the API server is there, it’s all under one control plane. And then of course, you can have multiple servers. To even take a step back for a second: from an infrastructure perspective, this is the type of architecture that we’ve seen in the field for years. You have multiple servers, the servers are running pieces of an application, or rather components, and you scale those servers out.

[00:07:28.300] – Michael
Maybe you need two, three, four, et cetera. So from an infrastructure architecture perspective, we’ve kind of always been doing this. We’re just now calling it Kubernetes and etcd and the API server.
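To make that kubeadm path concrete, the bootstrap looks roughly like this. It's a sketch, not a full production recipe; the load balancer endpoint and the token/hash placeholders are illustrative:

```shell
# On the first control plane node: bootstrap the cluster.
# --control-plane-endpoint points at a load balancer so more
# control planes can be added later (address is illustrative).
sudo kubeadm init \
  --control-plane-endpoint "k8s-lb.example.local:6443" \
  --upload-certs

# kubeadm prints join commands containing a token and CA cert hash.
# On each worker node:
sudo kubeadm join k8s-lb.example.local:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>

# Additional control plane nodes run the same join command plus
# --control-plane and the certificate key from --upload-certs.
```

You still need a container runtime installed on each node beforehand, and a CNI plugin applied afterwards; kubeadm deliberately leaves those choices to you.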

[00:07:39.900] – Ethan
Yeah. So if I’m at home and I want to lab up a minimum viable Kubernetes, something that would look a little more like production. I know there are tools like Minikube and all that; we can talk about those later. But if I want to emulate something more like I’d see in production, what would I be in for? Like three or four virtual machines with control plane and worker node functions split across them, something like that?

[00:08:04.930] – Michael
Yeah, so you always want to have at least two control planes. And then from a best practices standpoint, you want to have at least three worker nodes. So if you’re trying to run that at home, figure you’ve got five VMs. But if you want to just get it up and running to figure out how it works, here’s the thing: you could have one control plane and one worker node, and add more control planes and more worker nodes later. So if you don’t have the capacity on your server, or whatever the case may be, that’s perfectly fine. You can get it up and running on two VMs, literally. I have a laptop to the left of me running Hyper-V on a Windows 11 box, or Windows 10, and I have a bunch of control planes and a bunch of worker nodes there, and that literally emulates production. I mean, obviously you’re not running a laptop in production, but it’s the same path. It wouldn’t be a different path whether I’m running it on my laptop or in production. It would just be a matter of, number one, how you’re deploying it, and number two, how many control planes and worker nodes you have.

[00:09:00.180] – Michael
But yeah, from a best practices perspective, you always want to have at least two control planes just from a failover perspective and then at least three worker nodes.

[00:09:07.690] – Ethan
Sounds like if I can even run it on a laptop, I don’t need to have tons of CPU and RAM to pull this off.

[00:09:13.770] – Michael
Yeah. Right now, you need a minimum of two gigs of memory, and from a virtual CPU perspective, I think I’m giving them like two each for the VMs. Yeah.

[00:09:25.660] – Ned
So a decent laptop, but not overly taxing. And if you have a desktop or a little mini lab in your home, you can probably easily run it on that as well and get that sort of production-like experience. What about a typical cluster that’s actually deployed by an organization? You mentioned the guidance of two control plane servers and three worker nodes. Is that something you would scale larger if you were deploying a cluster to be used in production by an organization?

[00:09:59.890] – Michael
Funny enough, it’s going to depend on what applications you’re deploying and how many. So if you have, for example, three pods that you need to deploy, maybe a front end, back end, and middleware that you’re splitting up, having two to three worker nodes and two control planes in production is perfectly fine. But then you get into, for example, Mercedes-Benz: their tech team is running over a thousand Kubernetes clusters on OpenStack, and they obviously have multiple worker nodes and multiple control planes. So it’s definitely going to depend on what you’re deploying and the size. If you’ve got 100 to 200 pods to deploy, yeah, you’re probably going to want a significant number of worker nodes. But if you’re just getting to the point where your organization is moving to containerization and Kubernetes, chances are you’re probably only going to have a few, right?

[00:10:52.780] – Ned
So if you’re just dipping your toe in the water. You can start on the smaller side, and you can both scale up the individual nodes, whether it’s control plane or worker, or you can scale out as needed. But it’s probably most important to get the basics, the fundamentals right of how it’s set up, rather than focusing too much on number of servers and CPUs and Ram.

[00:11:13.330] – Michael
Yeah, absolutely. Yeah. Because at the end of the day, if you’re doing something on-prem, you can add more control planes and more worker nodes whenever you want, as long as you have the VMs available. From a cloud perspective, you literally just click the button that says autoscale, and it just does it for you; you don’t even have to do anything. It’s a pretty simple path, I would say, in terms of scaling out. The not-so-simple piece is actually getting it up and running, production ready, and, from a learning perspective, actually figuring out how it all works.

[00:11:48.560] – Ethan
Well, do you think I should roll my own on-prem, or should I use some kind of cloud managed service for Kubernetes?

[00:11:54.040] – Michael
There’s the $2 million question: it depends on how large your data center is. I think that in today’s world, usually running something in the cloud is probably going to be the path to go. However, it’s always going to depend on the circumstances. For example, I was speaking to a colleague a couple of days ago. He works in the defense space, and because of that, there are certain areas of the world, or of the US (maybe the world, not sure), where he needs to run on-prem Kubernetes clusters, because it’s just a regulatory thing. It’s a compliance and security thing because it’s the government sector; there are certain areas where he needs to run them on-prem, and then he has other clusters that are running in the cloud. So it’s definitely going to depend on your environment overall. But if you’re an organization that only needs to meet simple regulatory terms, simple compliance, maybe you don’t have a specific need to run it on-prem. A lot of the time, organizations are going to the cloud.

[00:13:05.660] – Ned
Okay, yeah. I mean, we’ve heard stories of cloud repatriation, where they started in the cloud and then realized, from an economic standpoint, maybe this is a steady-state sort of thing that we can just host ourselves, and that ends up being cheaper. It certainly seems like if you’re a startup, you start in the cloud, or if you’re just dipping a toe in, the cloud’s easier. But then as you scale up, you may hit that fulcrum point where it makes sense to bring it on-prem. But then you need all the engineering talent to successfully manage the control plane and the underlying components that stack together to build it.

[00:13:42.690] – Michael
Exactly. Yeah. Again, if you look at Mercedes, for example, they’re running over a thousand clusters. They probably can’t run that in the cloud because of quotas, because of limits that they may reach in regions and different VPCs, or whatever the case may be, because you always have an X amount of services that you can run in each region. Maybe they need to run it in certain places because of that. So it’s always going to depend, not on where, but on what you’re actually running and why you’re actually running it. But yeah, like you said, even from a startup perspective, no startup with five people wants to go deploy a data center. They want to just log into AWS or Azure and click a couple of buttons or run some Terraform code, and that’s it. Right?

[00:14:28.540] – Ned
And arguably, in some cases, they don’t even need Kubernetes at that stage. They could work with just deploying the containers through one of the other 150 different ways you can deploy containers in the cloud. I’ve lost count.

[00:14:43.540] – Michael
Yep. I actually have a client right now that I’m working with. They’re a small company. Funny enough, I worked for the CEO years ago, and he started a new company, and now I’m consulting for his new company. And we were having the discussion around, is the company ready for Kubernetes or not? They want to containerize everything, of course, but are they ready? And ultimately what it came down to was we went with ECS, Elastic Container Service in AWS, which literally does everything that Kubernetes does. It schedules containers out, it does all of the self-healing and all of that fun stuff. I would call it Kubernetes Lite, almost, where you don’t have to worry about all of the service meshes, you don’t have to worry about all the CNIs, you don’t have to worry about all the different ingress controllers and all that. It’s just running as a service in AWS. And that’s sometimes a better path for startups that don’t want to dive into the whole Kubernetes arena, because it’s complicated.

[00:15:44.070] – Ned
Yeah, you mentioned the Mercedes-Benz thing a few times, and thousands of clusters. I’ve seen the case study as well. I’m curious, do they treat those clusters as permanent, untouchable artifacts, or are they treating the clusters kind of like you would typically treat a pod, which is ephemeral and without state? Are they just spinning up clusters and then getting rid of them as quickly as they want, or do they have 5,000 steady-state clusters?

[00:16:16.390] – Michael
Yes. So it didn’t exactly say in the article, and I’ve Googled around as well to try to get a little more information, and it doesn’t say specifically. But what I imagine is this: in Kubernetes, or just in general, you usually have single tenancy and multi-tenancy. Now, single tenancy or multi-tenancy could be one user or one application, or it could be a group of users or a team or multiple applications. So what I imagine is that a nice chunk of those thousand Kubernetes clusters are: hey, we have five people on the dev team, they all need a Kubernetes cluster to be able to test something, we run our automation (I believe they’re using Cluster API for all the deployments), and boom, now you’ve got five new clusters. And once they’re done, they spin them down. So I’m assuming, but I imagine a lot of that is happening from a single-tenancy perspective.

[00:17:15.040] – Ethan
Okay, so I guess the guidance to whether or not I need more than one cluster comes down to isolating workloads to specific teams. Is that typically what you’re seeing?

[00:17:26.210] – Michael
It could be. It could definitely be one of the pieces. I see that a lot, where let’s say you have five dev teams, maybe some front end, some back end, middleware, et cetera. They may all have their own Kubernetes clusters, or each person on the team may have a Kubernetes cluster, from a single-tenancy perspective. You don’t see that a lot in smaller organizations, but you will see it a lot in larger organizations. Because if you’re testing new code, a new container image, a new implementation, they’re going to want to segregate that as much as possible, versus having one dev cluster where everybody’s just shooting container images out and everything’s not working the way that people are expecting. And from a production perspective, yeah, you’re going to see multiple clusters. And, I don’t know how accurate this is, but from what I’m seeing I’d imagine it’s pretty accurate: only 10% of organizations are using 50 or more Kubernetes clusters, so a lot of organizations are using between five and ten, if that.

[00:18:33.860] – Ethan
Okay. So this feels like it’s a resource contention concern. If you want to make sure that the dev team isn’t eating resources from the QA team or something, you build their own cluster for them. When we say build their own cluster, to me that means bare metal that’s now got all the Kubernetes components built on top of it. That CPU and that RAM are dedicated to that cluster, and so I don’t have to worry about resource contention in that scenario, right?

[00:19:00.990] – Michael
Yeah, it’s that, and then it’s a lot of segregating workloads. Let’s say you’ve got version 1.2 of a pod that’s running a back-end application. You want to be able to test version 1.3 of your new front-end application with version 1.2 of the back end. You don’t want another developer updating version 1.2 to 1.3 and 1.4, because you want to be able to test certain versions with other pieces of the application. Hence the whole idea of microservices: having the ability to not have dependencies on different pieces of the application. So a lot of that as well. It’s a lot of segregation from that perspective.

[00:19:40.020] – Ethan
From the reading I’m doing about microservices, I can say that 90% of companies that deploy microservices 100% regret having done so.

[00:19:49.840] – Michael
Yeah, it’s definitely a little bit tricky. And I also think that microservices are a funny thing, because the whole idea is to not have dependencies, but at the end of the day you always have a dependency on something. There’s always a dependency on something. So it’s a bit far-fetched in a sense. But I think that with microservices you can get most of the way there of not having dependencies on certain things. Again, though, you’re always going to have some dependency somewhere.

[00:20:19.010] – Ned
Yeah, it’s that eternal quest to avoid lock-in that we all talk about, whether it’s vendor lock-in or platform lock-in. But at some point you have to make a choice, and once you’ve made a choice, you’ve created some kind of dependency. We just have to accept that it’s a cost-benefit analysis, and sometimes the pros outweigh the cons of locking yourself in or coupling yourself to something. I want to pick apart something that you mentioned a few moments ago, in terms of clusters, versus what I thought of when I thought of segregating different workloads, which is namespaces, right? I can use namespaces to separate out different applications or different teams or whatever. But I think the point that you were making is that there are some common components on the cluster that are going to be managed and maintained, and maybe not everyone who’s using that cluster is ready for that upgrade or that change to the shared components. Is that usually why you would break it out to other clusters instead of having everything housed on a single massive cluster?

[00:21:21.120] – Michael
Yeah, I would say so. But the other thing, from a namespace segregation perspective, is, how can I put it, it’s not 100%. So, for example, let’s say you have five clients and you want to be able to segregate them. You wouldn’t just put each of them in their own namespace, because there are always ways to have pods talk to other pods in different namespaces. And from that perspective, you would have to set up rules like, this user can only talk to this namespace, or this team can only talk to this namespace, and then you get very deep from an RBAC perspective, which a lot of organizations don’t want to manage at that namespace level. And then you also have the pieces of resource constraints. You would have to set up requests and limits on each namespace and say, this namespace can only get this amount of CPU and memory, and that namespace can only get that amount of CPU and memory. So segregation, when it comes to namespaces, isn’t what you would think, in terms of, if you have this in this namespace, the pod over here can’t talk over there.

[00:22:24.600] – Michael
Oh, no, it definitely can. There’s definitely communication from a network perspective between namespaces, right?
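For the resource-constraint piece Michael mentions, the per-namespace setup looks something like a ResourceQuota. A minimal sketch, with a made-up namespace name and illustrative numbers:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a       # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "4"     # total CPU requested across all pods in team-a
    requests.memory: 8Gi
    limits.cpu: "8"       # total CPU limits across all pods
    limits.memory: 16Gi
```

Every namespace you segregate this way needs its own quota, which is exactly the management overhead being described.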

[00:22:30.990] – Ned
That’s one of the reasons that service mesh ends up getting deployed: to create that segmentation between not even just namespaces, but between deployments and different pods. Pods with this label can’t talk to pods with that label, and you can enforce that via a service mesh. Whereas with vanilla Kubernetes, it’s like, everybody can talk to everybody, right?
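Short of a full service mesh, a NetworkPolicy can close off that default any-to-any behavior, assuming your CNI plugin enforces policies (not all do). A minimal sketch for a hypothetical team-a namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: team-a
spec:
  podSelector: {}          # applies to every pod in team-a
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # allow traffic only from pods in team-a itself
```

The label-based pod-to-pod rules Ned describes use a podSelector with matchLabels instead of the empty catch-all selector.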

[00:22:52.110] – Michael
Exactly. Yeah. And that’s even where OPA, the Open Policy Agent, comes into play, if you deploy Gatekeeper; Gatekeeper is like the intermediary, so Kubernetes knows how to communicate with OPA. Things like that, where you deploy policies that say this namespace can’t talk to that namespace, or this namespace can’t deploy anything with the latest container image version, or tag rather. So yeah, there are definitely tools out there that help with that, for sure, 100%. But then it just comes down to the question of, is an organization ready for a service mesh? To be frank, a lot aren’t, because there’s a lot that goes into it. Same thing with OPA. You need to have at least a dedicated person implementing your policies for you, and then you have to have that person communicating with security and ensuring that you’re following proper compliance needs. So I don’t want to beat a dead horse, but a lot of these tools and third parties that you can implement in Kubernetes do help with those things. But then it’s a matter of, do you have the capacity, aka people, to actually work with it?
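As a sketch of that latest-tag policy, a Gatekeeper constraint might look like the following, assuming the disallowed-tags ConstraintTemplate from the community Gatekeeper policy library is already installed (the constraint name is made up):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowedTags        # kind defined by the library's template
metadata:
  name: no-latest-tag
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    tags: ["latest"]           # reject containers using the :latest tag
```

With this in place, the admission webhook rejects any pod whose container image uses the :latest tag.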

[00:23:59.210] – Ethan
We’re pausing the conversation for a quick word about the future of DPUs and IT infrastructure at the Packet Pushers livestream event on December 13, 2022. DPUs, or data processing units, are special-purpose hardware that run in servers to accelerate network, security, and storage functions. DPUs are creating new opportunities and challenges for distributed architectures. You can learn about DPUs and their impact on infrastructure and operations at our livestream event sponsored by Dell Technologies. The livestream features six technical sessions hosted by the Packet Pushers on topics including what network engineers need to know about DPUs, how Dell is integrating DPUs into hyperconverged infrastructure such as VxRail, and how VMware’s Project Monterey brings a software environment to DPUs so you can run essential virtualization, storage, security, and networking services. Sign up for this free live event taking place at packetpushers.net/livestream. We’ll see you on December 13, 2022. And one more time, it’s packetpushers.net/livestream. And now back to the podcast.

[00:24:59.440] – Ethan
Michael, I want to talk about some design recommendations for people that are building a cluster on-prem. I want to contrast on-prem with as-a-service Kubernetes. So let’s start with on-prem. Do you have some general cluster design recommendations for us?

[00:25:14.890] – Michael
Yeah, so I would say, getting started, you always want to have at least two control planes and at least three worker nodes. And from a resource perspective, how can I put it, it depends on how many workloads you’re deploying, because every pod that you deploy is going to take CPU and memory. So at the end of the day, you don’t want to overutilize and you also don’t want to underutilize. As for how many resources you need, how much memory, how much CPU, I don’t have a flat-out recommendation there, simply because it’s all going to depend on how many workloads you’re deploying. If you’re deploying ten pods or 100 pods, it’s going to be a significant difference in terms of how many resources you should have. Now, that’s just from a cluster perspective. Did you also want me to answer from an automation perspective, of how to actually get these clusters up and running?

[00:26:14.140] – Ethan
Well, let’s stick with the physical for a minute, because I actually have a follow-up question to that, since you were talking about basically hardware resources. Would I deploy my Kubernetes nodes as VMs or bare metal?

[00:26:30.110] – Michael
I would say right now, VMs. I think there would need to be an incredibly compelling reason to deploy five physical servers: two control planes, three worker nodes. To be honest, I’m trying to think in my head if I have a scenario where that would be needed. Unless you have some, you know...

[00:26:54.610] – Ethan
You know how a hypervisor is going to share resources, and so on. So if whatever workloads you’re deploying were severe resource hogs, I could see bare metal being more attractive, maybe. But operationally, yeah, I want it to be VMs all day long, right?

[00:27:14.230] – Michael
Yeah. So I would say my counterargument to that would be: if you have workloads consuming that amount of memory, CPU, or resources in general, you’ve got a problem in your code and you should go fix it. You definitely shouldn’t have that type of problem. You know what, thinking about it, the only scenario I would see is, for example, if you want to run etcd, which is the database for Kubernetes, on a separate server for some type of regulatory or compliance need, maybe to protect that data or to put that data in another area outside of the servers that are running the other pieces, like the API server and the scheduler, and then the pods. But it would have to be a real specific need, and I can’t even think of the compliance items or regulatory names that would require something like that. I’m throwing a guess out there. But yeah, I would say 99.99% of the time, if you want to segregate, let’s say, etcd or other control plane components or whatever, you would just run them on different VMs.

[00:28:34.240] – Ned
I could see an argument being made for bare metal in terms of efficiency, if I don’t want to pay the VM tax, the overhead of having the hypervisor, or if I’m concerned about noisy neighbors. Usually hypervisors are pretty good about constraining how much one virtual machine can impact another, but if you’re really concerned about that, I can see the bare metal provisioning portion of it. Then again, that assumes you have a really robust provisioning process for your bare metal machines, and that you’re able to automate a lot of the actions that are required for building and maintaining them. And definitely not everybody has that.

[00:29:15.190] – Michael
Yeah, and actually, as you were just saying that, another piece came up in my head. There could even be some type of compliance need where, for example, from a government perspective, you might have servers that aren’t reaching out to the internet 100% of the time, so they aren’t going out and pulling updates. You’re setting up specific times where, hey, between 1:00 a.m. and 2:00 a.m., we’re turning on egress and ingress so we can update the container images from wherever our registry is. So maybe from that perspective you might want to have X amount of worker nodes running on servers that don’t have outbound connectivity to the internet. I have heard that usually referred to as the edge; if anybody has heard of edge computing, I see that a lot, where you have certain Kubernetes clusters that aren’t reaching out to the internet and aren’t pulling down updates. It’s only happening at specific times and specific intervals. So maybe something like that you might want to run on a separate server, or whatever the case may be.

[00:30:22.710] – Michael
But then I guess the argument there is like, well, just set up maybe different VLANs or whatever the case may be to segregate that traffic.

[00:30:33.040] – Ned
Now, the actual nodes, the servers that make up my cluster: should they all have identical hardware, or is it okay to mix and match? Or can some nodes in my cluster have specialized hardware or something like that?

[00:30:48.660] – Michael
Yeah, so let's say, for example, you have an application that's running in Kubernetes that's very graphically intense, right? Maybe you might want to have servers that have specific graphics cards. Or maybe if you have applications that are very memory heavy, like maybe you're containerizing Java applications that are very memory heavy, you would have separate worker nodes that you would put those pods on to utilize a bunch of memory. So for example, let's say you had five worker nodes. Two of them were for pods, or containerized applications, that are very GPU or graphics centric. And then the other three are maybe memory centric, for those applications that need more memory. So you can absolutely mix and match like that. And even in the cloud you'll start to see that. I forget if it was DigitalOcean or Linode, maybe both. But when you're setting up your worker nodes, you have options there for high memory, high CPU, graphics, et cetera.

[00:31:44.770] – Ethan
And then Kubernetes as a scheduler will know enough to know which node it should put a particular container on.

[00:31:51.330] – Michael
You would set that up with something like labels. Okay, so you would tell the pods what nodes they should be running on. You would set up those constraints.
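
As an illustrative sketch of the label-based constraints Michael describes (the node label, pod name, and image here are hypothetical), a pod can be pinned to labeled nodes with a `nodeSelector`:

```yaml
# First label a worker node, e.g.: kubectl label node worker-2 workload=gpu
apiVersion: v1
kind: Pod
metadata:
  name: render-worker
spec:
  # Schedule this pod only onto nodes carrying the matching label
  nodeSelector:
    workload: gpu
  containers:
    - name: renderer
      image: example.com/renderer:1.0   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1   # assumes the NVIDIA device plugin is installed
```

Taints and tolerations can additionally keep general-purpose pods off the specialized nodes, but a selector like this is the simplest form of the constraint.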

[00:32:01.610] – Ethan
Well, let's talk about the automation component then, for if I want to automate the building of a Kubernetes cluster. I mean, the manual process feels like you build a Linux box, you throw Kubernetes on, and then you add it to the cluster, kind of a thing. But then you mentioned earlier in the show a way to automate some of this. Automation, I'm assuming, is going to be better, right?

[00:32:28.610] – Michael
Yeah. So, I mean, here's the thing. In today's world, nobody is just going to bootstrap a Kubernetes cluster fully manually, setting up the certificate authorities, setting up the scheduler, setting up etcd specifically. They're going to meet somewhere in the middle, which is a little bit of automation but also a little bit of a raw deployment, and that would be kubeadm. So you would use kubeadm to bootstrap the actual process of getting Kubernetes up and running. Now, in terms of the automation piece of building the infrastructure, it's going to be the same thing as any other server environment. Maybe you're using Terraform, maybe you're using Ansible. With Kubernetes specifically, not to go too deep, and I can of course, it's declarative. The API is declarative. So you have certain tools like Cluster API, for example, that is like Terraform: it performs CRUD operations, but it does it in a declarative fashion, so it's more native to Kubernetes. So there are different tools and stuff like that that you can use. There are also different clients that you can use. For example, you can use maybe Pulumi to deploy a Kubernetes cluster.
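
A minimal sketch of that kubeadm middle ground: a config file you would pass to `kubeadm init --config`, with the version, endpoint, and CIDR values below being placeholder assumptions rather than recommendations:

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.25.0                        # placeholder version
controlPlaneEndpoint: "cp.example.internal:6443"  # hypothetical load-balanced endpoint
networking:
  # Must match the pod CIDR your chosen CNI expects
  podSubnet: 10.244.0.0/16
```

From this one file, kubeadm generates the certificate authorities and stands up etcd, the API server, the scheduler, and the controller manager, which is exactly the tedium that bootstrapping by hand would involve.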

[00:33:44.210] – Ethan
Okay, very good. So let's move from the on-premises approach to: I'm using a service, some kind of a cloud service, to build my Kubernetes cluster. Design recommendations there? You made it sound earlier like it was just hit the easy button.

[00:33:59.060] – Michael
I would say that it definitely feels like hit the easy button, in most cases, because, number one, again, thinking about Kubernetes from a high-level perspective, you have the control plane and you have the worker nodes. So half of that, when you run in the cloud, is taken away from you and you don't have to do anything. The control plane is completely abstracted away from you. Now, there are certain things that you still want to do with the control plane. For example, you still want to pull audit logs, which you can do. You still want to maybe set up the metrics endpoint so you can consume your observability metrics with Prometheus or something like that. But the actual management of your control plane is completely abstracted away from you. So you only have to worry about your worker nodes. Now, from a scalability perspective, you literally just click the button that says Auto Scale: for me, max is ten, minimum is two, figure it out as I deploy pods. But then there's also the piece that those worker nodes still need updates. So if you're running Ubuntu boxes, you still have to update them.
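
That "min two, max ten, figure it out" knob looks roughly like this in an eksctl cluster config for EKS (cluster name, region, and instance type are hypothetical):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster       # hypothetical
  region: us-east-1
managedNodeGroups:
  - name: general
    instanceType: m5.large
    minSize: 2             # floor the autoscaler can shrink to
    maxSize: 10            # ceiling as pods demand capacity
    desiredCapacity: 2
```

The other managed services expose the same idea: a node pool with min and max counts, while the control plane itself never appears in the config at all.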

[00:35:04.650] – Michael
You still got to make sure that they’re secure. You still got to do all that stuff. At some point you’re going to have to update the Kubernetes API. You got to do that yourself. But it’s arguably way easier than doing it on Prem, for example.

[00:35:20.100] – Ethan
Oh, that's interesting. I would have guessed that the provider was managing some of those things for me. Like, I patch Ubuntu boxes a lot. It's super annoying. I would think Kubernetes as a service would be handling that for me, but not so, eh?

[00:35:32.670] – Michael
Yeah, but there's also a way to move away from that if you want to. I like to call it serverless Kubernetes, in a sense. So, for example, in AWS there's something called Fargate profiles. And what Fargate profiles do is, instead of having to run EC2 instances for your worker nodes, you don't run EC2 instances anymore. So there are literally no nodes that you're managing, because the Fargate profile does it in a serverless fashion. So at that point there are no servers anywhere that you are managing; it's all abstracted away from you. So there are certain ways to get around it. There's also something called ACI bursting in Azure. So in Azure Kubernetes Service, you can actually push pods out to Azure Container Instances. There are Azure Container Apps coming out now that I definitely don't think are ready for production yet, but there are certain pieces there that you can utilize. So for example, if you want to test an application, you can use Azure Container Apps instead of deploying a Kubernetes cluster. From a single-tenancy perspective in the cloud, there are different ways to get around the same things that you would have to do on prem.
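
As a hedged sketch of the Fargate approach (names are placeholders), an eksctl config can declare a Fargate profile so that pods in a given namespace never land on EC2 worker nodes:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster     # hypothetical
  region: us-east-1
fargateProfiles:
  - name: serverless-apps
    selectors:
      # Any pod scheduled into this namespace runs on Fargate,
      # so there is no node OS for you to patch
      - namespace: apps
```

Pods outside the selected namespaces still schedule onto whatever regular node groups exist, so the two models can coexist in one cluster.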

[00:36:50.590] – Ethan
Okay, as I'm standing up my Kubernetes cluster in the cloud, are there cost concerns or other gotchas that could impact my bill? Of course.

[00:37:07.460] – Michael
I think it's actually funny that we're talking about this, because I'm sure we've all seen it on social media and stuff. There's a lot of talk and content going around right now where people are questioning the cloud and what it has given us. I forget which company it was, was it Basecamp? But somebody recently put out a blog post saying, we're moving all of our workloads off the cloud; it didn't give us the benefits that we thought it was going to give us. So we're seeing that a lot, too. But yeah, I think that's the uncomfortable reality with the cloud: you are going to pay a premium. You can figure out what that premium is; there are the cost calculators and the billing that you can see, and the forecasting and all that. But yes, absolutely. On the flip side, if you don't, you're going to double or triple your team size to run it on prem, right? And if we look at the average salary, for example, around New York City for a DevOps or a platform engineer, you're looking at $160,000 a year and up.

[00:38:16.630] – Michael
So if you're doubling and tripling the team that you would need to run all the on-prem stuff, because you need a storage person, you need a network person, you need a server person, you need an OS person, versus pushing that into the cloud, you're saving money from that perspective. But I guess then, on the flip side, you still have to hire people that are good with Kubernetes and that are experts, well, no such thing as experts, but you need to hire people that are good engineers. So you still kind of find yourself in the middle trying to figure that out. But again, I think that would depend on the size of the team and the size of the applications that you're deploying. Got you.

[00:38:52.770] – Ned
Yeah, that's what we talked about earlier, that sort of cost-benefit analysis you have to do. And at a certain scale, maybe it does flip from one side to the other, but then you have the whole migration concern, which might be a little easier since you're using Kubernetes, so it should be fairly standard. Speaking of which, we have all these different providers you've mentioned: AWS has EKS, you have Azure with AKS, and then you have Google, the OG, the original, with GKE. Does it matter which provider I pick? Or are they all basically the same?

[00:39:26.880] – Michael
They're all doing the same thing. I think that there are some little differences. Like, for example, with Azure: let's say I take an Istio service mesh, and I have the commands, or the automation, that I need to use to deploy it. And I actually tested this a couple of weeks ago. I took the same exact approach, deployed it on EKS, it didn't work. Deployed it on AKS, it worked. Same approach, same version of Istio, all of that. Point being, the Kubernetes API is the Kubernetes API. It doesn't change. Well, it does, obviously, as it gets updated and stuff like that, but it's almost static. But because it's open source, once the cloud provider gets a hold of it, they do little tweaks and little things on the back end that we don't know about that alter it, in a sense. For example, in Azure, right off the bat, the metrics API endpoint on the control plane is exposed. In EKS, it's not. So you need to deploy it and get that pod up and running yourself. So they are doing little tweaks on the back end that you kind of don't see unless you're really in it.

[00:40:42.370] – Michael
With that being said, I don't know, it's all the same stuff. Kubernetes is Kubernetes. It's doing a job. Maybe you'll have to, like I said, deploy the metrics endpoint in EKS, but you don't have to in AKS. Little things like that. I think another big thing is, I often notice that the Kubernetes API versions that are available in, for example, EKS are always behind compared to AKS. AKS always has newer ones compared to EKS. So little things like that, you'll have to figure out if it actually matters to you. But overall, from a general high-level perspective, it's all doing the same stuff. It's scheduling your pods and pushing them out. Got you. And the cost differences between providers are pretty much pennies, relatively speaking, right?

[00:41:34.460] – Ethan
Well, for the infrastructure people out there, we think in terms of storage and networking and security and these kinds of things. So I want to talk first about storage. What is my Kubernetes cluster actually storing for me?

[00:41:48.810] – Michael
Sure. So there are going to be two parts to this. There is the actual data of Kubernetes. So, for example, let's say you deploy a pod and it has a higher-level controller. Kubernetes is made up of different controllers: for your ingress, for your pods, for your volumes, for everything. It's all an API, and each controller has its job. So, for example, you have the deployment controller. Its job is to look at pods and say, does my current state match my desired state? It's declarative. But how does it know the data, how the pod looks, the CPU that's running, et cetera? That's being stored in etcd. So etcd is your database. It's storing the state of your Kubernetes cluster, how it looks. So that's one piece of the storage. The second piece of the storage is, for example, let's say you have a pod that needs a hard drive. In Kubernetes, it's called a volume. So from that perspective, you would then spin up a volume. Now, that volume could be ephemeral or not. If it's ephemeral, no big deal. If you need to store that data somewhere, then you use something called a CSI, or container storage interface.

[00:43:12.180] – Michael
There are multiple container storage interfaces. AWS has one, Azure has one. I could literally go to Micro Center down the street, get a Synology NAS for $200, and connect my Kubernetes volumes to that Synology NAS. A lot of the SAN providers now have one. A lot of the overall storage providers all have CSIs. So those are the two pieces of your storage, at a high level, of course.
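
The CSI plumbing usually surfaces to the cluster user as a PersistentVolumeClaim against a StorageClass; this is a hedged sketch, with the class name `ebs-sc` standing in for whatever CSI-backed StorageClass the cluster actually offers:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  # Hypothetical StorageClass wired to a CSI driver (EBS, Azure Disk,
  # a Synology NAS, a SAN, etc.); the pod never sees the difference
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 10Gi
```

A pod then mounts the claim by name, and the CSI driver behind the StorageClass provisions and attaches the actual disk.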

[00:43:37.310] – Ethan
What about things like app state and databases and all of that? Is that going to be in that second tier there? Like, you were describing a Synology box that I could just be running iSCSI to or something, right?

[00:43:50.710] – Michael
So that would be the difference between stateless applications and stateful applications. So you can run stateful applications on Kubernetes. You can also run stateless. From a database perspective, I don't know if this debate is still going on, I haven't seen it in a long time, but there was a heavy debate for a long time about whether or not you can run databases on Kubernetes. I think the reality is, yes, you absolutely can. Because if we think about it, what does a database do? It stores data. So you have two options. You can either run, like, a MySQL pod, for example, on Kubernetes, scale it out, and have it look at volumes that are sitting elsewhere, on a Synology, which you probably don't want to do in production, but it's just an example. Or you can have pods pointing to different database back ends. So, for example, I can have MySQL running, or whatever the case may be, pointing to RDS in AWS. So I can have it pointing to database back ends in some service in the cloud, or something somewhere. So, yeah, there are definitely several options from that perspective. But then you have your stateless and your stateful applications, where even if you have a serverless application, you could still have volumes attached to it.
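
One way to sketch "pods pointing to a database back end elsewhere" is an ExternalName Service, so application pods talk to a stable in-cluster name while DNS hands back the managed database endpoint; the service name and RDS hostname here are hypothetical:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-db
spec:
  type: ExternalName
  # Pods connect to orders-db:3306 as if it were in-cluster;
  # cluster DNS resolves it to the external (e.g. RDS) endpoint
  externalName: mydb.abc123.us-east-1.rds.amazonaws.com
```

If the endpoint ever moves, only this one Service changes, not every pod's configuration.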

[00:45:10.930] – Michael
So, like, from a storage perspective, those are still there. Because stateless is, for example, when you open up a search page, you type in your search, and then you close it out. And then if you reopen it, it's fresh again, right? That's a stateless application. It doesn't store your data. Gmail is stateful; it stores everything. So you can have both styles of applications in Kubernetes as well. But those don't really have anything to do with your back-end storage. You can have back-end storage for both stateful and stateless applications.

[00:45:48.390] – Ethan
Then where do the containers themselves live? I assume that's in some kind of a repo that is not in the cluster.

[00:45:55.990] – Michael
In terms of the containers that are running.

[00:45:57.790] – Ethan
You mean the containers that are running, right. I say, hey, Kubernetes, I want you to spin up this container in this pod.

[00:46:03.000] – Michael
Yeah, so containers live inside of pods, and you can have one container per pod, or you can have multiple containers per pod, which is typically called sidecar containers.

[00:46:16.910] – Ethan
And physically, those containers are drawn from where?

[00:46:21.790] – Michael
Okay, got it. Those containers are going to be running based off of a container image, and your container image is going to be living in some type of registry. So, for example, you can store them in, like, JFrog Artifactory. You can store them in Docker Hub. You can store them in ECR or ACR, which are the AWS and Azure equivalents, for your container images. And then you would pull them from there. And then inside of your Kubernetes manifest, there's a spec. And under your spec, it says containers. And right there is where you specify the container image, the version, the port, any volumes that you need, all that fun stuff.
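
That spec-with-containers shape Michael walks through looks like this in a minimal manifest (pod name and mount path are illustrative; `nginx:1.25` is just a public image pulled from Docker Hub when no registry host is given):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      # Image reference: registry (implied Docker Hub), name, and version tag
      image: nginx:1.25
      ports:
        - containerPort: 80
      volumeMounts:
        - name: cache
          mountPath: /var/cache/nginx
  volumes:
    - name: cache
      emptyDir: {}   # ephemeral volume, as discussed earlier
```

To pull from a private registry like Artifactory, ECR, or ACR, the image field carries the registry hostname and the pod references an image pull secret, but the structure stays the same.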

[00:46:59.770] – Ethan
The big point there is those containers do not live within the cluster; you are pulling them out of some sort of repo. Yeah, absolutely.

[00:47:07.300] – Michael
Yeah. Those artifacts, or container images rather, are stored somewhere.

[00:47:13.540] – Ethan
Go ahead. Sorry.

[00:47:14.980] – Ned
If I remember correctly, the images can be cached on each worker node, so it doesn't have to pull that image every time you want to spin up an identical container from an image. It can cache that image for a certain amount of time. But I think in the spec, you can also tell it either don't cache the image, or don't use the cached copy, pull afresh every time. Right?

[00:47:33.570] – Michael
Yeah. So there's something in your Kubernetes manifest called an image pull policy. And you can set it up as, like, if it doesn't exist. So I think it's literally called IfNotPresent, Always, or Never. So, for example, the default is if it doesn't exist, but you can set it up in your container spec to say Always, so it's constantly pulling down, or Never. So if you say Never, for example, and then you try to, let's say, deploy, and the image doesn't exist locally, your deployment is going to fail.
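
In the container spec, that policy is a single field; this snippet (pod name illustrative) shows where it sits:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pull-demo
spec:
  containers:
    - name: app
      image: nginx:1.25
      # IfNotPresent: use the node's cached copy if it has one (default for tagged images)
      # Always:       contact the registry on every start
      # Never:        only use a local image; fail if it is not cached
      imagePullPolicy: Always
```

Note that for the `:latest` tag the default flips to Always, which is one reason pinning explicit version tags is the safer habit.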

[00:48:07.690] – Ethan
Now networking, we don’t have time to get into how Kubernetes cluster networking works. That’s not what this question is. But for infrastructure people who manage networks, is there something they would be doing to their networking environment to prepare for a Kubernetes cluster coming in?

[00:48:22.240] – Michael
Yeah. So I think the really good thing about networking and Kubernetes is that it's the same as everywhere else. Networking is networking. It's not some magic that Kubernetes is doing. There are IP addresses, and there are CIDRs and ports and everything else. Load balancers. If you want to prepare for networking in Kubernetes, look at something called your CNI, or container network interface. It's how you get your local pod networking running properly. So there are multiple different CNIs that you can use. Some are security-focused, some are out-of-the-box, press-the-easy-button, and some are in between. And then at that point is where you can set up things like your pod network, which is your CIDR range, literally. It's like your subnet. And then you set that up, and then you have certain containers that are running on certain ports, and yada yada. So, point being, networking isn't any different from a what-is-networking perspective in Kubernetes. It's all the same thing. If you've gone through your CCNA or your Network+, you'll be just fine.

[00:49:24.490] – Ethan
Okay, so I might need to assign some address blocks, I might need to build a few VLANs, these kinds of things. Nothing unusual, though; there's nothing really overly strange going on there.

[00:49:33.490] – Michael
Okay, exactly. Yeah. The biggest thing, I think, from a networking perspective, is understanding that pod-to-pod communication is open; it's not encrypted. So if you have a need to change that, that's when something like a service mesh would come into play. That's the only odd gotcha when it comes to networking and Kubernetes: it's an open house. Everybody can talk to everybody.
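
Closing down that open house usually starts with a default-deny NetworkPolicy, which the CNI must support to enforce (the namespace name here is hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod
spec:
  podSelector: {}   # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules are listed, so nothing is allowed;
  # further policies then open only the flows each app needs
```

Note this controls which pods may talk, not encryption; encrypting the traffic itself is where the service mesh or CNI-level encryption comes in.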

[00:49:59.670] – Ethan
I suppose you could lean on a switch that could do MACsec or something and do layer 2 encryption that way; that might cover some of those bases.

[00:50:07.890] – Michael
Yeah. And that's where something like a service mesh would come into play. Or there are certain CNIs that can do it; I believe Calico has WireGuard, and you can set up pod-to-pod communication that's encrypted. Okay. And then you have load balancers and services and everything in the OSI model; you've got the whole stack.
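
If Calico is the CNI, the WireGuard option Michael mentions is, to the best of my understanding, a single flag on Calico's Felix configuration:

```yaml
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  # Encrypts inter-node pod traffic with WireGuard;
  # requires WireGuard support in the node kernels
  wireguardEnabled: true
```

Applied with calicoctl (or kubectl, depending on how Calico is installed), this transparently encrypts pod traffic between nodes without touching the applications.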

[00:50:28.840] – Ethan
So now that we've touched on a little bit of security, we don't have time to get into the whole Kubernetes security model either. But one kind of building-a-cluster-related question: once a cluster is initially stood up, is there a default security posture? Some default, I don't know, Kubernetes namespaces or other posture-related things that we should be changing right away?

[00:50:53.730] – Ethan
The answer is yes.

[00:50:54.790] – Michael
To all of it. Kubernetes out of the box is arguably one of the most unsecured platforms out there. So it's funny, a lot of the work that I have coming down the pipe right now is all Kubernetes-security-centric. And the reason why is because people are just trying to figure out Kubernetes. And once they get it up and running, they wipe the sweat off their forehead and they go home, and little do they realize that the entire thing is unsecured, from who can access the cluster, to how the pods are running, to how inbound and outbound traffic is working. Now, literally, we could talk about this for, I'm not even exaggerating, five hours. The course that I just did on Kubernetes security is five hours long. But to try not to go five hours, these are the three things that you should think about. RBAC, your role-based access control: who can do what, from an authentication and authorization perspective. Extremely important. The second thing is network policies: your ingress, your egress, which pods can talk to which pods, which pods can communicate outbound, inbound, et cetera.

[00:52:08.470] – Michael
And policy management. So your OPAs out there, and Kyverno, K-Y-V-E-R-N-O, I don't know how to pronounce it, and your policy managers and stuff in general. So those are the three things that you should heavily think about when it comes to Kubernetes security.
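
The RBAC piece, the first of Michael's three, comes down to Role and RoleBinding objects like this sketch (namespace and user name are hypothetical):

```yaml
# Grant read-only access to pods in one namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
  - apiGroups: [""]          # "" is the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
# Bind that role to a specific user
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: dev
subjects:
  - kind: User
    name: jane               # hypothetical user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

ClusterRole and ClusterRoleBinding are the cluster-wide equivalents; the principle either way is to grant the narrowest verbs and resources each identity actually needs.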

[00:52:26.590] – Ethan
Okay, that was a lot. And it feels like we really did just see the tip of the iceberg there when it comes to that conversation. So, I didn't know. That is not actually what I expected you to say. I thought it was going to be, you need to tweak a little bit here and a little bit there, and it's not. Oh yeah, it defaults to open and everybody can talk to everybody. Whoa, it's a party.

[00:52:43.680] – Michael
Yeah, it’s open house.

[00:52:45.040] – Ethan
Okay, good to know. Good to know. Michael, great conversation. We’re going to do a part two here, but for those that are leaving us today, where can they find out more about you?

[00:52:56.740] – Michael
Yes, so I'm pretty heavily on LinkedIn, so if you just search Michael Levan, you can find me on LinkedIn. Twitter is @TheNJDevOpsGuy, and GitHub is AdminTurnedDevOps; you can find that handle there. And I do a lot of my open blogging on dev.to.

[00:53:16.180] – Ethan
Great stuff. Thank you very much for joining us on Day Two Cloud today. And again, if you're listening, we're going to part two of this discussion in next week's episode. We're going to move from building Kubernetes clusters, which was the focus of today's conversation, to managing them. So we built the thing, and we're going to live with the thing in part two. And virtual high fives to you for tuning in. You are clearly an awesome human with impeccable taste in podcasts. And as a reminder, if you've been looking for more podcasts for your earbuds, remember that Michael hosts the Kubernetes Unpacked podcast, which is also part of the Packet Pushers podcast network. You can find Kubernetes Unpacked wherever you listen to podcasts and subscribe; just search for it and it will pop right up. If you have any suggestions for future Day Two Cloud episodes, Ned and I would love to hear them. You can hit us up on Twitter at Day Two Cloud Show, or go to Multicloud IO and fill out the request form. One other bit of housekeeping for you: Packet Pushers has a weekly newsletter, Human Infrastructure Magazine. HIM is loaded with the very best stuff we found on the Internet, plus our own feature articles and commentary.

[00:54:11.670] – Ethan
It is free. It doesn't suck, we promise. And you can get the next issue via packetpushers.net/newsletter. And until then, just remember: cloud is what happens while IT is making other plans.
