
Day Two Cloud 137: Automating Windows Container Builds

Episode 137


Today’s Day Two Cloud gets into the weeds of a real-world project built around Windows containers. Windows containers? Yes, the goal was to run a particular job and deploy it on Kubernetes using Azure Kubernetes Service (AKS).

Our guest is Sai Gunaranjan, Principal Architect for a large healthcare company.

We discuss:

  • Why Windows containers instead of Linux
  • The pipeline Sai had in place before and what the team replaced it with
  • The build creation process and pipeline
  • Challenges with and benefits of the build process
  • Why the company eventually went back to Windows VMs
  • Lessons learned

Takeaways:

1. Windows Containers are real 🙂 and they work

2. https://github.com/actions/virtual-environments/

Sponsor: StrongDM

StrongDM is secure infrastructure access for the modern stack. StrongDM proxies connections between your infrastructure and Sysadmins, giving your IT team auditable, policy-driven, IaC-configurable access to whatever they need, wherever they are. Find out more at StrongDM.com/packetpushers.

Tech Bytes: Singtel

In this Tech Bytes episode we welcome back Singtel to discuss real-life examples of WAN problems from Singtel customers and how Singtel helped solve them, including deploying broadband Internet to augment MPLS circuits and getting better performance visibility into the WAN.

Show Links:

@asgr86 – Sai Gunaranjan on Twitter

Asgr.medium.com – Sai’s blog

Sai on LinkedIn

GitHub Actions Virtual Environments – GitHub

Day Two Cloud 128: DevOps’ing All The Things – Packet Pushers

Transcript:

[00:00:01.210] – Ethan
Sponsor StrongDM is secure infrastructure access for the modern stack. StrongDM proxies connections between your infrastructure and sysadmins, giving your IT team auditable, policy-driven, IaC-configurable access to whatever they need, wherever they are. Find out more at StrongDM.com slash PacketPushers.

[00:00:36.030] – Ned
Welcome to Day Two Cloud. Today we’re going to be talking about containers, but not just any containers. Windows Containers. That’s right. What did you think of all that, Ethan?

[00:00:46.740] – Ethan
Oh, come on, man. It’s not even fair. This is not my thing. You know how you feel when I get into some, I don’t know, BGP conversation with someone and you’re like, oh, networking fun.

[00:00:56.030] – Ethan
That was a little bit of me in this one, because this Windows Containers conversation, Ned, got super nerdy and deep, talking about pipelines, the challenges of deploying various things in containers, the shortcomings thereof, and why ultimately containers maybe weren’t the right answer for our guest.

[00:01:11.820] – Ned
Yeah, and our guest is Sai Gunaranjan. He’s a principal architect at a major healthcare company. And if you think you hear planes taking off, that’s not his mind taking off. He actually does live near an airport, but his mind also was just going a million miles an hour and we were struggling to keep up. So enjoy this conversation with Sai. Well, Sai, welcome to Day Two Cloud. In a few words, why don’t you tell the good folks out there who you are and what you do?

[00:01:39.060] – Sai
Hi, Ned. Hi Ethan. I’m Sai Gunaranjan. I’m a principal architect. I’m part of the Cloud platform team that’s responsible for Azure, Azure DevOps and GitHub platform within the enterprise.

[00:01:48.160] – Ned
Okay, got you. So Azure, building things out, working with Azure DevOps. And the reason we wanted to have you on the show is not just because you’re working with all these cool technologies, but because you’re working with Windows Containers, which is to me just wild. The reason I even heard about this was because we had Kyler Middleton on show 128, which we’ll include a link to, and she was talking about DevOps’ing all the things, and she mentioned you’re using Windows Containers, like, for real. And once I recovered from my shock, I knew we had to have you on the show. So can you set this up for me, Sai? What was the original problem that Windows Containers was meant to solve?

[00:02:27.970] – Sai
The problem statement goes: by design, everything in public cloud is exposed to the Internet, the storage accounts, key vaults, container registries, and event hubs, and so on, which is not a good configuration to have. So all of these services are made private by either service endpoints or private endpoints, which then makes it extremely difficult to run DevOps on them, because the public agents that Microsoft hosts on either GitHub or Azure DevOps don’t have access to these resources. So we started to go down the route of having our own private DevOps agents. That works very well. But then the challenge is a traditional VM-based DevOps or build agent becomes difficult to scale, and it actually becomes difficult to maintain. So we actually wanted to look at containers as an alternative to run our DevOps jobs. So private containers within our own environment, within our own network, which then connect back into Azure DevOps or GitHub, and then they take up the build tasks and they actually run against the private environments that we have. The other problem statement along with this is that it gets very expensive when you start to have full scale VMs running all the time and waiting for jobs.

[00:03:30.450] – Sai
So having containers was more kind of a more reliable and a cost saving approach that we wanted to go down.
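
For readers who want to see what “making a PaaS service private” roughly looks like, here is a minimal, hedged sketch using the Azure CLI. The resource group, VNet, and vault names are placeholders, and a real deployment would also need the matching private DNS zone (privatelink.vaultcore.azure.net) so the vault name resolves to the private IP; this is not the guest’s actual configuration.

```bash
# Minimal sketch (names are placeholders): put a key vault behind a private endpoint
# so only workloads inside the VNet, such as self-hosted build agents, can reach it.
KV_ID=$(az keyvault show --name kv-example-secrets --query id -o tsv)

az network private-endpoint create \
  --resource-group rg-build-infra \
  --name pe-kv-example-secrets \
  --vnet-name vnet-build \
  --subnet snet-private-endpoints \
  --private-connection-resource-id "$KV_ID" \
  --group-id vault \
  --connection-name kv-example-secrets-conn

# Then shut the public door (deny public network access on the vault).
az keyvault update --name kv-example-secrets --public-network-access Disabled
```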

[00:03:35.600] – Ned
Okay, that was a lot of information, so let me back you up a little bit on that. The first point that I want to hammer home a little bit is what you said: generally speaking, everything you deploy in the cloud is public by default. So if I spin up an Azure key vault, the connection endpoint for that is public. Anybody can connect to it. They have to have the right credentials, right?

[00:03:56.760] – Sai
Yes.

[00:03:57.060] – Ned
But the actual endpoint is available. So what you’re saying is, due to the industry you’re in, you’re in the healthcare industry, you need to be more secure. And so you want to take all these public endpoints and make them private, right?

[00:04:11.000] – Sai
Yes.

[00:04:11.590] – Ned
Okay, so that’s the first problem is you’ve made everything private, and then you mentioned public build agents. Where are these agents running and who’s providing them?

[00:04:20.860] – Sai
So making these PaaS services private, that solves one of the problems. The second thing is, when you start to try to run pipelines against these resources, the public build agents that Microsoft provides as part of the default access that we get on Azure DevOps or GitHub can’t access them, because now all of these are behind a firewall or actually fully on a private network.

[00:04:41.790] – Ned
Okay, I see. So those build agents that are kicked off by the pipeline, those aren’t running inside your network, they’re running just in the public Internet and they can’t get to these private endpoints. So suddenly that doesn’t work. So the solution is to bring build agents inside the network. And you did that by were you spinning up virtual machines or containers?

[00:05:03.690] – Sai
So initially when we went down that route, we started to have virtual machines, VM scale set based virtual machines, which would register into Azure DevOps or GitHub. Teams can then assign tasks to those build pools, and then they actually have access to these resources to perform tasks against them, like the deployment or configuration or whatever.

[00:05:22.120] – Ned
Right. And then the other thing you mentioned was cost and running those VM build agents. I guess the VMs are on all the time, right?

[00:05:31.790] – Sai
True. Yeah. So initially, I think some time back, Microsoft never had an option to actually auto scale these machines or kind of destroy these machines when there were no tasks available. So a minimum number of machines is always running for jobs to be picked up and so on. That actually poses a problem, because now we’re paying for three or four or five VMs that are always on, which are not really doing anything, but they’re always available to run a job, which makes it expensive for us.

[00:05:56.750] – Ned
I think I understand the problem statement in general, and it’s similar to what Kyler described in her episode, but Windows, man. Yeah.

[00:06:04.810] – Sai
So with that. So from there we actually had like a pool of Linux machines and a pool of Windows machines available to run these jobs. So the plan was, we can actually make everything containerized. We can have these tools installed on containers. Containers right now have a lot of functionality, benefits and so on. We can host up to maybe 40, 50 or even more agents on just six or seven container nodes, the hosts which run the container clusters. That gives us a huge cost benefit: like 40 agents available at any point of time and you’re only paying for six machines of the underlying fabric. It actually gives us a huge cost benefit. So that’s one of the reasons why we wanted to go to containers. The other advantage of using containers in this scenario is all of these agents, like the DevOps agent from Azure DevOps or even the GitHub agent, they actually have a flag which you can set to only run one time. The entrypoint of the pod is actually running the agent. So after the job is completed, the pod destroys itself and then there’s a new pod created, because now we have a minimum number of pods required in the cluster.
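
As a rough illustration of the run-once pattern Sai describes, here is a hedged Dockerfile sketch for such an agent image. The base image tag, the start.ps1 helper, and the AZP_* environment variable names follow Microsoft’s published self-hosted agent container example and are assumptions here, not the team’s exact implementation.

```dockerfile
# Hedged sketch of a "run one job, then exit" Azure DevOps agent container.
FROM mcr.microsoft.com/windows/servercore:ltsc2019

WORKDIR C:/azp

# start.ps1 (assumed helper script, not shown) downloads the agent package,
# registers it with config.cmd using the AZP_URL / AZP_TOKEN / AZP_POOL
# environment variables, and then launches:
#   .\run.cmd --once
# so the agent exits after a single job and Kubernetes replaces the pod to
# keep the pool at its minimum replica count.
COPY start.ps1 .

ENTRYPOINT ["powershell.exe", "-ExecutionPolicy", "Bypass", "-File", ".\\start.ps1"]
```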

[00:07:06.210] – Ethan
I feel like I’m missing something here about why Windows containers though, because I haven’t heard anything that couldn’t have been done with Linux.

[00:07:13.910] – Sai
Yeah. So Linux actually does a lot of good stuff nowadays. There’s nothing against it. But then some of the traditional tools like the Visual Studio build tools, the SQL build tools, are still stuck on Windows based machines. We can’t move them to Linux. We’ve got to have Windows based machines for them. There are some legacy DevOps tasks as well which we still need to use; with the Windows OS core, we can’t go away from it. So while we still have a higher number of Linux usage, we’ve got to support Windows build agents as well for those legacy jobs that have got to run. Hence Windows containers.

[00:07:47.390] – Ned
You can’t just tell all the application developers, hey, you’re moving to .NET Core tomorrow?

[00:07:51.990] – Sai
Yeah, you can’t do that.

[00:07:55.210] – Ned
You could, but they would all leave or just say no. Okay, so what was the build pipeline you had in place? We kind of went over that. What were you looking to create with those Windows containers for your build pipelines?

[00:08:09.640] – Sai
The way we designed it was, we wanted the Windows container to be at exact parity with the Windows VM that we were getting, at least for Visual Studio Code, Visual Studio, SQL tools. So anything to do with the Windows based OS that has got to run Windows tools, like the custom Java version that has got to be installed and so on. So our success criteria for that was to get Visual Studio, I think 2017, 2019, and all of those installed, along with one of the older versions of the SQL build tools, .NET Framework, .NET, and just a few more tools which support the application development environment.

[00:08:43.750] – Ned
Wow, that’s a lot of software, and that software is not small. I know when I’ve installed Visual Studio on my computer in the past, it’s been like, and I need 20 gigs of space to install this feature. Was that a concern when you were building out these agents?

[00:08:59.930] – Sai
Oh, for sure. So space was one of the smaller problems we had. So when we started kind of going down this route, I started to just pull the base 2019 image, I think the Server Core image or something like that, from Microsoft repositories, and started to kind of get these tools installed. So the first challenge we ran into is, unlike a Packer build on a traditional VM, we can’t really restart the Docker process. You can’t restart a machine. So the .NET Framework installation starts to fail, Visual Studio installation starts to fail, because after the installation is completed they want to restart themselves, which you cannot do in a Docker based world. So that was the first challenge. So then we had to switch to a different image which already had .NET pre-baked into the image, which again Microsoft publishes. So from there, then you start to install more tools. Then you start to find out that Chocolatey is your best friend now, because with Chocolatey you can actually install so many tools without the restart option, without having to update all the environment variables and so on. So the process goes like: get the image, install Chocolatey, install the newer PowerShell versions like seven and so on, PowerShell Core and so on.

[00:10:06.800] – Sai
And then after that Chocolatey starts to install Visual Studio 2017, Visual Studio 2019 and Data Tools and some more add-on tools like Java and so on.
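
A hedged sketch of the layering Sai describes: start from a Microsoft image that already has the .NET Framework baked in (so no reboot is needed), bootstrap Chocolatey, then let Chocolatey pull the heavyweight tools. The base image tag and the Chocolatey package IDs below are illustrative, not the team’s exact list.

```dockerfile
# escape=`
# Illustrative only: image tag and package IDs are assumptions, not the exact build from the show.
FROM mcr.microsoft.com/dotnet/framework/sdk:4.8-windowsservercore-ltsc2019

SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop';"]

# Bootstrap Chocolatey (no reboot required).
RUN Set-ExecutionPolicy Bypass -Scope Process -Force; `
    [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12; `
    iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))

# Newer PowerShell plus the build toolchain, all via Chocolatey so PATH and
# environment variables get updated without a restart.
RUN choco install -y powershell-core
RUN choco install -y visualstudio2017buildtools visualstudio2019buildtools
RUN choco install -y openjdk11 nodejs-lts
# (SQL data tools, Selenium drivers, etc. would be layered on the same way.)
```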

[00:10:16.470] – Ethan
And we’re building the container that ends up in the repo? Or, when the container is launched,

[00:10:20.770] – Ethan
it launches Chocolatey and does all this other install once the container is instantiated?

[00:10:25.270] – Sai
It’s built and hosted in the repo. Okay, so it’s not installed on launch. It’s actually already pre-built.

[00:10:31.220] – Ethan
This really is a huge container then.

[00:10:34.610] – Sai
It is huge. But it’s much smaller than your traditional VM. The traditional VM Packer builds take eight or 10 hours to complete. The same thing in this container will take like maybe two or 3 hours to build.

[00:10:46.370] – Ned
Yeah. When you’re starting with a Packer image, that was probably like you said, eight to 10 hours and I don’t know, 60, 70 gig of space.

[00:10:54.380] – Sai
Oh, for sure. Yeah.

[00:10:55.710] – Ned
Shrinking that down to a two to three hour build time, it’s not bad. Roughly how big were the container images?

[00:11:01.940] – Sai
I think 2.5 gigs. Yeah, I think 2.5, I don’t remember the exact number, but around 2.5 gigs is what we ended up with eventually.

[00:11:10.040] – Ned
Wow. I mean, that’s definitely smaller than I would expect. Okay, so when you’re building these agents, these images, are you also doing that build inside a pipeline?

[00:11:22.930] – Sai
Yes. The Docker stuff runs within the pipeline, and that is the only place where we used to actually use public agents, to run the Docker, the Windows Docker process, to actually build up these images. Once the image is built, it gets pushed into our private Azure Container Registry, and from there the Kubernetes cluster picks it up.

[00:11:43.390] – Ned
Okay, that makes sense, right? You’re using ACR to host those images, but it has a private endpoint, or can it have a private endpoint? I’m not sure.

[00:11:53.680] – Sai
You can have a private endpoint on ACR, but in this scenario, since we were actually running the whole Docker process on a public agent, we had some fancy scripting built into the pipeline to actually get the public IP address of that agent whitelisted on the ACR. Then the public agent would actually have access to push, and then it’ll remove the firewall entry and all that kind of stuff, and then it’s set back to private again. That was the only place we had to do that, as I was mentioning earlier. So it’s not that you can’t use public agents. It’s not impossible, but it’s just difficult. You have to bake in all of these extra steps to actually change the firewall settings each time the pipeline runs and everything, which is painful to do.
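
Here is a hedged sketch of what that “fancy scripting” can look like in an Azure Pipelines job: punch a temporary hole in the ACR firewall for the hosted agent’s egress IP, push, then close it again. The registry, repository, and service connection names are placeholders, and the ipify lookup is just one common way to discover the agent’s public IP.

```yaml
# Hedged sketch; registry, repository, and service connection names are placeholders.
steps:
- task: AzureCLI@2
  displayName: Build and push through a temporary ACR firewall exception
  inputs:
    azureSubscription: acr-service-connection     # assumed service connection name
    scriptType: ps
    scriptLocation: inlineScript
    inlineScript: |
      # Discover this hosted agent's public egress IP (ipify is one common way).
      $agentIp = Invoke-RestMethod -Uri 'https://api.ipify.org'

      # Whitelist it, push the image, then remove the rule again.
      az acr network-rule add --name myprivateacr --ip-address $agentIp
      az acr login --name myprivateacr
      docker build -t myprivateacr.azurecr.io/agents/windows-vs2019:latest .
      docker push myprivateacr.azurecr.io/agents/windows-vs2019:latest
      az acr network-rule remove --name myprivateacr --ip-address $agentIp
```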

[00:12:31.570] – Ned
Got you. So what would you say were the main challenges you found when you were trying to set up that build agent creation process?

[00:12:39.730] – Sai
The biggest challenge we had was having multiple tools installed, like Visual Studio. That was not really the problem, but we started to go down the route of having Java and NodeJS and those kinds of tools installed. They become very tricky to have multiple versions installed, because they all try to install to the same path, the same location and stuff. You have to kind of change, tweak that. Then after you install it, you have to also actually update the environment variables. That becomes a challenge again in the container world because it’s not very easy to do that. You have to have very nice scripting around it, and then the overall process to build it, and sometimes it crashes because Visual Studio fails at some point in time. So the whole process actually is a bit challenging.

[00:13:22.030] – Ned
It sounds like it. Were you making multiple images that had different software installed? So maybe this is the version that has Visual Studio 2017 and this version of Java, or were you trying to put all the tools on one big image?

[00:13:38.420] – Sai
We were actually trying to do everything in one big image. Maybe that was the issue we were running into. So we were actually hosting it for various teams to use. So we didn’t want to go down the route of having multiple smaller images and then having to host them independently, build them independently and so on. So we kind of wanted to mimic what we had in the VMs. Additionally, tools such as Headless Chrome, Selenium drivers, all of them also have to be installed. They actually became a very big challenge for us, because those tools need custom fonts, and it’s not easy to install custom fonts on a container image, because you have to pull them from Microsoft and then you have to download them. It’s a more challenging thing. So we wanted to kind of pump everything into one image and just keep the one centralized image. And if teams wanted to add a tool, the point was, here’s the Docker definition. If you want to add a tool, just contribute to this definition, you add the tool into it and it’s available for you to use. That’s what we wanted to kind of encourage the teams to do as well. Then we soon ran into challenges where not all tools are supported on containers; like, you can’t have WSL installed.

[00:14:38.870] – Sai
Some team wanted to have Bash installed, WSL installed. Those are not supported on containers as of now.

[00:14:44.740] – Ned
Yeah, because that uses Hyper-V in the background. I’m guessing Hyper-V won’t run on Docker.

[00:14:50.410] – Sai
Yeah. Those are challenging things which kind of make it difficult to use containers in a build context for all jobs.

[00:14:58.090] – Ned
I want to back up to something you said which was just wild to me, which is you have to install custom fonts for an operating system that, for all intents and purposes, is not going to be using fonts at all. Yeah.

[00:15:09.480] – Sai
So I think that’s the thing, right? When you get a container image, those images don’t come with the fonts installed, unlike a full scale VM, a full scale OS image. So Headless Chrome and Selenium, they require certain kinds of fonts to run and they aren’t available. There is some PowerShell script which you can download and install and so on, but it kind of gets complicated. It’s not pre-baked into the image. That’s what I’m trying to say.
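
For the curious, the workaround that usually shows up in community write-ups is to copy licensed font files into the image and register them in the registry yourself, since Server Core ships without most fonts or the shell APIs a normal font installer uses. A heavily hedged sketch, not an official Microsoft recipe and not the exact script Sai used:

```dockerfile
# escape=`
# Hedged sketch: ship a font into a Server Core based image so Headless Chrome and
# the Selenium drivers have something to render with. The font file is assumed to be
# licensed for this use and present in the build context.
FROM mcr.microsoft.com/windows/servercore:ltsc2019

COPY fonts/arial.ttf C:/Windows/Fonts/arial.ttf

RUN powershell -Command "New-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Fonts' -Name 'Arial (TrueType)' -Value 'arial.ttf' -PropertyType String -Force"
```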

[00:15:33.650] – Ned
Yeah, I know. It’s wacky, because fonts on a headless, serverless... basically it’s like, why?

[00:15:42.400] – Ethan
Does this make the container fragile, Sai? Because you’ve had to solve all these really unique and specific problems to get the container built successfully. Does it mean that you’re constantly revisiting this thing because something broke three months later?

[00:15:55.230] – Sai
For sure. So something either breaks or it becomes incompatible, or a newer version of the job doesn’t work because the new tools need to be installed. It’s more maintenance from an image point of view because not everything is properly installed like a full scale VM.

[00:16:09.330] – Ned
Right.

[00:16:09.890] – Sai
And then you also can’t install these Headless Chrome kind of tools, they just won’t install because the fonts are missing. Then when you have the test cases running for web tests and stuff, those jobs don’t run because they don’t have Headless Chrome installed, they don’t have Selenium drivers installed.

[00:16:25.750] – Ned
Looking to the positive side of things, what were some of the successes you had out of adopting this Windows container model?

[00:16:33.970] – Sai
Oh, cost was one big thing. Having agents available at any given point of time to be able to run these jobs, and having less than, I want to say less than a second wait time on build agents on the build pipelines, was a big thing for us. And we were able to have one AKS cluster that actually has both Linux and Windows node pools attached to it, like a maximum of six servers. And then I was able to have 40 different agents across multiple OS types available. If this was a VM scale set, just imagine, I would have actually needed to have three or four scale sets deployed, each of them having so many machines available, and then they have to scale up and so on. All of that was gone. We had a good success. We actually ran with this kind of model for a long time.

[00:17:22.220] – Ned
Okay, so the build agents and you mentioned this before, I think they’re all running in Azure Kubernetes service on AKS cluster. And then you mentioned you had Linux node pools for the Linux build agents and Windows node pools for the Windows agents. I haven’t actually used Windows node pools at all in AKS. Were there some challenges around using Windows hosts?

[00:17:47.770] – Sai
I don’t know about challenges, but there are some gotchas though. Maybe it was for me because I don’t have a huge AppDev background. Maybe it was more for me, for my learning. When working with Windows based node pools, you have to actually tag all the nodes correctly and use the tags in your Kubernetes deployment. If you don’t do that, the Kubernetes deployment actually picks up that image and will deploy it on any available node, like it could be Linux as well. And those pods don’t start to come up then, because of incompatibility. So that’s one thing which was like, okay, I got you there. You have to kind of tag your machines correctly and then use the tags correctly in your deployment as well, especially in the Kubernetes definition, the kubectl apply definitions, the deployment YAML config file. Other than that, it actually was pretty smooth. Like, you can’t really run an independent Windows Kubernetes cluster. You have to always have a Linux pool attached to it, because Linux is where the controllers and everything are installed. And then your Windows pool is more like an add-on to the existing cluster.

[00:18:44.950] – Sai
In my scenario, rather than having two independent clusters and running almost like nine machines, we got everything onto one cluster. One set of YAML file definitions actually gets deployed onto the Linux side, and all the Windows ones, with the correct tags and everything, get deployed onto the Windows nodes.
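
To make the tagging point concrete, here is a hedged Kubernetes Deployment sketch that uses the built-in kubernetes.io/os node label to pin agent pods to the Windows node pool. The names, image, and AZP_* environment variables are placeholders, assumed to match whatever the agent image’s start script reads; they are not taken from the show.

```yaml
# Hedged sketch: schedule agent pods only onto the Windows node pool.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: win-build-agents
spec:
  replicas: 10                        # the minimum number of warm agents
  selector:
    matchLabels:
      app: win-build-agent
  template:
    metadata:
      labels:
        app: win-build-agent
    spec:
      nodeSelector:
        kubernetes.io/os: windows     # built-in OS label on AKS nodes
      containers:
      - name: agent
        image: myprivateacr.azurecr.io/agents/windows-vs2019:latest
        env:
        - name: AZP_URL               # assumed variable names read by the image's start script
          value: https://dev.azure.com/contoso
        - name: AZP_POOL
          value: Windows-Containers-VS2019
        - name: AZP_TOKEN
          valueFrom:
            secretKeyRef:
              name: azp-agent-token
              key: token
```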

[00:19:01.770] – Ethan
[AD] We pause the podcast for a couple of minutes to introduce sponsor StrongDM’s secure infrastructure access platform. And if those words are meaningless, StrongDM goes like this. You know how managing servers, network gear, cloud VPCs, databases and so on is this horrifying mix of credentials that you saved in PuTTY and in super secure spreadsheets and SSH keys on thumb drives and that one doc in SharePoint you can never remember where it is? It sucks, right? StrongDM makes all that nasty mess go away. Install the client on your workstation and authenticate, policy syncs, and you get a list of infrastructure that you can hit when you fire up a session. The client tunnels to the StrongDM gateway, and the gateway is the middleman. It’s a proxy architecture. So the client hits the gateway and the gateway hits the stuff you’re trying to manage. But it’s not just a simple proxy, it is a secure gateway. The StrongDM admin configures the gateway to control what resources users can access. The gateway also observes the connections and logs who is doing what, database queries and kubectl commands, etc. And that should make all the security folks happy.

[00:20:06.520] – Ethan
Life with StrongDM means you can reduce the volume of credentials you’re tracking. If you’re the human managing everyone’s infrastructure access, you get better control over the infrastructure management plane. You can simplify firewall policy. You can centrally revoke someone’s access to everything they had access to with just a click. StrongDM invites you to 100% doubt this ad and go sign up for a no BS demo. Do that at StrongDM dot com slash packet pushers. They suggested we say no BS, and if you review their website, that is kind of their whole attitude. They solve a problem you have, and they want you to demo their solution and prove to yourself it will work. StrongDM dot com slash PacketPushers, and join other companies like Peloton, SoFi, Yext and Chime. StrongDM.com slash packet pushers. And now back to the podcast. [/AD]

[00:21:00.140] – Ned
Yes, that makes a lot of sense. I knew that. Obviously the master nodes in the configuration, those are all running Linux because they have to. I didn’t really think about the fact that the default node pool is going to be Linux as well, so you can scale it down to one, but you still need that one there.

[00:21:16.010] – Sai
Yeah, you definitely have got to have a Linux machine with Azure Kubernetes.

[00:21:21.560] – Ned
Got you. So on the application development team side, how did they go about selecting the proper agents to do their build? How did they differentiate between one Windows agent or another?

[00:21:36.490] – Sai
We actually had them nicely tagged. The agent pool names were actually kind of representative of what tools they run with. So if they wanted to run the tools that we were actually offering as part of the container stuff, they get tagged with Visual Studio 2017 or stuff like that, and then they can actually call the pool definition within their YAML pipeline definitions, and the containers actually pick up the job and then run it.
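
From the app team’s side, picking up one of these pools is essentially a one-line change in their pipeline YAML. A hedged sketch, with the pool name standing in for whatever naming convention the platform team publishes:

```yaml
# Hedged sketch of a consuming team's azure-pipelines.yml; the pool name is a placeholder.
trigger:
- main

pool:
  name: Windows-Containers-VS2019     # private container-based agent pool

steps:
- script: msbuild MySolution.sln /p:Configuration=Release
  displayName: Build with the Visual Studio 2019 build tools on the container agent
```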

[00:22:00.080] – Ned
You made it pretty easy and straightforward for them to just plug right in and start using your build agents.

[00:22:05.160] – Sai
Yeah. The advantage we had is that all of the definitions actually can be fed into the Docker config itself as part of the agent installation. So as the container pod is coming up, it actually knows which agent pool it will register to, whether it’s a GitHub agent, whether it’s an Azure DevOps agent, what it is, where to go. All that stuff is baked into the Docker definition itself, so it’s all pre-built into the image. There’s no additional stuff that needs to be done at spin-up time. That way we were able to reduce our spin-up times to, I think, maybe 90 seconds or something for the Windows machines. So unlike a traditional VM, where you have at least a few minutes of spin-up time, I was able to get new agents, new pods, registered and online within as soon as 90 to 120 seconds, and within three to four minutes we’ll have so many more pods available. The same thing for VMs is not possible. You only get one or two machines.

[00:22:52.890] – Ned
Okay. So if a bunch of developer teams submitted build jobs all at the same time, you’d be able to scale out to handle that relatively quickly.

[00:23:01.530] – Sai
Yes.

[00:23:02.100] – Ned
So they’re not just all waiting in line for the three machines that are running.

[00:23:05.900] – Sai
Yeah, exactly.

[00:23:09.310] – Ned
When you were setting all this up, the Windows containers and the Windows hosts on AKS, did you find the documentation really good or did you have to reach out to the community a lot?

[00:23:18.310] – Sai
No, the documentation is really good. There are some sample definitions from Microsoft on how the Docker file is configured for Windows containers. And there’s already a lot on Stack Overflow and on GitHub. There are a lot of comments about how to do these things, especially the Chrome headless stuff and all this. I got it from there. It’s almost next to impossible to do it correctly otherwise. I think it’s doable, but then you have to actually have the correct script, you have to actually have the fonts, all that kind of stuff correctly installed. So that’s where the documentation is actually good, from Microsoft and the community as well.

[00:23:53.790] – Ned
It sort of ties into what we were talking about before the recording. Like, some things are not documented well in Azure. I think we can all acknowledge that. And I’m glad to hear this is not one of those cases.

[00:24:04.430] – Sai
And the best thing is, I think you actually can contribute back, right? All of these are like GitHub pages, which you can then make a PR against and say, hey, I have some updates for the documentation and stuff. So I like that part of the documentation. These are all from Microsoft. So in case you find something missing or something is incorrectly defined because of older versions, it’s always great to just submit a PR and have them update it.

[00:24:26.390] – Ned
When we were prepping for the show, I was all excited to talk about Windows containers, and you’re like, yeah, we did all the stuff, and then towards the end you’re like, and then we got rid of it and went back to Windows VMs. Can you talk a little bit about why that might be?

[00:24:41.820] – Sai
I think at some point we actually got to the stage where it wasn’t practical enough to run the VM scale sets along with Windows containers, and again have Linux containers and everything running at the same time. And then we were unable to provide all the tools and the capabilities that the app teams wanted to run their pipelines with, which started to become a problem, because then they start to use public agents, or you start to get into trouble where they just have jobs waiting for our agents to come up and then do the jobs, and it kind of becomes more challenging. And admittedly, now I have to maintain a Packer definition as well as a Docker definition, because I have two sets of pools, one to do a specific set of tasks and the other to do a wider variety of tasks and so on. So it was like, okay, let’s scrap this whole Windows container thing, let’s move back to the Packer based VMs. That was one of the reasons. The other thing which we actually found out was Azure DevOps now supports auto scaling and self destruction of agents automatically. So they do all of that stuff for you.

[00:25:36.580] – Sai
So in Azure DevOps now you can actually say, hey, this is my VM scale set, this is the subscription it’s actually in, and these are the credentials for you to manage it. It will install the tools, it’ll actually install the agent, it will update the agent on a regular basis. It actually scales up, scales down the agents based on the number of jobs waiting for an agent to run. So if you have the build pool and you only have three agents and you have five jobs waiting, they kind of start to scale up to seven or eight available agents all the time. And then once the job is completed, the agents start to self destruct. So they clear up, they delete the VM, and a new one is spun up again. The whole cycle of what we were doing in containers is now baked into Azure DevOps by default. So there’s really no fun in doing this stuff. Like, it was cutting edge, it was all interesting, it was a fun project for us to do. But then Microsoft kind of solved the problem for us in Azure DevOps, so we went back to the VM scale sets.
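
For reference, the scale set behind such an elastic pool is created with a few specific flags (no overprovisioning, manual upgrade policy, no load balancer) per Microsoft’s published guidance. The sketch below is hedged, with image and names as placeholders; the pool itself, including min/max agents and the tear-down-after-job behaviour Sai describes, is then attached in the Azure DevOps agent pool settings rather than from the CLI.

```bash
# Hedged sketch: a VM scale set suitable for an Azure DevOps elastic agent pool.
# The image reference points at a custom Packer-built image and is a placeholder.
az vmss create \
  --resource-group rg-build-agents \
  --name vmss-win-build-agents \
  --image /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/images/<packer-built-image> \
  --vm-sku Standard_D4s_v4 \
  --instance-count 2 \
  --disable-overprovision \
  --upgrade-policy-mode manual \
  --single-placement-group false \
  --platform-fault-domain-count 1 \
  --load-balancer ""
```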

[00:26:29.010] – Sai
So now at any given point of time, even though we might have like ten or twelve machines running, they only run for the job and they kind of go away after the job is completed. So once there are no jobs for 30 or 40 minutes, the agents go back to one or zero. You can go down completely.

[00:26:44.390] – Ethan
But you were talking about the container builds taking two to 3 hours, and then the Packer builds for the full VMs like eight or 9 hours. You can live with that, it sounds like.

[00:26:52.630] – Sai
We can live with that because we run most of our jobs overnight, every alternate day, kind of overnight, and we don’t want to take up the morning compute time with all the application teams and everything. So the challenge was not the time of the whole Packer build; the challenge was maintaining it and updating it and the cost associated with it. So we can definitely live with the time to build up those tools, especially with the advantage that now we are able to provide all the tools that Microsoft provides. The other thing is that Microsoft publishes their Packer definitions in a GitHub repo. You actually can download the latest tag and everything, and then you can actually build your own image based on the definition that they provided, host it internally, and then use that to host your agent. So that way app teams don’t have an excuse like, hey, your build agent doesn’t have this tool, because you’re at the same parity as what Microsoft is running.
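
As a rough, hedged sketch of that workflow (repository layout, tag names, and template paths change between releases, so treat these as illustrative rather than exact):

```bash
# Reuse Microsoft's published image definitions rather than writing Packer templates from scratch.
git clone https://github.com/actions/virtual-environments.git
cd virtual-environments

# Check out the release tag that matches the hosted image you want to mirror.
git checkout <release-tag-for-the-windows-2019-image>

# Build your own copy in your own subscription; the var file holds your Azure credentials
# and resource details. The template path is illustrative.
packer build -var-file=azure-vars.json images/win/windows2019.json
```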

[00:27:45.020] – Ned
I don’t think I realized that. So you’re not writing these Packer templates from scratch. You’re taking something Microsoft has already put together. And it’s the same or similar to what their public build agents already had.

[00:27:59.220] – Sai
That’s what the documentation at least says: the tags that they provide with those Packer definitions are what they use within Azure DevOps and GitHub as well. So we take the same image, the same Packer definition, modify it a little bit with anything extra we want, and that’s it.

[00:28:17.570] – Ned
Part of the reason that you stopped using Windows containers was because there were still, I’m guessing some application development teams that could never move to that model, correct?

[00:28:26.400] – Sai
Yeah.

[00:28:27.770] – Ned
Okay, so you still had to maintain the Windows VMs for some of the teams. Plus you were doing the container images and Linux on top of that.

[00:28:40.290] – Sai
And as we started to move to more Docker based builds, Windows containers don’t really run Docker well. So running Docker within a Docker image is a little more challenging than it sounds. Maybe now on Linux you can do it to a certain degree, but on Windows you still can’t do it.

[00:28:54.290] – Ned
What would the situation be where you’d want to run Docker in Docker?

[00:28:58.030] – Sai
So if an app team wants to actually build a Docker image and then publish it for their application, now they are stuck.

[00:29:06.810] – Ned
Okay, so if they’re doing a containerized application, they can’t use your agent as a build agent because it’s already a container.

[00:29:14.070] – Sai
Exactly. Yeah, that’s another challenge.

[00:29:16.430] – Ned
Okay, so in that case they would use a Windows VM or a Linux VM to do that image build process and publish it out. Okay, so the Windows VMs, the build agents that you move back to, those are now running inside your Azure subscriptions, inside a VNet that you control.

[00:29:36.270] – Sai
Yeah, they’re all private VM scale sets within our own network, which then connect to the various other networks and can perform the tasks that the teams actually define them to do.

[00:29:48.180] – Ned
Okay, so in a sense, Microsoft was like, hey, I see you have all these problems. Here’s a solution. You can stop duct taping things together for now.

[00:29:59.310] – Sai
But it’s still not available for GitHub though. So GitHub is still open; the auto scaling part of GitHub is still not baked into the platform. There are some other tools and workarounds to do it, but it’s not as easy as what Azure DevOps provides us with. So on the GitHub side, you’re still stuck with a starting number of agents and manual scaling. By manual, I mean some scripted approach of scaling up and scaling down, rather than how Azure DevOps gives it to us.

[00:30:24.020] – Ned
Okay, is that tied to GitHub Actions? It doesn’t have that available yet?

[00:30:28.090] – Sai
Okay, yeah.

[00:30:28.780] – Ned
Okay, so if you’re using GitHub Actions, you don’t have that functionality yet, but if you’re using Azure DevOps pipelines, then you do.

[00:30:35.880] – Sai
Yes.

[00:30:36.590] – Ned
And you’re using both.

[00:30:40.050] – Sai
Yeah, we have a combination of GitHub as well as Azure DevOps.

[00:30:47.590] – Ned
Got you. Okay, yeah. Why not? Any Jenkins going on in there too, just for the fun of it? A little GitLab CI? I don’t know what you have to do. Even though you ultimately went back to Windows VMs, I’m sure that you still gleaned some additional insight or some lessons learned or some additional knowledge from trying to use Windows containers. What lessons did you learn that you could apply to future projects?

[00:31:18.110] – Sai
I think the biggest takeaway would be: if you have a static set of tools and a limited kind of jobs that you plan to run on the tool, like if it’s a pipeline that’s only going to do this version of Visual Studio, this version of Java and so on, it’s always good to have Windows containers if you have any lightweight jobs. I know Terraform runs on Linux as well, but Terraform kind of jobs, or any traditional PowerShell stuff which needs to run on Windows based PowerShell, the older versions of PowerShell and so on, Windows Containers are really good in that aspect. Also the documentation is very good from Microsoft, and it’s a good place to start. Like, if you want to really start off with containers and see what’s going on and get all the tools installed, play around with it. I think this is really a good, nice project to do.

[00:31:59.390] – Ned
Okay. Would you ever run Windows Containers locally? The reason I tend to use containers locally is to try out a piece of software that I don’t want to install, or I need to spin up three or four in a cluster to get them to work together. Have you done something like that?

[00:32:15.090] – Sai
No, not really.

[00:32:19.510] – Ned
So do you think Windows Containers are a viable solution for some problems out there? It sounds like you had the one in particular, which is if you have a very limited, static set of tools. Are there any other reasons you think you would use Windows Containers?

[00:32:32.810] – Sai
I think from an appdev point of view, it’s always good to have a container versus a full scale VM. So if they can get the application to run on a container, even if it’s an IIS server or something like that, if they can do it on a container, why would they want to host a full fledged VM? So containers do work. They can solve some use cases, but I don’t think it’s suitable for everything. You can’t just bunch everything into containers and say, here you go, make it work. Yeah.

[00:32:56.760] – Ethan
The reasons you switched from containers back to VMs were pretty specific to your use case. You just had some problems that weren’t really nicely solvable in the container form factor, besides Azure giving you the tools you needed anyway.

[00:33:08.800] – Sai
True. So not only our use case, but in general the whole build agent concept of running and tearing down containers and trying to support all the tools, all the versions. That’s where containers become more of a problem, right?

[00:33:21.810] – Ned
They’re meant to be lightweight and small, and if you end up putting tons and tons of stuff on them, then it just becomes a nightmare to manage that image. Well, this has been a really interesting conversation. I’m glad to talk to someone who actually used Windows Containers, because I remember when they introduced them, God, that had to be like four or five years ago. And I thought to myself, but why? Yes, there are some realistic applications for it. I know this in the long term didn’t work out for you, but it certainly filled a need for some period of time, and others could use it simply for application development. So it’s really interesting to know. Do you have any key takeaways or things you want the audience to walk away hearing, I should say?

[00:34:09.230] – Sai
I think Windows Containers work. They are real. They do have a good use case for a set of tasks and everything. Other than that, I think the takeaway which I can provide is the GitHub repository for the Packer definitions. In case you’re trying to host your own private agents within a network, I would highly recommend not to reinvent the wheel and try to install the tools yourself. The Packer definitions are all available from Microsoft. Please review them, download them, and use them within your environment.

[00:34:37.370] – Ned
All right. If folks want to know more about you, are you a social person? Do you blog or are you on Twitter?

[00:34:43.850] – Sai
I blog on Medium. Not a lot, but I do blog on Medium, and I’m on Twitter. I’m available at asgr.medium.com and @asgr86 on Twitter.

[00:34:53.280] – Ned
Well, Sai Gunaranjan, thank you so much for being a guest today on Day Two Cloud. And hey, listeners, stay tuned for a Tech Byte from Singtel. That’s coming up right after this. Welcome to the Tech Bytes portion of our episode. We’re in a six part series with Singtel about cloud networking, that is, how to make your existing wide area network communicate with cloud services in an effective way that maybe your legacy WAN isn’t able to. Today is part four of six, and we’re chatting with Mark Seabrook, global solutions manager at Singtel, regarding some customer problems where they’ve had large WANs deployed but found those wide area networks insufficient because of their workloads found in public clouds. Mark, welcome back to Day Two Cloud. You’ve got some customer stories, which are our favorites, to share with us, and we want to focus on the problems those customers were dealing with in this Tech Byte. You don’t have to name names because I know that’s a contentious thing to do. But could you first summarize the type of network your customers had, some sort of a large MPLS WAN, right?

[00:36:05.630] – Mark
Yeah, absolutely. So a lot of customers had, or used to have, global MPLS networks. So we would put out one or two MPLS legs per site, thousands of sites across the world. One of the big problems when moving to the cloud is everything was routed back via regional private data centers. There was no local breakout, no local Internet breakout. So we had a lot of issues with SLA levels. For example, different parts of the world have different SLAs, and just overcoming that meant breakout at a local level, for flexibility.

[00:36:49.890] – Ned
Right. So if I could paint a picture a little bit, you’ve got all these different sites that have a network coming back to a central location in their region. So if we’re in the US, everything is coming back to New York or something, and then it’s going out to the Internet or to the cloud provider. And that’s pretty inefficient from a routing perspective. So you’re implementing something to change that.

[00:37:10.270] – Mark
A lot of our customers’ sites would have a single MPLS link, and over the years we’ve introduced a DIA Internet circuit at each site. However, they still didn’t have the ability to monitor it and get like a 10,000 foot overview of what was going on across the globe or across the region.

[00:37:31.240] – Ethan
So, Mark, like you say, direct Internet access, as in they’re pushing a lot of traffic just directly to and from the Internet to get to their cloud services, and the rest of it would be going over the MPLS.

[00:37:43.950] – Mark
Yes. When I say DIA, we’re talking dedicated Internet access. However, it’s a pure underlay. So unless you introduce SD-WAN at the site level with the orchestration at a regional level, you’re not going to have any control over that DIA. So we got to a stage where a lot of sites were pushing 70% of their traffic actually over the DIA, the internet pipe, as opposed to the MPLS, but they still didn’t have any control over it. So by introducing an SD-WAN across the network, we could control the local cloud breakout. We’ve got the orchestrator at a regional and a global level, and we can look at anything at any time and tweak anything in real time anywhere across the world.

[00:38:34.230] – Ethan
Okay, SD-WAN as in, now we’ve got an overlay on top of the MPLS circuit and the dedicated internet access circuit, and you can apply policy to that to have routing go over whichever circuit you want to meet whatever traffic forwarding criteria you’re looking for.

[00:38:51.300] – Mark
Absolutely. Not only that, at a lot of sites where we moved away from MPLS and went to a dual fiber internet solution, we could actually still give an MPLS-like SLA in various parts of the world, simply by the redundancy and the load balancing magic that happens on an SD-WAN.

[00:39:13.910] – Ned
Load balancing magic. I love the way that you put that. It definitely sounds like an improvement over the MPLS and the separate DIA. Were there some concerns, either from security or privacy, when it came to moving from dedicated circuits and MPLS over to an SD-WAN type solution?

[00:39:34.590] – Mark
Absolutely. So we have some government customers where they’ll probably never move away from a private MPLS or private layer two connectivity. However, for a lot of commercial customers, the IPsec tunnels, the security, pointing the internet breakouts through, like, a Zscaler, for example, soothes a lot of the fears that the customers did have in going from an MPLS to pure internet at a site level.

[00:40:05.070] – Ned
Okay. So the customers that we’re talking about, they were looking to have additional control at the branch level. What types of things were they trying to control for, or were they really just trying to get visibility and monitoring or both?

[00:40:20.190] – Mark
I’d say the first thing is the visibility and monitoring. So if you look at a traditional MPLS world with regular routers at the CE level, there’s really not a lot that you’re monitoring minute by minute, day by day. And especially if you go to an internet model where you don’t have SD-WAN, your visibility into what you’re pushing over that is kind of limited. One of the wonderful things about the SD-WAN solutions that we’ve rolled out is the orchestrator. So from a global level, you can go to one screen, look at all your devices, click on a device, get into it, look at all the tunnels, look at exactly what’s happening in real time, all your underlays.

[00:41:23.170] – Ethan
When you do that in the orchestrator, you get a sense that there are two things happening. You’re talking about the underlays, these are the physical circuits, and then the overlay, the tunnels that are actually pumping traffic over the top of those circuits. But you have, via the orchestrator, a clear idea of what’s going on.

[00:41:42.010] – Mark
Yeah, absolutely. You can even narrow down to the bandwidth each tunnel is running at, the performance level. You can look at all your underlays. So you can look at parts of the world, you can pull up stats from parts of the world where DIAs are much more reliable than in other parts of the world, versus MPLS or layer two. Really, what you can monitor is only limited by your imagination, to be honest.

[00:42:11.370] – Ethan
So Mark, how about that monitoring? Another one of the advantages here that we’re getting is being able to dynamically react to changing network circumstances. So at one moment, maybe the MPLS is going to be best performing for certain traffic, and maybe DIA is best performing at other parts of the day, right?

[00:42:29.230] – Mark
Yeah. I mean, with, for example, Silver Peak, it’s doing that dynamically every second of the day, so you don’t even have to worry about it. If the local EdgeConnect detects some bandwidth fluctuations or some jitter or some packet loss on a particular underlay, it will push traffic over another link. It will use forward error correction, various tools, to build up that kind of, I call it a magic QoS, a way of establishing and maintaining that MPLS SLA that you’ve enjoyed for years, but over a couple of different, diverse Internet circuits.

[00:43:19.670] – Ethan
Okay, magic QoS as in, if it’s going over DIA, you don’t actually have hop by hop control as you would with a true QoS system, where you can tag the traffic with a DSCP value and then hop by hop there’s a behavior that the packet is to be treated in accordance with. We don’t have that with the Internet. It’s a best effort transport. So how do you get a QoS-like experience? Well, you monitor the behavior of the circuit end to end, and then push traffic over the circuit that’s going to deliver you the SLA you’re looking for at any given moment in time. You’re not guaranteeing behavior across the circuit, but since you know what the circuit is going to deliver to you, you can push traffic where it needs to go. That’s what you’re getting with magic QoS. It isn’t actually QoS, but it ends up with a similar result.

[00:44:07.370] – Mark
Yeah, correct. So basically, dynamically, the boxes are monitoring, from the site, all of the underlays, and in real time it’s moving traffic around across tunnels, across overlays. We also put out ThousandEyes Enterprise agents to a lot of our customers that go with our uCPE model. If you want to do some real deep dive diagnostics on some of the Internet underlay that the SD-WAN isn’t giving you, you’ve got ThousandEyes to go back and take a look and fine tune stuff. One thing I’ll also say, we do use deterministic Internet around the world. So we do have internet providers around the world that are partners, where we have tweaked routing at a BGP level to take more optimal routes that we can actually control.

[00:45:04.140] – Ethan
Okay, so that’s actually manipulating your traffic forwarding in the underlay in certain circumstances.

[00:45:11.510] – Mark
Correct. Also, on some of our MPLS nodes around the world we’ve actually got internet breakouts, so you can point from a local DIA, say in the States, to our gateway, say in LA, and then it will jump on a private deterministic route back to somewhere in Asia.

[00:45:31.160] – Ned
Okay, so that’s almost like a cloud accelerator product that you might see in AWS or Azure but this is private, correct?

[00:45:38.460] – Mark
Yeah, we also use that with our IP transit, our STiX product. So for example, if you were to point or connect to, for example, let’s just say our IP transit node in San Jose, we will give you an SLA and deterministic routing, obviously, back to somewhere in Asia.

[00:45:59.450] – Ned
Okay, I got you. Excellent. Well, thank you for joining us, Mark. And hey, thanks to everybody out there for listening. This was just part four of a six part series, so we’re going to hear more on building cloud ready networks with Singtel in upcoming episodes. Part five will be in a couple of weeks, and we’ll be reviewing solutions in the Singtel catalog that will help you turn your legacy WAN into a cloud ready network. Thank you to our guest for appearing on Day Two Cloud, and virtual high fives to you for tuning in. If you have suggestions for future shows, we would love to hear them. Hit either of us up on Twitter at day two cloud show, or you can fill out the form on my fancy website, nedinthecloud.com. Did you know that you don’t have to scream into the technology void alone? The Packet Pushers podcast network has a free Slack group open to everyone. Visit PacketPushers dot net slash slack and join. It’s a marketing free zone for engineers to chat, compare notes, tell war stories, and solve problems together. Packetpushers dot net slash slack. Until then, just remember, cloud is what happens while IT is making other plans.
