
Day Two Cloud 128: DevOps’ing All the Things

Today is a deeply technical episode on DevOps, Azure, Docker, Terraform, and more. Our guest is Kyler Middleton, Principal DevOps Network Architect. Kyler is also a Pluralsight author and frequent blogger (see links below for examples).

We discuss:

  • How and why Kyler went from network engineering to cloud
  • The environment she works in
  • What processes were like before and after a DevOps approach
  • Building a pipeline
  • Teaching other engineers
  • More

Sponsor: Juniper Apstra

Apstra’s Intent-Based solution simplifies data center network deployment, operations, and management from Day 0 through Day 2. It delivers automation and continuous validation of your data center network in multi-vendor environments. The result is savings on downstream costs and exponentially more value from your network investments. Find out more at juniper.net/packetpushers/apstra.

Takeaways:

  • CI/CD pipelines should be transparent and auditable.
  • Docker and Terraform enable clear and powerful deployment and scaling.
  • Standard technologies, like Docker, can help keep multiple clouds synchronized.

Show Links:

@kymidd – Kyler Middleton on Twitter

kylermiddleton.com, medium.com/@kymidd – Kyler’s blogs

Kyler Middleton on LinkedIn

Video: How We use Terraform to Deploy Container-Driven Builders in Azure DevOps – kylermiddleton.com

Let’s Do DevOps: Build AWS ECS on Fargate Using a Full-Featured Terraform Module – Faun.pub

Terraform_AwsEcsOnFargate_CompleteModule – GitHub

Let’s Do DevOps: Share ECR Docker Image and Secrets Between AWS Accounts – Faun.pub

Terraform_ADO_ECR_Multi-Account_Access – GitHub

Transcript:

[00:00:00.850] – Ethan
[AD] Sponsor Juniper Apstra's intent-based multi-vendor networking solution helps you build your data center network to a specific design, then make sure it stays within that spec. Deployment, automation and continuous validation. Find out more at Juniper dot net slash PacketPushers slash Apstra. [/AD]

[00:00:23.670] – Ned
Welcome to Day Two Cloud. And boy howdy, have I got an amazing show for you today. And I say "I" because it's just me. It's just Ned. I'm riding solo. Ethan had some things going on. I'm like, you know what? I got this. I can handle this. You know why? Because I am super pumped to talk to today's guest. Her name is Kyler Middleton, and she is the principal DevOps network architect at Veradigm, and she's going to be talking about building builders with Azure DevOps pipelines. We've got some Terraform in there, and we get into the nitty gritty.

[00:00:54.690] – Ned
So if you’re looking for a fun technical episode where we go way deep on some technology, this is the one for you. So stay tuned for this episode with Kyler Middleton.

[00:01:05.610] – Ned
Well, Kyler, welcome to Day Two Cloud. Now your title. You’re a principal DevOps network architect, which sounds like at least three different jobs to me. So tell me what’s going on with that title and what you do.

[00:01:20.800] – Kyler
Absolutely. They wouldn't let me put any other job titles in there or it would have been even longer. My background is network engineering, and I came into this as a network engineer, and then they said, hey, we're going to the cloud. You better figure it out. So I dove into that. I'm at a highly DevOps-automated shop in terms of CloudFormation and Terraform and infrastructure as code and automation, and just culturally, DevOps. I'd love to talk more about that. And so I just kept tacking them on, put another job into that job title.

[00:01:53.190] – Kyler
And in my experience, DevOps is everything. You’re the bridge between all the different specialties. So I find it a lot of fun to talk to everybody and be the bridge to everything.

[00:02:05.130] – Ned
Yeah, that’s awesome. You can kind of be a little bit of a generalist, even if you come from a specific background. And you said you come from a networking background, which I think before we started recording, you mentioned that’s kind of rare to find someone who comes from that networking background and makes the move into DevOps. Do you think there’s a reason for that?

[00:02:22.230] – Kyler
I think it's because people like me that wanted to hide away from all of the users find their safe haven with networking, because you get to sit in a dark, dank networking closet and plug in all the cables and never talk to anyone. But for some reason, I gravitated towards it. I find it really logical, like, hyper-logical, and interesting to think about latencies, and the Internet is just fascinating. It shaped our culture in a neat way, but it's just a network. It just connects the computers.

[00:02:57.050] – Kyler
So I was just curious about it and learned it and dove in. As for why networking people don't tend towards DevOps, I have no idea. There's a lot of software engineering that comes into it, and that was sure hard to take up coming from an infrastructure background. That's the hardest part, but I'm incredibly social for all of the roles I've ever had, which really helps. I think I can just attach to you and cling on and learn anything from you, sponge-wise. So that's the goal for all of us.

[00:03:32.400] – Ned
That totally makes sense to me. It definitely takes a certain type of personality to move from your vertical, where you feel comfortable, out to this crazy world where you're getting asked to learn about development. And a big part of that is learning how to automate things, which, I guess, in networking, automation feels like it's a few years behind. Can you tell me a little bit about what your automation journey looked like?

[00:03:56.490] – Kyler
Yeah, absolutely. A couple of years ago, I was a strong network engineer. I'm not Ethan level, but doing great. I owned my own consulting gig and was just bouncing around, traveling full time, making great money and hardly ever seeing my partner, which was a little sad, but doing great network-wise. And as you read in the industry and you meet people, you see the writing on the wall. Cloud is coming. Cloud is everything. And when cloud is dominant and eats the world, we're not going to have data centers.

[00:04:32.070] – Kyler
We're going to have a data center, or a couple. It's going to be Azure and Microsoft and AWS, and that's it, you know, and there's not gonna be your data centers. There's just, you know, your Wi-Fi access point, and that's it. So I find it very compelling, in an existential sense, that I learn automation and I learn DevOps and I learn cloud, because that's all there is going to be in a couple of years. And maybe that's a little bit of an exaggeration. I can be dramatic, if you don't hear that in my voice, but I think that's what's coming.

[00:05:07.860] – Kyler
So that's what I try to teach people: learn your vertical, be excellent at it, because it's foundational knowledge that you can piece together to learn other things, other verticals, or be the bridge of DevOps. But you need to learn cloud, you need to learn automation. It's coming. It's going to be your specialty. So get ready.

[00:05:26.750] – Ned
Wow. Yeah. And you mentioned teaching, and you, like me, are a Pluralsight author. So it seems like you like to do more than just a little bit of teaching, right?

[00:05:36.630] – Kyler
I do. It's sort of my hobby, in a sense. I would do it for free.

[00:05:42.321] – Ned
Shh, don’t tell anybody!

[00:05:42.420] – Kyler
So I've found a great hobby. I love it. The role that I'm in now, which I think we'll get into a little bit more later, is cloud platform ownership. And I get to just be the expert of cloud and of automation and of platform and tooling, and I don't solve my own problems. I solve everybody else's problems, because it's easier to teach everybody around me to fish than it is to fish for everybody.

[00:06:12.100] – Kyler
So I just end up training everybody to do my job. And I love that. I don't see any problem with that. I use this analogy a lot, of kingdom building, or moat building, around your specialty. You never let anybody in, so you can't get fired, right? But the business is moving, the industry is moving all the time, and if you're sitting still, you're getting left behind. So you have great job security, for a while.

[00:06:40.750] – Ned
You build a castle, you build a moat. Sure, you've kept other people out, but then you've also kept yourself in. You've bricked yourself into this castle. And then when things move on, you can't go with them. You've had this structure, you can't move. Wow, I like that way of thinking about it. Now, you mentioned your role in teaching people to fish. I think that's a huge point I definitely want to hammer on: you have all these other groups that are relying on you to provide a platform, and you don't want to be overwhelmed by all of them asking you to do stuff.

[00:07:14.640] – Ned
So you have to figure out, how do I teach them to help themselves? How do I build a platform that's self-service? And to that end, you created a presentation, I guess it was an internal presentation initially, called "Terraform to Deploy Container-Driven Builders in Azure DevOps," which, wow, that's a title. And then you shared it with the world, which is awesome. So with that title, you're going to have to break that down for me a little bit. What is going on in this presentation?

[00:07:43.570] – Kyler
You can tell, with job titles or presentation titles, I like to just keep adding things on and build this constellation of cool stuff. Yeah. So I own the platform with another engineer, and we also have someone half-time. So it's myself and Sai G and Jordan Cook over at Veradigm, inside Allscripts. And we own the platform, which I've said a lot, but I like to define it, because people don't know what that means.

[00:08:11.990] – Ned
What is the platform? Yeah, totally.

[00:08:14.370] – Kyler
We own all of the tooling. We own the how. So if someone says, I need to do a specific thing, can you help me? What tooling should I use? How do I secure it? How do I deploy it? How do I make it maintainable? That's us. We help define the languages and the standards, and if someone has a container that they need to build and deploy and manage, we've done that before, so we can help them move very quickly. So it's a ton of fun to be the sharp edge of getting stuff done.

[00:08:46.290] – Kyler
And I think I’ve lost the point of your question. I’ve lost the plot, but I’m having too much fun just talking.

[00:08:52.750] – Ned
It almost sounds like a center of excellence. We did a whole show on the cloud center of excellence, and that was more of a consultative thing, where there was a cloud center of excellence that would create some standards, but they weren't necessarily directly teaching anybody or running the clouds. They were just there to create and help develop standards that other groups would adopt. This sounds a little bit different than that.

[00:09:16.880] – Kyler
Yeah, totally. I think there are two functions that I usually bundle up into one. So function one is we establish the standards and also build supportive infrastructure to lower the bar. Like, someone needs to deploy a container. There's a lot of supportive infrastructure there: you need to check it into a CI, you need to test it, you need to deploy with the CD, run it, maintain it, manage it, monitor it. There's tons of stuff that could take you six months if you're doing it all by yourself.

[00:09:46.870] – Kyler
But if we have that supportive infrastructure in place, ready to go, you just check in a Dockerfile and I can do the rest of it in a day, because we've spent the time. So that is starkly different from when someone comes to us and says, hey, I don't understand what Docker is. Can you explain why it matters, what it is? What is Terraform? And why do I care? I get that all the time. So I'm constantly sort of proselytizing my religion of Terraform.

[00:10:15.190] – Kyler
My wife is not even in tech, and she is very tired of hearing me say this word. Terraform.

[00:10:21.250] – Ned
I’m laughing because I can relate. I use a lot of Terraform, and I’ve done courses on Terraform. My wife kind of knows what it is now. She’s like, oh, God, that again. But, hey, I love it.

[00:10:35.440] – Kyler
Totally! And on this specific presentation, this was us trying to get ahead of the challenges that are coming for our application developers. So we could see that we had a couple of teams that were starting to create Dockerfiles and deploy them and manage them inside of Kubernetes. And they were having a lot of trouble. Like, how does networking work inside Kubernetes? How does DNS work? If I want to have an SSL cert, what name do I use? Because Kubernetes has its own DNS namespace. The challenge is that I have no idea.

[00:11:08.920] – Kyler
I haven't done this before. So when we identify stuff like that, we go out and proof-of-concept it in the infrastructure, the supportive stuff that we've built and we maintain: is there an opportunity for us to integrate the new hotness, the new cool thing, into it? And it's maybe a little heavy. Like, we don't need Kubernetes to run our CI/CD, but we can use Kubernetes to run our CI/CD.

[00:11:31.820] – Ned
Why not?

[00:11:33.130] – Kyler
So that was part of this. We decided to convert all of our builders, all of the hosts that are registered to our CI/CD and run all the pipeline jobs, into Docker, and have them automatically rebuild and automatically patch and deploy, all sorts of stuff. Not because the CI/CD needed it, though I do think it benefited from it, but more because I know in the near future an application developer is going to say, hey, I need to do this. Can you show me how? And I can say, I've done that. Yeah. Absolutely.
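
As a rough sketch, a containerized build agent along these lines might start from a Dockerfile like the one below. The base image, agent version, download URL, and baked-in tooling here are illustrative assumptions, not Veradigm's actual configuration:

```dockerfile
# Hypothetical Linux build-agent image; versions and tooling are assumptions
FROM ubuntu:20.04

# Bake the baseline toolchain into the image so pipeline jobs don't
# install their own conflicting versions on a shared, long-lived host
RUN apt-get update && apt-get install -y \
    curl git unzip openjdk-11-jdk \
    && rm -rf /var/lib/apt/lists/*

# Pin the Azure Pipelines agent version so every rebuild is reproducible
ARG AGENT_VERSION=2.218.1
RUN mkdir -p /azp && curl -fsSL \
    "https://vstsagentpackage.azureedge.net/agent/${AGENT_VERSION}/vsts-agent-linux-x64-${AGENT_VERSION}.tar.gz" \
    | tar -xz -C /azp

WORKDIR /azp
# start.sh (not shown) would register the agent with the pool and run jobs
COPY start.sh .
ENTRYPOINT ["./start.sh"]
```

Because the whole toolchain lives in the image, bumping a tool version becomes a pull request against this file rather than drift on a long-lived VM.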

[00:12:06.110] – Ned
Okay. So you're not only building the platform for yourself, for jobs that come in and want to use that CI/CD pipeline, but if another group is thinking about adopting the same technology, you've got a template. Now you can say, oh hey, I have this workflow figured out. You can borrow it and build off of it, and they're not reinventing the wheel themselves.

[00:12:25.690] – Kyler
Absolutely. Because every Dev team, we operate as kind of a skunkworks inside Veradigm, which I think is becoming a more common practice, that you have these small strike teams, two-pizza teams in the AWS lingo, that are working on their own thing. And if all of them need to learn Kubernetes or Terraform or CI/CD separately, that's a lot of overlap, right? That's a lot of friction that we can remove from the system and improve everybody's lives. So that's our goal. No more friction, right?

[00:12:55.140] – Ned
Devs want to Dev. Let them Dev. Why not?

[00:12:57.440] – Kyler
Totally. I don’t speak Java. You do your Java, I’ll do the rest.

[00:13:00.870] – Ned
And I don't want to speak Java. I did that in college, and I'm dating myself here because this was Java in the early 2000s, but I had to learn Java, and as soon as I could forget it, I did. And I'm better for it, I think.

[00:13:16.500] – Kyler
100%. Same with PHP and JavaScript, and it's gone. I don't remember it at all.

[00:13:22.150] – Ned
So you’re using Docker to build agents that are going to run your CI CD pipeline. Have I got that correct?

[00:13:32.310] – Kyler
Yeah. Absolutely.

[00:13:33.520] – Ned
Okay. What did you do prior to implementing this?

[00:13:38.410] – Kyler
Totally. So we did the MVP, the very easy thing, and some of your listeners are going to cringe, I think, because this is not a good solution, which is why we iterated on it. We started with just simple EC2 or virtual machine instances in AWS or Azure that are long-lived. They're just simple machines where you would install Windows or Linux, and you put this little implant on it that registers it to your CI/CD in a pool, and they take jobs, and they're long-lived. So they receive a job and they run it, and they just continue going on, and they get the next job and they run it, which works great.

[00:14:15.770] – Kyler
Except we have a lot of different teams that need a lot of different things, like different versions of Java, different versions of all sorts of tooling. And Devs are going to find a way. They are ingenious at finding a way. So if the wrong version of Java is on our builders, they'll add a step to their pipeline to install their version of Java, which works great for them. And then the next person that runs their job hasn't touched Java, and it's suddenly broken. So we have all of these really fragile relationships that are only surfaced when a Dev team goes and does something, which is constantly.

[00:14:55.870] – Kyler
So this was a very fragile, security-problematic design. It was very simple, it was easy to get off the ground, but it definitely showed its problems as we started to scale.

[00:15:11.150] – Ned
Got you. Okay. So your builders are just virtual machines, like you said, or EC2 instances, and they are hooked into, I believe, Azure DevOps Pipelines to run your CI/CD. So when a pipeline kicks off, it needs to go run somewhere. And I know Microsoft provides hosted agents. You didn't want to use those. You wanted to use your own self-hosted agents. So you install the agent software on a virtual machine and it dials back to Azure DevOps and says, hey, I'm here. I can run a job.

[00:15:41.490] – Kyler
Yeah, absolutely.

[00:15:42.650] – Ned
It sounds like the problem, as you express it, is when one person goes and runs their build, and maybe they're using Maven or whatever to build their application, and it's a Java application, they need a specific version, and then the next person needs a different version. So you've got that detritus lying around your image. But you mentioned the security issue. So what was the security issue that you were seeing with this long-lived runner host that you had going?

[00:16:12.670] – Kyler
Absolutely. And this is systemically a problem with Azure DevOps. Maybe if there are any Microsoft engineers out there that want to fix this, please do. What we noticed was, when we would run a job, a lot of these CI/CD jobs need access to secure files, secure variables, or SSH keys or private keys. They download all sorts of stuff to do their jobs. And when the jobs are finished, they do not automatically clean up their workspaces. We can add a step that deletes it, but that's a manual step that we had to invent in-house.

[00:16:48.370] – Kyler
And the problem with that is these builders are long-lived and their workspaces are long-lived. So if someone submits a malicious job that uploads that workspace to a third party or their own Dropbox, they're getting all of the staged secure files for however long that builder's workspace has been active, which for our old model was months or years. That's a lot of data. And in a regulated environment, like, we're in healthcare, that's very scary. That's a very poor security practice.
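
For listeners following along, Azure Pipelines does expose a per-job workspace-cleaning option in YAML pipelines; a minimal sketch of opting into it (the pool and job names here are made up):

```yaml
# Illustrative azure-pipelines.yml fragment: wipe the agent's workspace
# before each run so staged secrets from earlier jobs don't linger
jobs:
  - job: build
    pool:
      name: SelfHostedLinux   # hypothetical self-hosted pool
    workspace:
      clean: all              # clean sources, artifacts, and outputs
    steps:
      - script: echo "build steps go here"
```

The catch, as described above, is that this is opt-in per pipeline rather than enforced automatically, so a single team that forgets it leaves stale workspaces on a shared, long-lived agent.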

[00:17:24.670] – Ned
Got you. Okay. And I guess the other downside to that, beyond just security and your application teams being at loggerheads, is when there are no jobs running, that agent is just hanging out, consuming compute time, doing nothing.

[00:17:41.110] – Kyler
Absolutely.

[00:17:41.610] – Ned
It would be nice to just shut it down. Did you try something like implementing virtual machine scale sets, or auto scaling on AWS, to at least make the cluster more dynamic?

[00:17:52.570] – Kyler
On the AWS side, we never did. It's just a T2 micro, so it's like $15 a month or something, so I didn't really care. On the Azure side, Sai G, my partner in crime at Veradigm, is an Azure expert, and he used scale sets to, every couple of hours or every day, rebuild the hosts, which is great. You're still susceptible to this problem if someone submits a malicious job right after a valid job, but it sort of minimizes the blast radius in a great way. So yeah, Packer-built images managed through scale sets.

[00:18:30.910] – Kyler
They're not built on demand, but we have enough jobs that flow through on the Azure general compute pool that it didn't make sense for us.

[00:18:38.630] – Ned
Okay.

[00:18:39.030] – Kyler
We did look at it later, using, like, CodeBuild or something to queue up a builder job, but that feels a little heavy for us, given how cheap the static infrastructure is.

[00:18:50.170] – Ned
Okay, now you mentioned that you're using Packer, and when you sort of posited the problem, my first thought immediately went to Packer. Like, if I was building builders, building virtual machines or virtual machine images that I wanted to run agent jobs on, I would build the image using Packer. Were there some downsides to using Packer that made you then go look at Docker instead?

[00:19:16.150] – Kyler
Only that it takes a long time to do its job. I think otherwise it's basically equivalent to Docker. There are architectural differences, but operationally it's kind of the same. However, building stuff with Packer takes a really long time. I think we had a Windows build that takes eight to ten hours, so it has to run overnight.

[00:19:36.770] – Ned
Wow.

[00:19:37.170] – Kyler
And if we want to, yeah, we customize the heck out of our stuff. But if we want to expose this process in a transparent way to all of our teams so they can start customizing and building their own stuff, that development loop of eight to ten hours to see if their change worked? Like, that's a whole workday. They would have to submit a change and then check tomorrow, whereas with Docker, it could be hours, but generally it's minutes to see whether their thing worked. And so that development cycle being that much shorter is a huge win for us.

[00:20:12.910] – Ethan
[AD] I interrupt this podcast conversation, and possibly myself, to explain who the heck sponsor Apstra is. In a nutshell: multi-vendor network automation plus continuous validation. And I stress multi-vendor, because if you've been paying attention to acquisition news, you know that Apstra was bought by Juniper a while back, so you might be thinking you don't care about Apstra unless you're a Juniper shop, and that is just not true. Apstra can handle data center network automation across a spectrum of vendors. So what do we mean by data center automation anyway?

[00:20:42.230] – Ethan
We mean that you design the DC network to meet some business requirements you have, and you do that within the Apstra interface. And let's say it's leaf-spine with EVPN. Apstra's got access to the network devices themselves, and it takes your intent to create that leaf-spine physical network with an EVPN overlay and configures it for you. I mean, Apstra can't plug the cables in for you, right? You still have to do that bit, but Apstra can tell you when the cabling is out of whack, whether that's during the day zero build-out phase or the day two,

[00:21:09.950] – Ethan
Hey, it looks like an optic failed phase. And that's sort of the point here. Cabling, routing relationships, device and link addressing, inter-switch links, VLANs, VTEP mapping, tons of these things, so many that you don't want to have to do that configuration yourself. It seems fun until you're actually building it, and then you realize it's totally not fun. You want software to stand up the data center fabric for you. Software is not going to fat-finger an address. Software is not going to forget to update BGP policy.

[00:21:39.740] – Ethan
Software... software loves you. Okay, not all software loves you.

[00:21:44.210] – Ethan
But Apstra software does, so much so that it not only helps get that fabric built, but keeps it built the way you intended. Something goes out of spec, Apstra will enforce your intent, which should help you reduce security vulnerabilities, by the way, and alert you to the bits that need a human's attention. Apstra claims up to 80% improvements in operational efficiency, 70% improvements in mean time to resolution, and 90% improvements in time to deliver. And that is a lot of love. Find out more at Juniper dot net slash Packet Pushers slash Apstra. If you're a data center network engineer, this is worth your investigation.

[00:22:21.690] – Ethan
Once more, that's Juniper dot net slash Packet Pushers slash Apstra. And if you talk to your Juniper rep about Apstra, make sure to tell them you heard about them on Packet Pushers. Juniper dot net slash Packet Pushers slash Apstra. And now back to the podcast. [/AD]

[00:22:39.010] – Ned
Now, one thing that you mentioned, and this is something I haven't really heard very often, is Windows and containers, like, in the same sentence. You're really, legitimately using Windows containers. So tell me what that's like.

[00:22:52.090] – Kyler
Absolutely. And what it is like is painful. So when we first started, we were so naive, and we just thought this must be a problem that's solved. The new Microsoft, the cloud-friendly, Docker-friendly, .NET-ported Microsoft. Surely they've put Windows on Docker. Kind of. It sort of runs. Our idea was that we would get rid of all of our thick, Packer-built hosts right away, like, really quickly replace them all with Docker, blaze the trail for everyone else. And on Windows, we are running into so many issues as we do that. I think we are still progressing, but it's very slow, where some of these tools, like .NET builds, require a certain Visual Studio Code to be installed, or a certain Visual Studio.

[00:23:44.770] – Kyler
And there aren't command line installs of those tools, because why would you have a command line install of an IDE? So we are just Frankensteining the heck out of it, and it's working, but it's very slow, and sometimes we need to look to third-party tools. But we have to be very careful and cautious around those, again, regulated environment. So if it's going to be touching any kind of health care data, we need to know the origin of all that code. So, slow and steady, but we're working towards it.

[00:24:16.560] – Kyler
I swear, Sai at some point is going to be on this same podcast and he’s going to talk about his great new Windows containers.

[00:24:22.630] – Ned
I will take him up on that, because I've legitimately been looking for a use case for Windows containers for quite some time. Because I'm like, why would you do it? If you're going to make the jump to containers, why not just use .NET Core and Linux and call it a day? But this is a legitimate use: I want to use containers as my build host, and some of my builds need Windows. So, okay. Wow. We found it. We found the one use case.

[00:24:51.790] – Kyler
We think it makes tons of sense in context, but operationally, it has been so painful to do. And even just Docker in general, we've run into so many issues, like our nesting doll problem, where we converted all our builders to Docker, we opened the champagne and we celebrated, and then we tried to build a Docker image on them. And Docker can't build Docker. And so we were immediately stymied. Like, oh, well, good job, you have blocked your build process. Totally. So we reached out, like, surely someone has solved this.

[00:25:27.090] – Kyler
I say that a lot, because I feel like maybe we are on the cutting edge. I don't think that we are, I don't frame myself that way, but we're running into issues that others haven't solved yet. So GitHub, Azure DevOps, they use Packer-built runners for all of their hosted running. And surely it would be a lot cheaper to use Docker, like, share the kernel. They're much more lightweight, they're quicker, but people have got to build Docker. So I think that's probably one of the major components, major reasons, that they have not converted to Docker build runners. So we're striving.
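
One common workaround for this "Docker can't build Docker" nesting doll problem is to hand the agent container the host's Docker socket, so `docker build` inside a job talks to the host daemon rather than trying to nest. A hedged sketch, where the image name, organization URL, and pool are placeholders (the AZP_* variables follow Microsoft's documented containerized-agent pattern, but verify against current docs):

```yaml
# Illustrative docker-compose.yml: run a build agent with the host's
# Docker socket mounted, so jobs can build images without nested Docker
services:
  azp-agent:
    image: myregistry.example/build-agent:latest  # hypothetical agent image
    environment:
      AZP_URL: https://dev.azure.com/myorg        # hypothetical organization
      AZP_TOKEN: ${AZP_TOKEN}                     # PAT supplied at runtime
      AZP_POOL: SelfHostedLinux                   # hypothetical pool name
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

The trade-off is security: any job that can reach the socket effectively has root on the host, which is a real concern in a regulated environment like the one described here.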

[00:26:03.520] – Ned
That gets into the next thing I was going to ask about, actually, which is your build process for your builders. Because it sounds like we've got two different things going here. We've got the process by which we create the builders, and that could be one pipeline, and then you have the actual pipelines that are running using those builders. So making the builders: where are the agents that are making the builders? It sounds like you have the nesting doll problem. So what does that process, that pipeline, look like that builds the builders?

[00:26:36.160] – Kyler
Totally. Great question. It was going to be the same. It was going to be the same host. We would have the agent build the next iteration and then deploy the new agent, and on paper, that's perfect. That looks great. But in reality, Docker can't build Docker. So we have been using the public builders to build the builder image, to build the Docker container image, and then deploy it privately to run all the rest of our jobs. And that's been okay. That's definitely sort of a liminal state where we don't want to live forever.

[00:27:08.020] – Kyler
So the primary reason for that is caching, which, when you're using these public builders, you have none of in terms of the Docker build process. So our builds take hours instead of minutes, on the Windows side at least.

[00:27:23.290] – Ned
Right.

[00:27:23.740] – Kyler
So we're thinking of using a Packer-built auto scaling group, going back to the beginning, going back to those Azure builder images. So it's all a loop, is the bottom line of this talk. It's all a loop.

[00:27:35.800] – Ned
Right. But that's a specific use case that you can handle with those Packer-built images, and then everything else can use the builders that are spawned by this pipeline. I didn't think about the caching problem, because when you said hosted instances, I was like, oh, it's great. You're just leveraging instances that are available and out there, and you can put everything in Azure Key Vault to keep all the secret stuff off of the host and just pass it through environment variables. And then you put the artifact in Azure Container Registry, or whatever, in a private registry.

[00:28:05.010] – Ned
So you don’t have to worry about that. Maybe I’m ticking some boxes that you already know, but I’m building this out in my head a little bit, but yeah, then you get into I need to pull a Windows image to build off of. And that thing is probably a gig and a half, at least.

[00:28:20.510] – Kyler
Yes, it's bulky. That Windows kernel is just bulky compared to Linux. It's funny how well the Linux side works. It's just widely used everywhere. Google and Netflix are just full of these containers that spin up millions or billions of times a day.

[00:28:38.940] – Ned
Right.

[00:28:39.520] – Kyler
But Windows images don't run like that. So it's really a stark difference. But yeah, we're using the public builders to build the container image, pushing it to, like, ACR or ECR. And on that note, we try to just have one image definition and push it to both places. So if we're having Ubuntu 20.04, we build it one time and we push it to Azure, we push it to AWS, and we use it in both places, to try to just bridge the gap between the clouds. I know that's a prediction that others have made before me.

[00:29:18.650] – Kyler
A major factor of a lot of new tech is Kubernetes clusters that span both clouds or overlay networks that make two clouds seem like one. We’re seeing a lot of that. So we’re trying to get in line and notice the change in weather and get ready for it.
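
The "one image definition, both clouds" flow described above can be sketched as a pipeline step that builds once and pushes the same artifact to both registries. The registry addresses, repository names, and account IDs below are placeholders, and the step assumes a prior `docker login` to each registry:

```yaml
# Illustrative fragment: tag one build for both ACR and ECR and push to each
steps:
  - script: |
      docker build -t builder:$(Build.BuildId) .
      docker tag builder:$(Build.BuildId) myacr.azurecr.io/builder:$(Build.BuildId)
      docker tag builder:$(Build.BuildId) 123456789012.dkr.ecr.us-east-1.amazonaws.com/builder:$(Build.BuildId)
      docker push myacr.azurecr.io/builder:$(Build.BuildId)
      docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/builder:$(Build.BuildId)
    displayName: Build once, push to ACR and ECR
```

Because both registries receive the exact same image, agents in Azure and AWS pull identical builders, which is the cloud-bridging point being made here.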

[00:29:35.110] – Ned
Got you. One thing that jumps out at me to deal with the caching issue is you could potentially mount an Azure file system on the hosted builder and then use that to have cached images already there. And that's the skunkworks thing we can talk about off mic.

[00:29:56.290] – Kyler
Totally. I attempted that, where you connect to the ACR and you download the image. But those images, particularly Windows, get to be really big, because it's not just the kernel. It's that Docker layering, as we install all the Visual Studios that are potentially needed, of which there are several. It gets to be like five to ten gigs and takes an hour to replicate from ACR. So I tried it, and it took like five minutes less and had way more complexity. So I just dropped it. I'll just wait five more minutes for simplicity.

[00:30:27.500] – Ned
Yeah, absolutely.

[00:30:28.480] – Ned
Sometimes it’s not worth it. The juice isn’t worth the squeeze, as my friend Bobby likes to say. Okay, so let’s walk through a typical pipeline when you want to create one of those builder images. What’s the trigger that kicks off a new image build? Do you have it just running daily, or is it more of a GitOps style, where I did a commit or a push or a PR?

[00:30:50.650] – Kyler
And the answer is yes, because I find both of those to be very valuable. So we have Dockerfiles and the startup entry point script checked into our CI, which is Azure DevOps, and we ask that anyone, ourselves included, go through that. We made this very transparent and easy and documented, because we want all of our teams to say, I need a new tool, and we say, well, go install it. Here you go. Here’s your fishing pole. Figure it out.

[00:31:18.200] – Ned
Right.

[00:31:19.330] – Kyler
So we check in all those files. We have PRs against them. And when a PR is done, we automatically test it by building the image. We would love to do more testing in the future, like a demo pipeline that queues a job and sees if it runs. I’d love that; none of that is in place today. It’s just, if the image builds with Docker, we call it a thumbs up, and we get some approvals, some human approvals. I like to call out that the robots approved it, because Docker runs.

[00:31:49.340] – Kyler
Now we need some humans to say this actually makes sense.
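The merge gate described here, where the robot check is a successful image build and humans then sign off, boils down to a tiny predicate. A hedged sketch (the two-approval threshold is my assumption, not stated in the episode):

```python
# Sketch of the PR merge gate: merge only when the image builds ("the robots
# approved it") AND enough humans say it actually makes sense.
# The required approval count is an assumed policy, not from the show.

def pr_merge_ready(docker_build_ok, human_approvals, required_approvals=2):
    """True only when the automated build passed and humans approved."""
    return docker_build_ok and human_approvals >= required_approvals

print(pr_merge_ready(True, 2))   # built and approved: ready to merge
print(pr_merge_ready(False, 5))  # approvals don't matter if the build failed
```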

[00:31:52.120] – Ned
Right?

[00:31:52.670] – Kyler
And then we merge the PR, and that triggers a deployment, which pushes it out to the ACR and the ECR. And because these containers are very short lived, in the sense that they only run a single job, they’re rapidly switched out. We don’t do any kind of... I think on Azure, we have AKS do some blue-green pod herding: when there’s a new image, you spin down the old stuff. On the AWS side, I just destroy the host after a job. So if your job doesn’t work and you need a new thing, just run it twice.

[00:32:24.550] – Ned
Makes sense. How are you tagging those images after you do the merge of the PR? Are you just using latest, or do you have a semantic versioning scheme you’re using for everything?

[00:32:35.230] – Kyler
We are just doing latest, which I know has some people cringing. I think this is, again, kind of a liminal, iterative state where we’ll eventually get to semantic versioning and start using it. But right now, and maybe this is a CI/CD problem, I don’t see the benefit of doing semantic versioning, given that we’re tracking all our changes in the CI. So if anything breaks, I can just roll back in the CI. So I’m sure there are some varied opinions here. I wish we could take calls, because someone could call in and argue.

[00:33:11.050] – Ned
Yeah, we should redo this as a Twitter Spaces and just have people drop in with questions. I’ve been looking for a use for Twitter Spaces, too. Now, Windows containers... I’ve got some ideas now.

[00:33:21.800] – Kyler
That would be great.

[00:33:22.920] – Ned
I can see it where, if someone knows they’re using your builders for their app run and they found a version that works, they just want to stay on that version until the next major rev comes out. You could have some minor and major versions, and they could stay on the major version until you rev that, and they could use tags to do that. So I guess that’s one potential use case, but that’s a pretty advanced use case, where this whole system has been in place for a while and an application team is like, oh, we really like that one build.

[00:33:52.790] – Ned
It seems to run fast.
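Ned’s pinning idea maps to a common multi-tag pattern: publish each build under its full version plus its major and minor prefixes, so app teams can pin as loosely or tightly as they like. A small sketch of that expansion (an editorial illustration, not something the show describes as implemented):

```python
# Sketch: expand one semantic version into the set of tags to publish.
# A team pinned to "1" keeps getting minor and patch revs automatically
# until the major version is bumped.

def image_tags(version):
    """'1.4.2' -> ['1', '1.4', '1.4.2', 'latest']."""
    parts = version.split(".")
    tags = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
    tags.append("latest")  # keep latest for teams that just want newest
    return tags

print(image_tags("1.4.2"))
```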

[00:33:54.280] – Kyler
Absolutely. Thirty seconds in, and I think you’ve converted me. That’s a great use case. Absolutely.

[00:33:59.590] – Ned
Wow. Look at that. All right.

[00:34:02.890] – Ned
I think I have a pretty good idea of the building-the-builders pipeline. Your talk has Terraform in it, and so far we’ve been talking about Docker, we’ve been talking about Azure DevOps, but we haven’t been talking about Terraform at all. How is Terraform being used in this whole process?

[00:34:18.550] – Kyler
For sure. So I am an AWS guru; I think, like you as well, I love my AWS. So in order to deploy the ECR registry where we store the image, the ECS that runs it as a service, as well as all of the IAM policy, and there is a significant amount, we use Terraform. I built generalized Terraform modules that do all of it. I think when I gave my talk, I had shortened 250 lines of configuration down to like 17 lines when you call the module. So much simpler for when people call it.

[00:34:58.620] – Kyler
Right. A lot of that security is just because it’s multi-account. Our environment, like a lot of AWS environments, scales horizontally rather than vertically. So we have many AWS accounts and they’re all little islands, which is very different from Azure. I’m having trouble learning Azure because it’s implicitly federated. The security just works.

[00:35:20.650] – Ned
Right.

[00:35:21.830] – Kyler
So we store the container image, as well as the secret, as well as the CMK that encrypts the secret, in just a single account, to minimize replication problems and things like that. And then we just sort of empower, through IAM, all of the other accounts to talk to it and grab what they need, very specifically. So tons of very specific IAM policies to get it done. Lots of iteration went into that. I have it working; it’s on GitHub if you all want to copy my good work.
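The single-account image store with pull rights granted outward is typically expressed as an ECR repository policy. Here is a rough Python sketch of what that policy document looks like; the account IDs are placeholders, and Kyler’s GitHub repo (linked above) has the real Terraform rather than this hand-rolled JSON.

```python
import json

def ecr_cross_account_pull_policy(consumer_accounts):
    """Build a pull-only ECR repository policy for a list of AWS account IDs.
    The actions listed are the minimal set needed to docker pull from ECR."""
    statement = {
        "Sid": "CrossAccountPull",
        "Effect": "Allow",
        "Principal": {
            "AWS": [f"arn:aws:iam::{acct}:root" for acct in consumer_accounts]
        },
        "Action": [
            "ecr:GetDownloadUrlForLayer",
            "ecr:BatchGetImage",
            "ecr:BatchCheckLayerAvailability",
        ],
    }
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2)

# Two hypothetical consumer accounts pulling from the central registry account.
print(ecr_cross_account_pull_policy(["222222222222", "333333333333"]))
```

Keeping the grant pull-only is the point: the island accounts can grab the image, but only the central account can push.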

[00:35:53.900] – Ned
Absolutely. I might take you up on that, because IAM policies and roles are the bane of my existence. And I think...

[00:36:00.670] – Kyler
Absolutely

[00:36:01.510] – Ned
That might be true of a lot of AWS people. Just like, oh, God, I know it’s IAM and I don’t know how.

[00:36:07.970] – Kyler
I think it’s Windows-registry-esque: everything runs on it, but it is just a bear to dig into and understand. Yeah, I’m part of a separate group that’s starting up called IAM Pulse, which is spinning out of Okta. Their whole focus is just writing IAM policies and helping provide templates, and they found a whole business niche of just helping people lower the bar there, because it is painful sometimes.

[00:36:31.960] – Ned
Absolutely. I’m surprised you’re not using Terraform to build the Azure DevOps project portion as well.

[00:36:37.734] – Kyler
No.

[00:36:40.230] – Ned
Because they do have a provider.

[00:36:41.070] – Kyler
All that’s by hand. I would love to do that. I can’t believe it. Did it work well? Is it maturing?

[00:36:45.550] – Ned
I have a module on GitHub that does exactly that: creates a project, and will set up variable groups and service connections and all that. I’ll share that with you after.

[00:36:57.550] – Kyler
Oh, that sounds great.

[00:37:01.130] – Ned
You know, if I can Terraform it, I absolutely will. And that was a good example.

[00:37:05.440] – Kyler
I can’t wait.

[00:37:06.830] – Ned
Let’s move over to the run side of things because we talked so far about building the builders, but then you need to use those builders. So where are those builders running? And how are folks taking advantage of them in their pipeline?

[00:37:21.290] – Kyler
Totally. When I first started learning about Docker, I assumed it worked a lot like VMware, where if you have an image, you just sort of download it and then you run it, and the build and the run are sort of linked as one thing. But in Dockerland, those are totally separate. The build stage is step one, totally separate from the run stage of spinning up the image and getting it going. So on our run side, it’s different based on cloud, just based on the technology we’ve chosen. On the Azure side, we use AKS, which allows some really cool, like, intelligent blue-green provisioning.

[00:37:59.760] – Kyler
And when there’s a new image, we spin down the old image. Really cool stuff. But the stuff that I built, that I love to wax poetic about, is the AWS side, where, like I said, everything is stored in one account. IAM policies permit all the other accounts to grab the things, and I have it deployed with ECS. I get asked a lot why I didn’t use Kubernetes on the AWS side. I just don’t really understand it very well yet, but it’s on my list, my very long list, and ECS does everything that I need, in terms of just: it goes and gets the image.

[00:38:34.850] – Kyler
It downloads the secret, and it spins up a couple of hosts. It also supports autoscale targeting, which is cool. I sort of hacked together autoscale targeting to spin down the pools overnight. So we do kill the container after every job, but say it’s a pool that sits for a month. I don’t want to have it be out of date because it hasn’t spun down. So I zero out all our container pools at night for five whole minutes, and then I spin them back up, because when they spin up, they grab the new version.

[00:39:06.490] – Kyler
So it’s a very simple hacky solution that works great.
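That nightly zero-out reduces to a tiny scheduling function. A hedged sketch: the exact window (02:00 for five minutes) and the baseline pool size of two are my assumptions, not numbers from the episode.

```python
from datetime import time

def desired_builder_count(now, baseline=2):
    """Return how many builder containers a pool should run at a given time.
    Zero the pool for a five-minute overnight window; when the containers
    come back up, they pull the newest image."""
    window_start, window_end = time(2, 0), time(2, 5)  # assumed window
    return 0 if window_start <= now < window_end else baseline

print(desired_builder_count(time(2, 3)))   # inside the window: pool drains to 0
print(desired_builder_count(time(14, 0)))  # rest of the day: baseline pool
```

In AWS terms this would be a pair of scheduled scaling actions on the ECS service, one to zero and one back to baseline.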

[00:39:11.990] – Ned
Simple hacky solutions. There’s absolutely nothing wrong with that. I’m all about it. Now, do you have dedicated pools of builders for specific pipelines? So if an app team comes along and they’re like, we want to build a pipeline for our application, do you give them their own dedicated pool, or is it more of a shared resource pool situation?

[00:39:32.630] – Kyler
Again, the answer is yes, because we are trying to be everything for everyone. So we do have this general compute pool, where if anyone wants to update it, and it’s universally compatible, which is strikingly few changes, they can go find the Dockerfile and update it with a pull request and deploy it. But if an app team needs something that is specific to them, like some teams are pinned to old versions of Java or Maven or Gradle, they can check in a Dockerfile. And we have built our infrastructure in a horizontally scalable way, where we can deploy images, deploy containers just for them, in a pool that’s specific just for them.

[00:40:12.950] – Kyler
So if they want older software, or specific, really heavy software, it can just wait around for them, which we’ve used a couple of times already. That’s the goal: we have the Gold Master pool that’s ready for you, but if they have a specific need, we can do that too. Just give us a Dockerfile and we’ll do the rest.

[00:40:34.160] – Ned
Okay. How are you scaling and creating these builder agents? Is there something in the pipeline that goes out and says, all right, go spin up a builder instance before it does anything else, so it knows it can handle it. Or do you have a pool that’s always on and ready to go? Which sounds kind of wasteful. I think that’s what you’re trying to get away from.

[00:40:54.120] – Kyler
Yeah, and right now we’re doing the wasteful one, because they’re kind of cheap. We looked, and oh, I do remember the number now: it’s $80 per account, per month, to keep that thing running. Changing it would probably take 12 hours of engineering, which would cost a couple of thousand dollars. So I would love to evolve it to a state where it is on-demand instances spun up, and that way it would always be specifically the correct and newest version of our builder, because it’s built on demand.

[00:41:25.310] – Kyler
And I think I could probably do that with CodeBuild pretty easily, but it sounds hacky when I describe it, because I think it would be a little hacky. So you would run the first stage of your pipeline on the public builder, because it’s always available; it’s available on demand from Microsoft or whomever. And it would send a POST or GET request to the CodeBuild URL to say, spin up one host, and then that host would register in your pipeline to run stage two.

[00:41:53.220] – Kyler
That would actually be the job. I’m sure that would work, but it just sounds so hacky. Yeah, I just haven’t pursued it yet, but I’m stewing, I’m getting there.
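The two-stage trigger Kyler is stewing on could be sketched as: stage one, on an always-available public builder, fires a request at an endpoint fronting CodeBuild, and the freshly spun-up host registers as the agent for stage two. Everything below, including the URL and field names, is invented for illustration; it is not a real AWS or Azure DevOps API.

```python
# Hypothetical sketch of stage one's "go spin up one builder" request.
# A real implementation would front AWS CodeBuild (or similar) with an
# endpoint shaped something like this; nothing here is a real API.

def builder_spinup_request(project, agent_pool, host_count=1):
    """Compose (but do not send) the request stage one would fire off."""
    return {
        "method": "POST",
        "url": f"https://builders.example.internal/{project}/start",  # placeholder
        "json": {"hostCount": host_count, "agentPool": agent_pool},
    }

req = builder_spinup_request("builder-images", "general-compute")
print(req["method"], req["url"])
```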

[00:42:02.590] – Ned
Oh, dear. All right. I’m putting ideas in your head. I’m so sorry.

[00:42:07.190] – Ned
It sounds like you’ve already automated a lot, and you’ve optimized a lot for cost, and you’ve addressed some of the major issues that you found. Which was the security problem? Well, you don’t have that security problem anymore, because it’s a fresh build every time. And then conflicting versions of things that the dev teams are going to install when it spins up? Nope, because they can install whatever they want, because it’s going away in a second, and you don’t have that long-lived instance, and you’re not using Packer anymore.

[00:42:38.030] – Ned
You’ve moved over to Docker. I think I already planted some seeds for you, but what other things do you have in your mind going forward that you’d like to see in this project?

[00:42:50.150] – Kyler
Absolutely. I think there’s two major things. One is automated testing. When someone makes a change to a Dockerfile or proposes a change in a PR, we’re just making sure Docker builds. You can do some terrible stuff that will destroy everything and still have the Docker build succeed. So I would love to have it tested. And that’s kind of true of all of our automation. I want to see a lot more computerized validations built into everything we do: unit testing and the semantic version testing and just anything we can.

[00:43:26.330] – Kyler
So that’s a major one. And another thing that we’d like to do is build-depth scaling. So when we have 50 jobs posted all at once, I would love for our CI/CD to see that and spin up a bunch more builders to manage it. Because Docker does spin up quickly, but it’s still maybe a minute or two for another host to spin up. So I think generally we just run two or three containers in every pool in every account. And that’s enough if you queue a couple of jobs, but because we’re solving for everybody’s problems that we’ll ever see, we need to make sure we can handle more.

[00:44:06.040] – Kyler
So some of our teams, like our big data team, work with a tremendous amount of data and sometimes have to run, like, R statistics, things that just need a ton of builders running in parallel. So if we needed 100 jobs queued and run in five minutes, could we do it? And right now the answer is no. So I’d love to get there and solve that problem before it arises. Before someone asks me, I want to be able to say yes. Got you.
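The depth scaling Kyler wants reduces to a small policy function: enough builders to drain the queue, never below a warm floor, capped by a ceiling for cost. A hedged sketch; the specific numbers are illustrative assumptions, not the team's real settings.

```python
import math

def builders_needed(queued_jobs, jobs_per_builder=1, floor=2, ceiling=20):
    """Pick a pool size from the current queue depth: scale out with demand,
    keep a warm floor for instant pickup, and cap cost with a ceiling."""
    wanted = math.ceil(queued_jobs / jobs_per_builder)
    return max(floor, min(wanted, ceiling))

print(builders_needed(0))    # idle: warm floor of 2
print(builders_needed(7))    # moderate queue: one builder per job
print(builders_needed(50))   # burst of 50 jobs: capped at the ceiling
```

Wired to whatever reports queue depth in the CI, this is essentially the scale-out-then-scale-back-in loop Ned describes next.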

[00:44:34.310] – Ned
Okay. That makes a lot of sense to me. And there’s definitely ways that you could do it with Kubernetes. You could have something that’s sort of listening and looking at that build queue, and when it sees the build queue hit a certain threshold, it would automatically scale out the number of instances or pods that are running to handle the build process, and then potentially scale back in. So there’s definitely some things you can do there. Maybe with KEDA? I’m not sure, but now I’m just throwing words out and...

[00:45:03.120] – Kyler
I love it. That’s great.

[00:45:06.510] – Ned
Well, this is fascinating and we can probably talk about this for another 2 hours. But sadly, we are coming towards the end of the ride. So if you had a few key takeaways for folks out there, some things that they could maybe put into action or chew on as they’re wrapping up the episode. What are some big takeaways for you?

[00:45:25.750] – Kyler
Sure. Big stuff is: your CI/CD should be transparent and auditable. There shouldn’t be a dependency on that one really smart engineer that knows how everything works, the Brent of The Phoenix Project. You should embed that knowledge, procedurally, process-wise, engineering-wise, into your tooling, so that anyone can access it. That democratizes everything; that empowers your whole team. It’s a huge thing that I try to advocate for that I don’t see very often. So I will talk about that all day. Docker, Terraform, and other tools that convert sort of human-readable language into infrastructure and into processes are just super powerful.

[00:46:08.550] – Kyler
And if you’re not using them, you should be. And if you are using them, good for you; keep developing, keep advocating, because we need to see it. It helps people learn, and it helps more people get into these industries and do really well. And number three is just standard technologies. Like you’ll see Docker, you’ll see overlay networking start to become much more prevalent, syncing and tying clouds together. Look towards those technologies, because that is what I see coming down the pipe and what I think will become very influential in the next five to ten years.

[00:46:43.160] – Ned
Awesome. We will include links to your presentation and a bunch of other things that you’ve thrown in the show notes. We’re definitely going to include those if listeners want to know more. Are you a social person, Kyler? Is there somewhere people can follow you either on Twitter or LinkedIn?

[00:46:58.950] – Kyler
Absolutely. So I am on LinkedIn. I’m very active. Just Kyler Middleton; there’s not very many Kyler Middletons in the world, with the K like kangaroo, so you can find me. I’m also kymidd on Twitter and on Medium. Easy to find. And kylermiddleton dot com; check it out.

[00:47:19.260] – Ned
Excellent. Thank you so much, Kyler, for being a guest today on Day Two Cloud. And hey, listener out there, virtual high fives to you for tuning in. If you have suggestions for future shows, we would love to hear them. If you built some crazy contraption or have a post that has way too many words in it, we want to hear about that post and maybe talk to you. So hit us up on Twitter. It’s at Day Two Cloud Show. Or you can fill out the form on my fancy website, nedinthecloud dot com.

[00:47:46.730] – Ned
A bit of housekeeping here. Did you know that Packet Pushers has a weekly newsletter? It’s called Human Infrastructure Magazine and it is loaded with the best stuff we found on the Internet plus our own feature articles and commentary. It is free and it does not suck. You can get the next issue at Packet Pushers dot Net slash newsletter. Until next time. Just remember, Cloud is what happens while IT is making other plans.

Episode 128