
Day Two Cloud 127: Avoiding Infrastructure As Code (IaC) Pitfalls

Episode 127

There are a lot of good things you can do with Infrastructure as Code (IaC) for automation, repeatability, and ease of operations and development. But there are also code and infrastructure pitfalls where you can tumble into a hole, break your leg, and get eaten by spiders.

OK, maybe not that bad, but on today’s episode we talk about potential IaC pitfalls and how to avoid them.

Our guest is Tim Davis, DevOps Advocate for env0.

We discuss:

  • Criteria for picking the right IaC framework and tools
  • Cultural vs. technical pitfalls
  • Handling a diverse set of targets
  • Rollback and your framework choice
  • Security ownership in IaC
  • Why testing and more testing are essential
  • More

Show Links:

@vTimD – Tim Davis on Twitter



[00:00:04.310] – Ethan
Welcome to Day Two Cloud. Oh, baby, infrastructure as code, Ned. Infrastructure as code, and the pitfalls thereof, is our topic today. Our guest is Tim Davis. Tim had given a talk at a conference we’ll mention along the way and was really rushed at that conference trying to share his wisdom, and we were able to get him on the show to dive more into the pitfalls of infrastructure as code, which makes it sound like a negative topic. But it wasn’t really a negative topic, was it, Ned?

[00:00:32.310] – Ned
No, very positive, because there are a lot of awesome things you can do with IaC, assuming that you approach it correctly and you try to avoid those pitfalls. So we had a great conversation about the people, the process, and even the tools you might use to implement infrastructure as code.

[00:00:48.870] – Ethan
Yes. And lest you think infrastructure as code is one of those things the other people are doing: no, no, we’re going to hit this right at the top. Infrastructure as code is for everyone. Enjoy this conversation with Tim Davis. Tim Davis, welcome to Day Two Cloud today. It is a pleasure to have you, sir, after hearing your talk at All Day DevOps. For the folks that don’t know you, in just a sentence, who are you and what do you do?

[00:01:12.810] – Tim
I am Tim Davis. I am the DevOps advocate with env0.

[00:01:16.950] – Ethan
Okay, now your talk that I was referring to is the pitfalls of infrastructure as code. Well, let’s start with a little bit of just generic IaC talk. Who is infrastructure as code for, Tim?

[00:01:31.590] – Tim
Anybody at this point. So this is kind of an interesting space that I’ve seen. There are so many different personas, if you will, that are using it. You’ve got the infrastructure folks that are trying to make their lives a little bit easier. You’ve got development folks in smaller shops that are having to handle the infrastructure themselves. It really kind of goes across the board.

[00:01:53.730] – Ethan
Okay, everyone. That is fair. Especially when we get into your talk, which addressed, notably, pitfalls, which really grabs people’s attention. Wait a minute, is IaC bad? But really, you’re just getting into the practice of IaC and the mistakes you could make that you shouldn’t make. Well, okay, in that talk, you mentioned that IaC has the pitfalls of infrastructure and code. What are you getting at, Tim? It sounds so serious, Tim. Why so serious, Tim?

[00:02:25.410] – Tim
Yeah, it’s just one of those things when you’re thinking about what kind of problems you can run into with, really, any kind of project. There’s lots of things that come along with lots of different topics. And with infrastructure as code, you kind of run into a combined set of issues. All the different things that come along with designing and deploying infrastructure are there. And a lot of the different issues that come along with designing and working with code are there. So if you’re just coding.

[00:02:52.150] – Ned

[00:02:52.540] – Tim
If you’re just doing infrastructure.

[00:02:53.990] – Ned

[00:02:54.350] – Tim
But this kind of brings all of those things together, and you can end up with a lot more gotchas, issues, pitfalls, what have you, than you think you might.

[00:03:00.550] – Ned
When I think about the way that infrastructure and Ops folks approach things, we’re a little more cautious, right? Because we know that making decisions can have catastrophic consequences. So we don’t want that to happen. And then I think of the development perspective and they’re like, move fast, break things. It’s all good. I can fix it in code. I can push out a new release.

[00:03:23.410] – Tim
It’s going to be a problem.

[00:03:24.770] – Ned
We can do blue-green deployments. It’ll be great. How do you marry those two worlds where I’m like, whoa, slow down there, buddy, and they’re like, I want to go fast? Yes.

[00:03:35.630] – Tim
And this is one of those issues that everybody’s been trying to solve for the longest time. My joke is that I bring the Ops to DevOps because I come from the infrastructure engineering side of the house and the operations side. But really, it’s about fostering that communication between these two groups. Just because you’re doing infrastructure as code doesn’t mean you’re giving the infrastructure job to the developers, or you’re taking away the development job and giving it to the infrastructure people. It still takes both of those teams. It’s a collaborative effort to be truly successful and make sure that not only are you doing it right and looking out for those issues from an Ops perspective, but that you’re moving fast and everything like that from the Dev perspective.

[00:04:15.660] – Ethan
Wait, wait. Don’t I just hire a DevOps engineer, Tim, that marries the best of both worlds?

[00:04:25.920] – Tim
For so many years, I kept saying that the job of DevOps engineer doesn’t exist. That’s not a job title. But these days it really is. I’ve seen DevOps engineers and SREs that used to be Dev or used to be Ops, and it’s one of those things where it does exist. But still, it kind of takes the expertise and knowledge of both together to really make sure that you’re mitigating any issues and being successful.

[00:04:51.960] – Ned
Yeah. I want to zoom in on that a little bit because I think we got this idea in our heads. Everybody’s talking about shift left, push it all to the left, which, in some people’s minds, means push it all to the developer. So it’s the developer’s problem to stand up infrastructure and secure things properly and have 20 years of Ops experience without ever having touched a data center. What is your perspective on how that actually should be approached?

[00:05:18.650] – Tim
Yeah. So it’s one of those things: just because you’re moving certain steps in the process closer into the development process, and you’re using those development methodologies and development tools, like when you’re doing security as code, policy as code, performance testing and stuff like that, and just because you’re integrating these into, like, a CI/CD pipeline and using GitOps methodologies and things like that, that doesn’t mean that you’re getting rid of people’s jobs. It just means you’re giving them a new tool set or a new process for doing the tasks that they’re supposed to be doing.

[00:05:54.130] – Tim
And again, it’s just one of those things where a lot of folks think that just because you’re doing that, you’re going to take away their job and give it to developers. But developers aren’t security folks. They’re not networking folks. They’re not infrastructure folks. That doesn’t mean they couldn’t do it, or that they’re not smart enough for it or able to handle it. But that’s not their job. They’re developers. So it’s about making sure that everybody knows that DevOps isn’t just tooling. It’s people and process. You still need all of this stuff to work together.

[00:06:20.760] – Tim
You’re just doing it in a faster, more cohesive process.

[00:06:27.070] – Ned
Right. So when we’re talking about the pitfalls of IAC, I’m sure there’s some technical challenges in there, but would you categorize a lot of them as culture and communication challenges?

[00:06:38.350] – Tim
There’s a lot of that. There really is. From a development perspective, having to sit back and do a little more design before you actually start putting your hands on the keyboard and writing code, that’s not something that they generally do. They just start writing stuff, see what breaks, recompile it and do the thing. Infrastructure folks are used to doing design and stuff like that. But it takes time, which they now don’t have. They need to speed up a little bit. It’s a lot of people and process.

[00:07:06.940] – Tim
DevOps has always been people and process. You can’t just shove a CI/CD pipeline in there and call it DevOps. You have to fix the head and the people along with the tooling and everything like that. It’s just as important, if not more important.

[00:07:23.110] – Ethan
Well, let’s pivot the conversation to tools. Tim, this is a huge field, a lot of specialties, a lot of niches you can get into. How do I pick the right framework for my infrastructure as code adventure? I know you’re going to say it depends, right? But ultimately you’re going to get to Terraform, so just tell us why your answer is Terraform, Tim.

[00:07:45.890] – Tim
The answer is Terraform simply because it’s the de facto industry standard. Community support is there. The providers are there. I hate saying that, but that’s just kind of how it is. But also, Pulumi is awesome, and it’s doing really cool things, and they’re really starting to catch up very quickly. Choosing a framework is by far the first and most important step of the whole thing. Making sure that you’re doing the right thing: do you choose a multi-cloud or cloud-agnostic framework, like your Terraforms, your Pulumis, your Terragrunts, or do you choose a cloud-specific framework?

[00:08:21.610] – Tim
Like if you’re all in on AWS, you probably are using or have already used CloudFormation in some way. If you’re on Azure, you’re using ARM templates in some way, I’m sure. But it’s one of those things where: are you always going to be in AWS? I kind of touched on this in the talk a little bit. I’ve had customers say, we’re in AWS, we’ll never not be in AWS. It’s like, okay, what happens when you’re acquired 18 months from now and they’re an Azure company?

[00:08:46.010] – Ethan

[00:08:46.650] – Tim
“It’s not going to happen.” You don’t know that. You never know that. If you want to use a cloud-specific framework and it’s working for you, great, cool. That just may cause a little bit more work for you if, for some reason, you have to port that somewhere else. Now again, the caveat with that is if you choose something like a Terraform that is kind of cloud agnostic, if you will, you can’t just take your AWS stuff and deploy it over to Azure.

[00:09:10.920] – Tim
It doesn’t work that way, but it will give you a little bit better starting point to kind of convert the provider and do the thing.

[00:09:17.290] – Ethan
But even if I am in AWS and I have that mindset you were relating, I’m never not going to be in AWS, are there still advantages to using Terraform over CloudFormation?

[00:09:27.290] – Tim
Yeah, absolutely. Because it’s not just the cloud resources that you’re deploying. What if you have some SaaS tool sets and stuff like that that you’re working with? There’s a lot of Terraform providers for that. It allows you to do a whole lot more than just deploy instances to the cloud and things like that. So it kind of gives you a little more flexibility in your overall job, not just deploying that one thing over and over again.

[00:09:52.310] – Ned
Right. If I could jump in there with an actual example from my life of exactly that: I was doing a deployment for Google Cloud, and I wanted to use a MongoDB database, and at the time, Google Cloud did not have a native MongoDB-compatible solution, I believe. So what I ended up doing was using a SaaS provider that creates Mongo databases for you in your cloud of choice. But I could still use Terraform to deploy the entire thing and just use the proper providers for each component. And yeah, maybe if Google supported it down the line, I could just swap that out.

[00:10:27.740] – Ned
But I would still be using the same tool set and the same syntax and all the stuff that I’ve learned about using Terraform.

[00:10:35.130] – Tim
And it’s one of those things where it’s kind of a gray area. So I mean, with DevOps, you kind of get the ability to break away from that old-school enterprise methodology of buy the giant suite that does all the things because it’s easier that way. With DevOps, you can string together all of the best-of-breed little tools with duct tape and baling wire and make it work best for you. But sometimes you don’t have to choose the other tool just because it’s available. Like for a framework, why not use the same framework for multiple things?

[00:11:04.720] – Tim
Even if it’s not exactly the best, it may save you a little time in managing that code or managing that process versus just using that tool that might be marginally better?

[00:11:15.890] – Ned
Well, one of the things that I think people really liked about these enterprise-spanning tool sets was the fact that it was all the same vendor. You had one throat to choke, one hand to shake, however you want it. It’s usually more on the choke side. Knowing that you have that one vendor you’re working with usually makes things simpler from a licensing perspective. And there’s going to be some level of consistency between the tooling. Is that something that you’re considering when you’re selecting tools? Is that consistency across tooling?

[00:11:51.050] – Tim
Yes and no. So it’s just one of those things where: is the consistency across tooling worth any value or time that you save, or anything like that, versus using two different tools together to do the same thing? So you just kind of have to weigh what’s better. Is it easier for me to use this one tool even though I can’t do this one or two different things, or I have to do them a little bit differently? At the end of the day, are you able to do what you need to do?

[00:12:20.170] – Tim
Are you able to manage it well? Are you able to do it quickly? If so, then great. Use what you’re using. So again, it depends.

[00:12:33.050] – Ned
If I’m getting started in this world and I’m working in an organization, I need to ask some questions about the organization itself and not necessarily just pick the most popular tool or the thing that I heard about in an All Day DevOps talk.

[00:12:47.090] – Tim
Exactly right. That goes back to that people and process thing. Just because the tooling is better doesn’t mean that the people and process are where they need to be in order to adopt and utilize that tool. If you’re at an organization that is a giant enterprise, they are used to using giant enterprise vendors, and they’ve got all these license agreements and contracts and what have you. You’re probably going to have a harder time adopting that tiny little open source tool versus going to this one company and getting it folded into your ELA.

[00:13:23.550] – Tim
Yeah. You definitely need to kind of figure out where you’re at, where the organization is at, where the people and process are at, before you can start figuring out what the tooling is that you can utilize for that.

[00:13:32.970] – Ned
Right. There’s a benefit and cost analysis you got to do. Is the benefit of the small open source tool worth the effort I’m going to have to put in to get some kind of adoption in my large enterprise.

[00:13:44.770] – Tim
Right. And also is it supportable enough for an organization of this size?

[00:13:51.150] – Ethan
So, Tim, talk to me about my hybrid cloud world and choosing a framework. That is, I’ve got Linux hosts on-prem, let’s say, and some mix of public cloud, et cetera. Does that mean I end up with a mix of tools, more like a best-of-breed approach, or is it better if I really focus on a single framework that’s going to make me operationally sane?

[00:14:12.570] – Tim
At that point in time, I would stick with the single framework, simply because that way you only have one little tool set to manage, as opposed to, I have this tool over here for this thing, I have this tool over here for this thing. Because then you end up where you’ve got one guy or one gal, or one person, I’m sorry, on that team that manages that one thing. And what happens if they move on, move up, and do the thing? You’re left without that person.

[00:14:38.870] – Tim
If you have a whole team that’s managing this one framework across all of this hybrid infrastructure, it’s a little bit easier for you to kind of transition people around, move things, deal with loss. So I think it’s better to go agnostic in something like that. If you are trying to mix a bunch of different sets.

[00:14:59.590] – Ethan
Though, it’s almost always a trade-off where there’s some shortcoming in that one-size-fits-all tool. Are we at a place with Terraform, and you mentioned Pulumi, where it’s going to be good enough for most of my use cases?

[00:15:13.340] – Tim
See, and that’s the thing with Terraform and Pulumi being open source and with how strong their communities are. Generally, if that little feature is missing, somebody is immediately going to file an issue on GitHub and say, hey, this is missing, or they have the ability to go in themselves and open up a pull request on that provider to make that update. So while there are some shortcomings sometimes with that, usually with very well-used community frameworks like that, that issue is going to be taken care of and fixed, unless it’s some tiny little obscure feature.

[00:15:48.100] – Tim
And you’re one of the only, like, 20 people that use it, at which point, if that feature is that important to you, maybe you use the specific framework for that.

[00:15:59.570] – Ethan
Tim, another feature that maybe helps me pick a framework, or maybe it doesn’t, actually, and that’s the question, is rollback. If I want to roll back the thing that all just blew up and makes me sad, does my choice of framework get impacted? Is version control a major thing I’m looking for, or is that basically a table-stakes feature, and it kind of doesn’t matter which framework I pick?

[00:16:20.030] – Tim
A lot of times with the frameworks these days, it’s not the actual framework itself for the rollback, it’s your process and methodology for deploying said framework. I mean, if you’re doing it on your laptop, cool. If you’re using some kind of automation, like GitHub Actions or a CI/CD pipeline or something like that, with more of a GitOps methodology, you may have a little bit easier time just going back through and reverting that commit. So really, it’s not necessarily the framework as much as the process around the framework.

[00:16:50.110] – Ned
Interesting, I would even argue that there is no such thing as a true rollback. You can only roll forward. Right. And I think Git supports that you don’t really reverse a commit. You just add a new commit that takes you back to where you were, right?

[00:17:05.950] – Tim
Yeah. In your head you’re thinking, oh, I’m reversing that commit, but really, you’re just opening up a new PR and you’re merging the next numbered PR that just happens to revert that change. It’s kind of splitting hairs in your head over process. Technically, it’s kind of a rollback, but as far as the code and the deployment and the framework are concerned, you’re just deploying a new change, right?
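
The roll-forward idea Ned and Tim describe can be illustrated with a toy model. This is a minimal Python sketch, not Git itself: the `History` class and its `revert_last` method are hypothetical names, and each commit here stores the whole desired state rather than a diff. What it shows is that a revert adds a new commit on top of the history instead of removing the bad one.

```python
from dataclasses import dataclass, field


@dataclass
class History:
    """Toy append-only history; each commit stores the full desired state."""
    commits: list = field(default_factory=list)

    def commit(self, message, state):
        # Every change is appended; nothing is ever removed or rewritten.
        self.commits.append({"message": message, "state": dict(state)})

    def revert_last(self):
        # "Rolling back" the latest commit means committing the state
        # that came before it as a brand-new commit.
        if len(self.commits) < 2:
            raise ValueError("nothing earlier to go back to")
        bad = self.commits[-1]
        previous = self.commits[-2]
        self.commit(f"Revert: {bad['message']}", previous["state"])

    def head(self):
        # The state a deployment tool would apply next.
        return self.commits[-1]["state"]


history = History()
history.commit("Deploy 2 web instances", {"web_instances": 2})
history.commit("Scale to 20 web instances", {"web_instances": 20})
history.revert_last()

print(len(history.commits))  # the bad commit is still part of the history
print(history.head())        # the desired state after rolling forward
```

From the deployment tool’s point of view, the third commit is just another change to apply, which is the distinction being drawn in the conversation.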

[00:17:29.850] – Ned
It’s semantically important to me just because there’s a consequence to rolling back versus rolling forward to something that’s properly working, especially when we’re thinking about data preservation and not losing anything that has changed since the last roll out. But we can move on from that. That’s my own personal.

[00:17:52.090] – Ethan
Question for you on this rollback thing. You mentioned, like, GitHub Actions, which begs the question about SaaS-based tooling versus running it myself on my own server, my own Terraform and such. Is there enough of a risk in relying on cloud-based tooling to run my infrastructure that I should really be like, what if GitHub Actions is having a bad day and I don’t want to have to rely on them? Or is that so sporadic that I should just use it and quit being a baby?

[00:18:22.930] – Tim
The value that you gain from utilizing some kind of centralized process, whether it be GitHub Actions, a CI/CD pipeline, or an infrastructure-as-code-specific automation platform that there may be out there, far outweighs the risk of that platform maybe having some kind of quirk that day. Things like role-based access control, the visibility alone, centralized state management. I mean, all of this stuff that you gain from a platform like that just far outweighs any risks like that.

[00:19:08.100] – Ethan
So you’re telling me to quit being a baby. Got it. Okay.

[00:19:10.440] – Ned

[00:19:11.710] – Tim
100%. Don’t get me wrong, us-east-1 goes down sometimes and stuff happens. Like, it is what it is. We know how that works.

[00:19:22.330] – Ned
But all of Google Cloud goes down for some strange reason.

[00:19:26.770] – Tim
Exactly. It’s just one of those things where you can make your design choices. You can try to mitigate those risks as much as humanly possible. But, yeah, centralizing your infrastructure as code deployments is well worth the risk.

[00:19:41.990] – Ned
Absolutely. One of the things we talked about earlier is sort of the merging of infrastructure Ops roles and Dev roles. What about merging the code between those two? Should I be rolling out my infrastructure tightly integrated with the app rollout, or do I want to keep those relatively separate?

[00:20:04.010] – Tim
It’s 100% up to people and process. If we just take the code, there are so many different methodologies for managing code. There’s multi-repository, there’s mono-repository. You can break things up into modules versus not. There are so many different ways to do that. The same kind of thinking applies to doing infrastructure as code. Do you keep all of your stuff in a single repository? Do you keep your infrastructure and your app code in separate repositories? It all depends. I’ve seen some folks that keep everything for a specific app inside of one repository.

[00:20:45.940] – Tim
They have their subfolder that goes down to their code structure. They have their subfolder that goes down to their infrastructure structure. So it all depends. And that’s another one of those things that you have to kind of look out for. If you’ve got a development department right now that keeps that kind of stuff separate, then keep their heads straight. Don’t just start throwing them a curveball and shove a bunch of infrastructure stuff into their repositories and mess things up. Just keep that separate and just say, hey, we’ve got another repository.

[00:21:18.290] – Tim
We’ve got a completely separate pipeline and all this kind of stuff set up for it. So you’re just basically minimizing the amount of changes that you have to do to accomplish the same thing.

[00:21:30.690] – Ethan
It’s going to depend on the infrastructure platform, too. If you throw it up onto a Kubernetes cluster, who cares? You can kind of maintain the Kubernetes cluster as a separate thing, throw your workload on there, and they really can be easily separate because the Kubernetes cluster is like an amorphous cloud that just does my compute for me. Hooray, magic. But if you’re standing up a unique container or VM or something to run your process, I would think you would want those to be more integrated because it’s a tighter relationship.

[00:22:01.470] – Tim
So let’s think about the methodology of deploying something. Let’s kind of break that down. You have your infrastructure as code somewhere. You have your application code somewhere. You have a CI/CD pipeline or some kind of automation tool that’s going to pull down and clone that repository. It’s going to run your Terraform plan, Terraform apply, Pulumi up, or what have you, and it’s going to deploy that infrastructure. After that, once that’s done, it’s got its outputs for what to connect to. It’s going to connect to that and it’s going to go through and it’s going to deploy your application code.

[00:22:32.850] – Tim
Does it really matter where that platform is pulling from either one repository or two repositories at the end of the day? Or is that just kind of a sanity decision for whoever has to manage this kind of thing? And it’s just making it easier on them.
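
The deployment flow Tim walks through can be sketched as follows. This is a minimal Python illustration under stated assumptions, not any particular platform’s implementation: `run_pipeline`, `infra_commands`, `infra_dir`, and the `deploy_app` callback are hypothetical names, the repositories are assumed to be cloned already, and only the Terraform subcommands themselves (`init`, `plan -out`, `apply`, `output -json`) come from the real CLI.

```python
import json
import subprocess


def infra_commands(plan_file="tfplan"):
    """Terraform CLI steps, in the order the pipeline runs them."""
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "plan", "-input=false", f"-out={plan_file}"],
        ["terraform", "apply", "-input=false", plan_file],
    ]


def run_pipeline(infra_dir, deploy_app, run=subprocess.run):
    # 1. Deploy the infrastructure from wherever its code was cloned.
    for cmd in infra_commands():
        run(cmd, cwd=infra_dir, check=True)

    # 2. Read the outputs so the app deployment knows what to connect to.
    result = run(["terraform", "output", "-json"], cwd=infra_dir,
                 check=True, capture_output=True, text=True)
    outputs = {name: item["value"]
               for name, item in json.loads(result.stdout).items()}

    # 3. Deploy the application code against those outputs.
    return deploy_app(outputs)
```

Note that `infra_dir` is just a path. Whether it points into the same repository as the application code or into a separate one makes no difference to the pipeline, which is the point being made above.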

[00:22:47.510] – Ethan
And you can argue that they have a separation of duties, maybe for security purposes. Let’s say it has some value.

[00:22:52.460] – Tim
Exactly right. Like does your infrastructure operations team need access to the app code? Does the app team need access to the infrastructure code? Is there a joint platform that they’re using that has shared secrets? Which means does the App Dev team have access to the cloud deployment secrets? There’s so many different things that kind of go into that, and it’s all up to the people and process to decide what the best design choices are for that. Because at the end of the day, you have some kind of tool that’s deploying one and then deploying another.

[00:23:25.480] – Tim
And it really doesn’t matter where it’s pulling from.

[00:23:28.440] – Ned
Right. To a certain degree, you’re invoking Conway’s law, which is that the structure of your organization is going to inform the structure of your code and the way that your applications look. And now that is inclusive of infrastructure. Take whatever workflow you had before, where you were doing everything manually. So, hey, the infrastructure team, they’re going to order servers, install operating systems and get them primed and ready for application deployment, and then hand it over to the app team. If that was the workflow you had, you’re probably going to replicate that exact same workflow.

[00:23:58.260] – Tim
Exactly right.

[00:24:01.890] – Ethan
Let’s talk about that, Tim, because it’s everyone’s job, always. I’ve been in those orgs where security is everybody’s job, just do your job, which means it’s no one’s job. That doesn’t work in IaC, because you’ve got to bake security in and have a proper security model. Your security posture, when you’re done deploying, is going to be all baked right in. So how do you get that done? Who owns security in an infrastructure as code world? What’s the process of deciding, composing, testing, and implementing what policy should look like?

[00:24:33.360] – Ethan
How does that all get handled in an infrastructure as code approach?

[00:24:37.470] – Tim
So at the end of the day, it’s exactly the same as if you’re just doing infrastructure and code. I mean, it all comes down to whoever is in charge of security posture, who designs the policy, who does that now. That doesn’t mean that they’re the ones that are writing the security-as-code tests. It doesn’t mean they’re the ones that are writing the Open Policy Agent policy-as-code framework stuff for compliance. It’s just that everybody needs to be involved. And we use the term shift left. I know some people are starting to hate that.

[00:25:06.310] – Tim
But really, take security in old-school enterprise infrastructure. When I was doing systems administration and then architecture and stuff like that, developers just did whatever they wanted in their own little sandboxes. And then security wasn’t involved at all until it was time to turn that over to production, at which point all of the firewall requests and all of the things came in. Infrastructure as code is a little different than that. You are constantly building, testing and deploying and doing all of these things. So security needs to be involved from day one.

[00:25:36.610] – Tim
Like, as soon as you’re building and testing, you need to make sure that security is in there, ready to go. So security needs to be involved to tell you this is what the policy should look like. Or, if you don’t have a security team and it’s everybody’s problem, whoever is designing what gets pushed to production later needs to be the one that’s like, hey, this is what we’re going to need. Let’s test this now before it deploys to production.

[00:25:59.860] – Ethan
I mean, that sentiment is no different than it was ten or 15 years ago. We got security into the conference room for the big project to start telling us about policy, and it never seemed, I don’t know. It was more like what you’re describing before, where it’s like, okay, time to go to production, let’s get those firewall requests in, et cetera, et cetera. But this, to me, is a different animal than that, practically speaking, because you’ve got different attack surfaces. You’ve got different ways you can screw up if security isn’t involved from the beginning to make sure you’re crossing your T’s and dotting your I’s.

[00:26:35.130] – Tim
There’s a lot of companies out there that are actually kind of doing the job of security without fully doing the job of security. For instance, if you take some security-as-code frameworks, you have Terrascan by Accurics, you have Checkov by Bridgecrew. There’s so many different security-as-code frameworks out there, and a lot of those have a bunch of policies and stuff like that baked into them, just based off general cloud best practice. Like, you don’t want to deploy an S3 bucket to the public where anybody can access everything by default, just in case.

[00:27:05.010] – Tim
So they have all of these different checks in there to say, hey, you might want to look at this, but they also have the ability to add that to the accepted risk category or something, where you say it’s fine, just gloss over that, we’re cool. Then you step to the other side of that aisle, where you have something like Open Policy Agent. Or if you’re using Terraform Cloud, they’ve got Sentinel policy, where that is purpose-written policy as code. Where you’re saying, I do not want any deployments to go to any regions except for these two.

[00:27:35.480] – Tim
Or I only allow certain sizes of instances. Or I do not want any open CIDR blocks inside of my VPC. You literally custom write that code to say this is what my compliance framework looks like. Anything outside of that is a failure. So you kind of have the best of both worlds. I see customers that are kind of on the bleeding edge of this using both. They’re using something like an Accurics Terrascan or what have you (tfsec is the one I was thinking of earlier), and they’re using this with all the default values of, hey, these are cloud best practices.

[00:28:15.760] – Tim
We’re making sure we’re staying in that. And then they’re utilizing, hey, this is our more specific compliance framework that we’re using to make sure that we’re staying inside of our compliance needs. Maybe we’re PCI compliant and we have to kind of keep a little bubble around it, or what have you. So it’s kind of a two-step thing for that. But if it’s just a small company and security is everybody’s problem, then a lot of these companies out there are making it a lot easier just by saying, here’s your default rule set based on cloud best practice.

[00:28:47.710] – Tim
Go for it.
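
The custom guardrails Tim lists (allowed regions, allowed instance sizes, no open CIDR blocks) can be sketched in plain Python. This is only an illustration of the policy-as-code idea: it is not Rego, Sentinel, or any tool’s real rule format, and the resource dictionaries, field names, and allowed values below are all hypothetical.

```python
# Hypothetical policy values; a real compliance framework would define its own.
ALLOWED_REGIONS = {"us-east-1", "us-west-2"}
ALLOWED_INSTANCE_TYPES = {"t3.micro", "t3.small"}
OPEN_CIDR = "0.0.0.0/0"


def violations(resources):
    """Return human-readable policy failures for a list of planned resources."""
    found = []
    for res in resources:
        name = res["name"]
        region = res.get("region")
        if region not in ALLOWED_REGIONS:
            found.append(f"{name}: region {region} is not allowed")
        instance_type = res.get("instance_type")
        if instance_type is not None and instance_type not in ALLOWED_INSTANCE_TYPES:
            found.append(f"{name}: instance type {instance_type} is not allowed")
        if OPEN_CIDR in res.get("cidr_blocks", []):
            found.append(f"{name}: open CIDR block {OPEN_CIDR} is not allowed")
    return found


# Simplified stand-in for what a pipeline might extract from a plan.
planned = [
    {"name": "web", "region": "us-east-1", "instance_type": "t3.micro",
     "cidr_blocks": ["10.0.0.0/16"]},
    {"name": "batch", "region": "eu-west-1", "instance_type": "m5.4xlarge",
     "cidr_blocks": ["0.0.0.0/0"]},
]

for problem in violations(planned):
    print(problem)
```

A pipeline would fail the run whenever `violations` returns a non-empty list, which is the "anything outside of that is a failure" behavior described above.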

[00:28:48.350] – Ned
I think what’s really interesting is our conversation is moving manual processes to automation to a certain degree. That’s what we’re talking about. Part of what infrastructure is code is meant to do. Security in the past was a very manual thing. They would write policies on, like, and then you would do a security review with the security team, and they would like, pick apart your deployed infrastructure and tell you all the things that were wrong. Now we want to automate that process. And I think what you’re talking about is almost security by policy, and these policies being things that can analyze code.

[00:29:24.690] – Ned
One of the things you mentioned in your talk was Open Policy Agent. Is that one of the ways that you’re seeing organizations apply security as a policy?

[00:29:33.570] – Tim
Yes, Open Policy Agent is absolutely different than some of the other security as code frameworks, where they’re just kind of checking against default stuff. With Open Policy Agent, you write your own stuff from scratch. They have their own coding language that they use, and you write your policy. And essentially, you’re just putting the guardrails on. You’re saying it either has to perfectly meet this criteria, or anything that fits within this criteria, or anything other than this criteria. It kind of depends on how you write it, but it just allows you to kind of put those guardrails on and say, as long as it fits between this and this, you’re off to the races, at which point in time, security doesn’t have to be involved.

[00:30:13.540] – Tim
Every single time there’s an update, every single time there’s a deployment. It just says, look, as long as this script is being matched against every time, good, you can deploy 50 times a day if you want.

[00:30:24.460] – Ned
Right. The tools you mentioned earlier, tfsec, Terrascan, et cetera, all seem to be very Terraform-centric. Is Open Policy Agent like that, or does it have more of a broad remit?

[00:30:42.150] – Tim
Open Policy Agent is, yes. It is used a lot in Terraform, but there’s a lot of different things that you can do. You can essentially write a provider for it to say, hey, I’m going to parse this Terraform plan, or I can just look into this cloud. So there’s lots of different ways you can utilize that. There are different companies out there that are essentially using Open Policy Agent to actively monitor cloud resources that are deployed, like, into said VPC or things like that. So there’s lots of different ways to utilize that.

[00:31:11.650] – Tim
If you want to use it with Pulumi, you can. That’s one of the great things about Open Policy Agent: it is open source. People are always making updates to it. They’re adding more providers. They’re writing more things for it. As opposed to, like, Sentinel from HashiCorp, which is locked into Terraform Cloud, it does give you the freedom to kind of do what you want and what you need from it.

[00:31:31.990] – Ned
Okay, so that’s a much more flexible platform, but that flexibility means you also have to do a lot of customization on your own.

[00:31:40.260] – Tim
Yeah, it’s daunting, because you get Open Policy Agent, you get the binaries, and you get nothing. Like, it’s not going to do default checks for you. If you want to figure out, how do I make sure that I’m only deploying three nodes per Kubernetes cluster, max, you’ve got to go out and find somebody who already wrote it, or you have to write it from scratch yourself. So it’s definitely kind of a different beast, as opposed to some of the other static code analysis tools, but it is much more powerful in what you can do with it.

[00:32:07.160] – Ethan
I have not worked with OPA at all. Is there a bunch of regex in there, too? It feels like it’s going to be a bunch of regex.

[00:32:14.490] – Tim
They use the Rego language, and to me at least, and I’m not a developer or what have you, it’s very unintuitive. It’s very awful. I know there are some folks that just live by it and love it and they’re fine with it. But that’s like when you find a COBOL developer and they say it’s the greatest thing ever. I don’t know if it’s Stockholm syndrome or what. It’s difficult for a lot of people, but luckily it’s been around for a while and there’s a lot of different examples out there.

[00:32:41.050] – Tim
So most likely somebody has already had the same problem that you have, and you’ll have something that you can go and copy from Stack Overflow to give you a good starting point.

[00:32:50.790] – Ned
As someone who finds JavaScript completely inscrutable, like, my brain doesn’t work that way, I think some languages just fit better with certain people. For me, JavaScript is my, that’s my Achilles’ heel.

[00:33:07.920] – Tim
C# is what broke my brain, and I realized, no, this is not for me. I can’t do it. Like, I was a Visual Basic and a VB.NET guy for a long time, and I’m like, oh, hey, C# is included with Visual Studio now, so I pulled it up and I started doing a little bit. I’m like, no, not a developer. Okay, I tried.

[00:33:25.270] – Ethan
Since we’re talking about security, Tim, I want to talk about false positives that these scanning tools might bring up, because my background is with network infrastructure and security. I’ve run a bunch of production IDS/IPS systems, and false positives are, like, the bane of your existence: getting past them so the system is actually useful to you. Do we have a similar problem here?

[00:33:43.790] – Tim
You have entire people whose job it is to tune the SIEM system and say, this is fine. Yeah, I totally understand. And it is kind of the same thing. There are a lot of things that kind of flag and say, hey, this is bad, hey, this is not. But one thing is that a lot of these tools give you the ability to either accept that risk or to add that to an exclusion list or something like that. There are some systems that do fail, where it’ll hard fail on one of those issues, even if it’s something that you accept.

[00:34:18.560] – Tim
So there are some frameworks out there, Cloudrail specifically, that will soft fail by default, which is really cool. So if you have this giant system out there that’s already running, it’s already in production, and you, say, insert Terrascan into your deployment process, it’s going to break your deployment process, because it will hard fail certain rules. It will throw an exit code 1, and your whole thing will be shot. If you, say, insert Cloudrail in there, it’ll soft fail. It will still go, and you’ll know, hey, I need to either fix this problem or add it to an exclusion list. So that’s one of those things that you need to look out for: does it have a hard fail or a soft fail?

[00:34:56.090] – Tim
Because if you’re implementing this kind of tool after your process is already kind of in place, and stuff is deployed and stuff is running and it’s in production and you need it, you need to know if putting this in there is going to be a breaking change.
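
The difference between a hard fail and a soft fail comes down to the exit code a scanning step hands back to the pipeline. The sketch below is a generic illustration, not how any particular scanner is implemented; the `gate` function, its parameters, and the rule IDs are hypothetical.

```python
def gate(findings, exclusions=frozenset(), soft_fail=False):
    """Decide the exit code for a scanning step in a deployment pipeline.

    findings:   rule IDs reported by a scanner (hypothetical strings).
    exclusions: rule IDs the team has explicitly accepted as risk.
    soft_fail:  when True, report findings but never block the deployment.
    """
    # Accepted risks are dropped before any pass/fail decision is made.
    remaining = [rule for rule in findings if rule not in exclusions]
    if remaining and not soft_fail:
        return 1, remaining  # hard fail: non-zero exit stops the pipeline
    return 0, remaining      # pass, or soft fail with findings still reported
```

With `soft_fail=True` the step can be dropped into an existing production pipeline without breaking it, and the returned findings tell you what to fix or add to the exclusion list.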

[00:35:09.190] – Ethan
Another pitfall you mentioned in your talk, Tim, was... well, you solved that with this concept of DRY, D-R-Y, the acronym. Would you explain the pitfall, and then what DRY is all about and how that makes my life better?

[00:35:21.940] – Tim
Yes. So this is a pitfall from the coding side of the house versus the infrastructure side of the house. DRY is a methodology from programming and development called Don’t Repeat Yourself, and it’s essentially creating smaller, more repeatable code. With infrastructure as code, let’s say you have your infrastructure and you have everything built, but you have multiple versions. You have production, you have staging, you have testing, you have QA, you have all the things. Well, if you build your code with all of these values written into there, like hard-coded values.

[00:35:57.970] – Tim
This is the name of my database, this is the IP range, this is the VPC it connects to, then you can’t use that code again for staging or for any other environments. You have to have multiple copies of the same code. With the DRY methodology, you essentially go down and, instead of putting in that name, make that a variable call, and then inject that variable in during deployment, saying, hey, since this is my production deployment, this is the production name for that or what have you.

[00:36:29.560] – Tim
And that way you’re only managing one set of code for all of these different environments. Because if you have five different sets of code for the same build in different environments and you have to update that code, you have to go and update five copies of that same code, if that’s how you want to do it. Using DRY will kind of allow you to boil that down a little more so that you only have one set of code to manage. And you can actually take that a step farther.

[00:36:54.770] – Tim
With things like Terraform modules or creating Pulumi modules, you’re actually breaking your code up even farther into more manageable bits. Instead of having your VPC and your EC2 instances and your RDS instances and your IAM policies and everything all in the same files, you break those up into modules. So say I know I’m going to use my VPC across 20 different environments. Great. Manage one VPC code and use it on those 20 environments. So it just allows you to have little smaller chunks that you can reuse.
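
To make the DRY idea concrete, here is a minimal sketch in plain Python rather than Terraform or Pulumi: one definition of the stack, with the per-environment values injected instead of hard-coded. The environment names, settings, and the `build_stack` function are assumptions invented for the example.

```python
# Hypothetical per-environment values; only these differ between deployments.
ENVIRONMENTS = {
    "production": {"db_name": "orders-prod", "cidr": "10.0.0.0/16", "instance_count": 3},
    "staging": {"db_name": "orders-staging", "cidr": "10.1.0.0/16", "instance_count": 1},
}


def build_stack(env_name, environments=ENVIRONMENTS):
    """Build one stack description from shared code plus injected variables."""
    settings = environments[env_name]
    vpc_name = f"{env_name}-vpc"
    return {
        "vpc": {"name": vpc_name, "cidr": settings["cidr"]},
        "database": {"name": settings["db_name"], "vpc": vpc_name},
        "instances": [
            f"{env_name}-app-{i}" for i in range(1, settings["instance_count"] + 1)
        ],
    }
```

Calling `build_stack("staging")` and `build_stack("production")` yields the same shape with different values, so a fix to the shared definition lands in every environment at once.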

[00:37:31.390] – Ethan
Which can be a change of thinking for an infrastructure professional that’s used to doing some long linear process to deploy some complex set of infrastructure. They have a playbook, or they have some process they follow. Okay, now I’ve got to break this all up modularly, so I can snap together a process that’s more like Lego bricks. And then the next time I’ve got to fix the thing, I’ve only got to fix the thing in this one place, and then the eight other places that call that one place are already fixed for me.

[00:37:58.350] – Tim
Yeah, it’s a different way of thinking from an infrastructure perspective, but it’s also a place where a lot of infrastructure folks were trying to get to way back when I was an engineer. I was working with customers. I was a VMware infrastructure engineer specifically, and we had vRealize Automation, and we were trying to build reusable templates of the deployment architecture so that we could essentially just say, hey, we need a new development environment, hit the button that spins everything out, and you’re good to go. It’s the same kind of thing.

[00:38:33.480] – Tim
You’re trying to create a more repeatable process. But instead of creating these giant templates or what have you, you’re just creating smaller bits of code that you can spin out when you need them.

[00:38:45.240] – Ethan
Do I have to DRY my code by hand, or is there a tool? There’s always a tool. Is there some tool or tools out there that will help DRY the code for me?

[00:38:53.870] – Tim
Yeah. Usually there’s some tools out there. They may not be perfect. They may miss some things. They may screw some things up. So it’ll take a little bit of manual intervention. For a lot of folks that are starting from scratch, it’s easier just to kind of write DRY code from the beginning. If you have code out there already, of course, there are some tools out there. I have not personally used one for that, but things like Terraform kind of make it a little easy to pull certain things out and make variables and then make a variable file out of that.

[00:39:28.440] – Tim
So I’ve seen it mostly just done by hand, but it all depends.

[00:39:33.010] – Ned
I want to drill into another aspect of IaC, which is kind of the amount of code that I should have tied to a particular set of state data, whether I’m using Terraform or something else. What is the optimal size for a single deployment? Because the larger my deployment gets, the more that has to be checked and changed every time I make a small change, the more that has to be tested, and the more that could potentially break if something goes a little sideways. Is there a pattern or an anti-pattern that you see along those lines?

[00:40:09.430] – Tim
Yeah. This goes back to that DRY methodology again. If you’re breaking up your code into these more repeatable bits, you’re also able to then break your state files up. If you need to update your EC2 instances from two to three, but all of your stuff is in one giant state file, it’s going to have to check every single one of those things. It’s going to take forever every single time. If you have your stuff broken up into those reusable bits or modules or what have you, then it’s only going to have to check what’s in that module.

[00:40:39.410] – Tim
Now, you may need to work with your automation platform to chain that together, because every time you redeploy your EC2 instances, you need to redeploy your VPC or what have you. But it’ll make those processes a lot faster instead of having to check a whole bunch of stuff that doesn’t need to be checked every single time.

[00:40:56.390] – Ned
So to a certain degree, personally, I think of it kind of in layers. What is in each layer that absolutely has to go together for it to make cohesive sense? And then what’s the next layer that’s going to use the layer below it? I think you sort of alluded to it. I’ve got my network layer, where I’m laying down the VPC and stuff like that. And then my next layer might be the EC2 instances that are going to run my application. Those two layers need to interact, but they don’t need to be part of the same state file or even necessarily the same folder of code.

[00:41:25.670] – Tim
Exactly right. And it’s one of those things where if your VPC is very static and it doesn’t change very often, you don’t want to be having to mess with it all the time. You may want to just shove it off in its own folder, deploy it out there and be done with it for the next six months, because your EC2 instances may change a lot or your RDS instances or what have you may change a lot. It’s just one of those things where if you kind of break that off into its own piece, you put it where it needs to go.

[00:41:50.540] – Tim
That’s one less thing that you have to worry about every single time you’re making deployments or changes or what have you, right?
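
The layering Ned and Tim describe can be sketched as a small dependency graph: each layer keeps its own state, and a change only forces a check of that layer plus whatever sits on top of it. This is an illustration under assumed names; the `LAYERS` map and `layers_to_check` function are hypothetical and not part of any IaC tool.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical layers; each key maps to the layers it depends on.
LAYERS = {
    "network": set(),          # VPC and friends, changes rarely
    "database": {"network"},   # RDS instances
    "compute": {"network"},    # EC2 instances, changes often
}


def layers_to_check(changed, layers=LAYERS):
    """Return the changed layer plus everything downstream, in deploy order."""
    affected = {changed}
    grew = True
    while grew:  # keep adding layers that depend on an affected layer
        grew = False
        for name, deps in layers.items():
            if name not in affected and deps & affected:
                affected.add(name)
                grew = True
    # static_order() yields dependencies before the layers that use them.
    order = TopologicalSorter(layers).static_order()
    return [name for name in order if name in affected]
```

Changing only the compute layer leaves the network and database state untouched, while a network change pulls in both layers built on top of it, which is the chaining an automation platform would need to handle.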

[00:41:56.500] – Ned
And with every deployment you’re going to be doing testing. Maybe we should end on that. What is your approach and advice when it comes to testing infrastructure as code? Because I know there’s a lot of philosophies out there. It’s a hard thing to do. So how do you do it?

[00:42:13.730] – Tim
Yeah. There’s lots of different ways you can do this. There are tools out there that will help you do build tests to make sure that your code is written correctly, just like any other code. There are linters out there that will make sure that your syntax is correct, that everything is written the way it should be. Then you’re using your security as code tools to make sure that you’re not deploying an open S3 bucket or something like that. But it’s just using the regular development methodologies of, hey, we’re going to deploy this offline.

[00:42:44.880] – Tim
We’re going to do a little bit of testing. We’re going to make sure it works and doesn’t break. And then going through and deploying, that’s traditional software development lifecycle stuff. That doesn’t change just because it’s infrastructure as code. It doesn’t mean that you don’t need to deploy an instance of this out, do a little bit of QA, and make sure it’s working before you actually convert that over to production.

[00:43:04.980] – Ned
To what degree can I trust the APIs to actually deploy the thing that I asked for? And when should I verify that it actually did what I intended?

[00:43:13.980] – Tim
Yeah. It’s regular development methodology. When you create code and you’re building stuff, you have unit tests and stuff like that inside of the deployment to make sure that that’s working. It’s exactly the same with infrastructure as code: deploying stuff out, making sure that you’ve got some kind of verification in there before you actually turn that over.
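
One simple form of the verification Tim mentions is to read back what was actually deployed and compare it with what was requested. The sketch below assumes the deployed settings have already been fetched into a dictionary; the `verify` function and the field names are hypothetical, and no real cloud API is called.

```python
def verify(requested, deployed):
    """Compare requested settings with what a read-back of the deployment reports.

    Both arguments are plain dicts; in practice `deployed` would come from
    querying the cloud provider after the deployment finishes (not shown here).
    Returns {setting: (wanted, got)} for every setting that does not match.
    """
    mismatches = {}
    for key, wanted in requested.items():
        got = deployed.get(key)  # None when the setting is missing entirely
        if got != wanted:
            mismatches[key] = (wanted, got)
    return mismatches
```

An empty result is the signal that it is safe to hand the environment over; anything else is worth a look before production traffic arrives.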

[00:43:33.810] – Ethan
Well, Tim, this has been an outstanding conversation. I’m glad that we got to have it, because I heard you at All Day DevOps, and it was, here’s a brain dump in 20 minutes of all of these things, and we got to flesh this out more on the show.

[00:43:47.870] – Tim
I really appreciate you all having me and giving me the opportunity to talk about it a little more. Sometimes with these virtual conferences, it absolutely is a whole lot of information. It’s even worse when it’s a ten minute lightning talk and it’s just like, Wait, what? I definitely appreciate it.

[00:44:04.140] – Ethan
Tim, how do people follow you on the Internet?

[00:44:05.870] – Tim
Yes, the best way to get me is on Twitter. I am @vTimD on Twitter. You could have my cell phone and my email, but I will still probably answer you on Twitter faster.

[00:44:13.860] – Ethan
Very good. @vTimD on Twitter. Much appreciated. And again, thanks, Tim, for your time, and thanks to you out there for listening. Virtual high fives to you for tuning in. If you have suggestions for future shows, people you want Ned and I to interview, or topics you want us to cover, we would love to hear all of your ideas. Hit either of us up on Twitter. We’re @DayTwoCloudShow. Or fill out the form on Ned’s fancy website, nedinthecloud.com. A little housekeeping for you.

[00:44:39.330] – Ethan
Did you know that you don’t have to scream into the technology void alone? The Packet Pushers Podcast Network has a free Slack group that is open to everybody. Visit packetpushers.net/slack and join. That Slack group is a marketing-free zone for engineers to chat and compare notes and tell war stories and solve problems together. Again, packetpushers.net/slack. No strings attached at all. Until then, just remember: cloud is what happens while IT is making other plans.
